
Web Scraping Without Getting Blocked: 12 Techniques That Work

Stop getting 403s and CAPTCHAs. Learn 12 proven techniques for web scraping without getting blocked — from header rotation to residential proxies to behavioral mimicry.

SimpleCrawl Team | March 1, 2026 | 11 min read

Why Websites Block Scrapers (and How to Stop It)

Getting blocked is one of the biggest challenges developers face when extracting data at scale. You write a scraper that works perfectly in testing, deploy it, and within hours you're drowning in 403 errors, CAPTCHAs, and IP bans.

Quick answer: Websites block scrapers by analyzing request patterns, headers, IP reputation, and browser fingerprints. The 12 techniques below address each detection vector — from simple header rotation to advanced behavioral mimicry.

Understanding why you're being blocked is the first step to avoiding it. Modern anti-bot systems like Cloudflare, DataDome, PerimeterX, and Akamai Bot Manager use multiple signals simultaneously:

  • IP reputation — Is this IP address known for automated traffic?
  • Request patterns — Are requests coming too fast, too regularly, or in unnatural sequences?
  • HTTP headers — Do the headers match a real browser, or are they missing/wrong?
  • TLS fingerprint — Does the TLS handshake match the claimed browser?
  • JavaScript execution — Can the client run JavaScript challenges?
  • Browser fingerprint — Does the browser environment look legitimate?
  • Behavioral analysis — Does the user interact like a human (mouse movements, scroll patterns)?

No single technique defeats all of these. Effective anti-detection requires a layered approach. Here are 12 techniques, ordered from simplest to most advanced.

1. Use Realistic HTTP Headers

The most basic detection method is checking HTTP headers. Many scrapers use default libraries that send obviously non-browser headers.

Bad — default Python requests headers:

import requests

# With no headers argument, requests sends a User-Agent such as "python-requests/2.31.0"
response = requests.get("https://target.com")

Good — full browser header set:

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
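    # Advertise "br" only if the brotli package is installed; without it requests cannot decode Brotli responses
    "Accept-Encoding": "gzip, deflate, br",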
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
    "Cache-Control": "max-age=0",
}

response = requests.get("https://target.com", headers=headers)

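The Sec-Fetch-* headers are particularly important. Browsers attach them automatically to describe the context of each request (a top-level navigation versus a script-initiated fetch, same-site versus cross-site), and many anti-bot systems check that they are present and consistent with the rest of the request.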

2. Rotate User Agents

Using a single User-Agent for all requests is a red flag. Rotate through a realistic set:

import random

USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.3 Safari/605.1.15",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:123.0) Gecko/20100101 Firefox/123.0",
]

def get_headers():
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",  # "br" needs the brotli package installed
    }

Keep your User-Agent list up to date. Chrome 90 in 2026 is a dead giveaway that the request isn't from a real browser.
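One way to catch stale entries is to filter the pool by browser version before using it. The sketch below is a minimal illustration, not a complete User-Agent parser: `chrome_major`, `drop_stale_chrome`, and the `min_major` cutoff are hypothetical names, and the cutoff value is something you would choose and update yourself.

```python
import re
from typing import List, Optional

# Captures the major version in a Chrome-style token, e.g. "Chrome/122.0.0.0" -> 122
_CHROME_VERSION = re.compile(r"Chrome/(\d+)\.")

def chrome_major(user_agent: str) -> Optional[int]:
    """Return the Chrome major version in a User-Agent, or None if there is no Chrome token."""
    match = _CHROME_VERSION.search(user_agent)
    return int(match.group(1)) if match else None

def drop_stale_chrome(user_agents: List[str], min_major: int) -> List[str]:
    """Keep entries without a Chrome token and Chrome entries at or above min_major."""
    fresh = []
    for ua in user_agents:
        major = chrome_major(ua)
        if major is None or major >= min_major:
            fresh.append(ua)
    return fresh
```

Firefox and Safari strings pass through untouched here, so they would need their own version checks.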

3. Implement Request Rate Limiting

Scraping at machine speed is the most common reason for blocks. No human browses 100 pages per second.

import time
import random

def polite_scrape(urls: list[str], min_delay: float = 1.0, max_delay: float = 3.0):
    """Scrape with randomized delays between requests."""
    results = []

    for url in urls:
        response = requests.get(url, headers=get_headers())
        results.append(response)

        # Randomized delay to avoid pattern detection
        delay = random.uniform(min_delay, max_delay)
        time.sleep(delay)

    return results

Rate Limiting Guidelines

| Site Type | Recommended Delay | Max Concurrent |
|---|---|---|
| Small blogs/personal sites | 3-5 seconds | 1 |
| Medium business sites | 1-3 seconds | 2-3 |
| Large platforms (Amazon, etc.) | 2-5 seconds | 1-2 |
| Public APIs with rate limits | Follow their limits | As allowed |

Random delays are critical. A perfectly regular 2.0-second interval between requests looks just as automated as no delay at all.
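The guideline ranges can be kept in one place and sampled per request. This is a rough sketch: the `DELAY_PROFILES` keys ("small", "medium", "large") and the `pick_delay` helper are hypothetical names, and the numbers are copied from the table above rather than measured.

```python
import random

# (min_delay_seconds, max_delay_seconds, max_concurrent) per site type
DELAY_PROFILES = {
    "small": (3.0, 5.0, 1),
    "medium": (1.0, 3.0, 3),
    "large": (2.0, 5.0, 2),
}

def pick_delay(site_type: str, rng=None) -> float:
    """Draw a random delay in seconds from the range for this site type."""
    if rng is None:
        rng = random
    low, high, _max_concurrent = DELAY_PROFILES[site_type]
    return rng.uniform(low, high)
```

Passing a seeded `random.Random` as `rng` makes the delays reproducible in tests while production code keeps the default generator.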

4. Use Proxy Rotation

If you're sending all requests from a single IP address, the site only needs to block one IP to stop you. Proxy rotation distributes requests across many IPs.

Types of Proxies

Datacenter proxies — Cheap ($1-5/GB) but easily detected. IP ranges are known to belong to cloud providers.

Residential proxies — More expensive ($5-15/GB) but much harder to detect. These are real ISP IPs from real households.

Mobile proxies — Most expensive ($20-40/GB) but nearly impossible to block. Blocking a mobile carrier IP range would block real users.

import itertools

PROXIES = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
]

proxy_cycle = itertools.cycle(PROXIES)

def scrape_with_proxy(url: str) -> requests.Response:
    proxy = next(proxy_cycle)
    return requests.get(
        url,
        headers=get_headers(),
        proxies={"http": proxy, "https": proxy},
        timeout=30,
    )

Sticky Sessions vs Rotating

Use rotating proxies (new IP per request) for scraping different pages across a site. Use sticky sessions (same IP for a period) when you need to maintain login state or navigate multi-page flows.
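The difference between the two modes can be sketched on the client side. Commercial providers often implement sticky sessions on their end, so treat this as an illustration only; the `ProxyPool` class and its method names are hypothetical.

```python
import itertools
from typing import Dict, List

class ProxyPool:
    """Hand out proxies either per request (rotating) or per session (sticky)."""

    def __init__(self, proxies: List[str]):
        self._cycle = itertools.cycle(proxies)
        self._sticky: Dict[str, str] = {}

    def rotating(self) -> str:
        """Return the next proxy in the cycle, a different one on every call."""
        return next(self._cycle)

    def sticky(self, session_id: str) -> str:
        """Return the same proxy for every call that shares session_id."""
        if session_id not in self._sticky:
            self._sticky[session_id] = next(self._cycle)
        return self._sticky[session_id]

    def release(self, session_id: str) -> None:
        """Forget a sticky assignment once the multi-page flow is finished."""
        self._sticky.pop(session_id, None)
```

A login flow would call `pool.sticky("login")` for each step and `pool.release("login")` at the end, while independent page fetches call `pool.rotating()`.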

5. Respect robots.txt (Strategically)

The robots.txt file tells crawlers which pages they're allowed to access. Respecting it keeps you on the right side of terms of service and reduces the chance of blocks:

from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def check_robots(url: str, user_agent: str = "*") -> bool:
    """Check if a URL is allowed by robots.txt."""
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"

    rp = RobotFileParser()
    rp.set_url(robots_url)
    rp.read()  # Downloads and parses robots.txt (one network request per call)

    return rp.can_fetch(user_agent, url)

Also check for Crawl-delay directives in robots.txt — some sites specify a minimum delay between requests.
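The standard library parser exposes that directive through `crawl_delay()`. The sketch below parses a made-up robots.txt from a string so it runs without network access; `SAMPLE_ROBOTS`, `load_rules`, and `delay_for` are hypothetical names, and the 1-second fallback is an assumption.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content, parsed from memory so the sketch needs no network
SAMPLE_ROBOTS = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""

def load_rules(robots_text: str) -> RobotFileParser:
    """Build a parser from robots.txt text that has already been downloaded."""
    rp = RobotFileParser()
    rp.parse(robots_text.splitlines())
    return rp

def delay_for(rp: RobotFileParser, user_agent: str = "*", default: float = 1.0) -> float:
    """Use the site's Crawl-delay when present, otherwise fall back to default."""
    delay = rp.crawl_delay(user_agent)
    return float(delay) if delay is not None else default
```

Feeding `delay_for(...)` into the minimum delay of a scraping loop keeps the crawler at or below the pace the site asked for.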

6. Handle Cookies and Sessions Properly

Anti-bot systems set cookies to track and challenge visitors. Discarding cookies between requests looks suspicious.

import requests

session = requests.Session()

# First request: get cookies
session.get("https://target.com", headers=get_headers())

# Subsequent requests: cookies are automatically included
response = session.get("https://target.com/products", headers=get_headers())

Some sites set JavaScript cookies — cookies that can only be created by executing a JavaScript snippet. These require a real browser environment to handle.

7. Rotate TLS Fingerprints

Advanced anti-bot systems fingerprint the TLS handshake itself. Different HTTP libraries produce different TLS signatures, and these can be compared against the claimed User-Agent.

Python's requests library has a distinctive TLS fingerprint that doesn't match any real browser. Libraries like curl_cffi and tls-client solve this:

from curl_cffi import requests as cf_requests

# Impersonates Chrome's TLS fingerprint
response = cf_requests.get(
    "https://target.com",
    impersonate="chrome",
)

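This matches Chrome's cipher suites, extensions, and handshake order, which makes the request much harder to tell apart from real Chrome traffic at the TLS layer. Pair it with a Chrome User-Agent, since a fingerprint that contradicts the claimed browser is itself a signal.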

8. Handle CAPTCHAs Gracefully

When you hit a CAPTCHA, you have several options:

Retry with a different IP — Often the CAPTCHA is triggered by IP reputation, not behavior. A new IP from a residential proxy may bypass it.

CAPTCHA solving services — Services like 2Captcha and Anti-Captcha use human workers to solve CAPTCHAs. Response times are 10-30 seconds and costs are $1-3 per 1000 solves.

Avoid them entirely — The best CAPTCHA strategy is not triggering one. All the other techniques in this guide reduce CAPTCHA frequency.

def scrape_with_captcha_retry(url: str, max_retries: int = 3) -> str:
    """Retry through the proxy pool until a response without a CAPTCHA comes back."""
    for attempt in range(max_retries):
        proxy = next(proxy_cycle)  # New IP for every attempt
        response = requests.get(
            url,
            headers=get_headers(),
            proxies={"http": proxy, "https": proxy},
            timeout=30,
        )

        # Crude check: a 200 whose body never mentions "captcha" counts as success
        if response.status_code == 200 and "captcha" not in response.text.lower():
            return response.text

        time.sleep(random.uniform(2, 5))

    raise RuntimeError(f"Failed to scrape {url} after {max_retries} attempts")
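The success check in the loop above can be pulled into a helper that also treats common block status codes as failures. This is a heuristic sketch: `looks_blocked` is a hypothetical name, and the marker strings are assumptions that should be tuned to the challenge pages you actually receive.

```python
# Status codes that commonly accompany blocks or rate limits
BLOCK_STATUS_CODES = {403, 429, 503}

# Hypothetical marker strings; replace with text from real challenge pages
BLOCK_MARKERS = ("captcha", "verify you are human", "unusual traffic")

def looks_blocked(status_code: int, body: str) -> bool:
    """Return True when a response looks like a block or challenge page."""
    if status_code in BLOCK_STATUS_CODES:
        return True
    lowered = body.lower()
    return any(marker in lowered for marker in BLOCK_MARKERS)
```

Substring matching can misfire on ordinary pages that merely embed a CAPTCHA widget (a login form, for example), so it is worth logging what gets flagged before trusting it.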

9. Mimic Real Browser Navigation Patterns

Real users don't jump directly to deep product pages. They land on the homepage, browse categories, and navigate through results. Mimicking this pattern can bypass behavioral analysis:

def mimic_browsing_session(target_urls: list[str]) -> list:
    session = requests.Session()
    results = []

    # Start with the homepage
    session.get("https://target.com", headers=get_headers())
    time.sleep(random.uniform(2, 4))

    for url in target_urls:
        response = session.get(url, headers={
            **get_headers(),
            "Referer": "https://target.com",
        })
        results.append(response)
        time.sleep(random.uniform(1, 3))

    return results

The Referer header is important — it tells the server where the request came from. A direct request to a product page with no Referer looks like bot traffic.
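In the example above every request claims to come from the homepage. A real click trail usually has each page referring to the one visited just before it, which can be sketched as a small planning step. The `with_referers` helper is a hypothetical name, and it assumes the URLs are listed in the order a visitor would click through them.

```python
from typing import List, Tuple

def with_referers(start_url: str, urls: List[str]) -> List[Tuple[str, str]]:
    """Pair each URL with the page visited just before it."""
    pairs = []
    previous = start_url
    for url in urls:
        pairs.append((url, previous))
        previous = url  # The page just fetched becomes the next Referer
    return pairs
```

Each `(url, referer)` pair can then be fetched with `session.get(url, headers={**get_headers(), "Referer": referer})`.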

10. Use Headless Browsers with Stealth Plugins

When simple HTTP requests aren't enough, headless browsers with stealth modifications can bypass JavaScript challenges:

from playwright.sync_api import sync_playwright

def stealth_scrape(url: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(
            headless=True,
            args=[
                "--disable-blink-features=AutomationControlled",
                "--no-sandbox",
            ]
        )

        context = browser.new_context(
            user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
            viewport={"width": 1920, "height": 1080},
            locale="en-US",
            timezone_id="America/New_York",
        )

        page = context.new_page()

        # Override automation detection
        page.add_init_script("""
            Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
            Object.defineProperty(navigator, 'plugins', { get: () => [1, 2, 3] });
            window.chrome = { runtime: {} };
        """)

        page.goto(url, wait_until="networkidle")
        content = page.content()
        browser.close()

    return content

This bypasses basic detection, but advanced systems like DataDome and PerimeterX check dozens of signals that are hard to fake consistently.

11. Distribute Requests Geographically

Anti-bot systems track the geographic distribution of requests. Getting 1000 requests from the same city within an hour is suspicious. Using proxies in multiple regions makes traffic look more natural:

GEO_PROXIES = {
    "us": "http://user:pass@us-proxy.example.com:8080",
    "uk": "http://user:pass@uk-proxy.example.com:8080",
    "de": "http://user:pass@de-proxy.example.com:8080",
    "jp": "http://user:pass@jp-proxy.example.com:8080",
}

def scrape_with_geo_rotation(url: str) -> requests.Response:
    region = random.choice(list(GEO_PROXIES.keys()))
    proxy = GEO_PROXIES[region]

    # Match Accept-Language to the proxy region
    lang_map = {"us": "en-US,en;q=0.9", "uk": "en-GB,en;q=0.9", "de": "de-DE,de;q=0.9", "jp": "ja-JP,ja;q=0.9"}

    headers = get_headers()
    headers["Accept-Language"] = lang_map.get(region, "en-US,en;q=0.9")

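    # Route both schemes through the proxy; with only "https" set, plain-http URLs would bypass it
    return requests.get(url, headers=headers, proxies={"http": proxy, "https": proxy}, timeout=30)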

Match your Accept-Language header and timezone to your proxy's location. A request claiming to be from Tokyo with a US English Accept-Language raises flags.
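Keeping language and timezone in one table per region helps the two stay in sync. The sketch below is an assumption-laden illustration: `REGION_PROFILES`, `accept_language`, and `context_options` are hypothetical names, and the timezone chosen for each region is just one plausible value.

```python
# One locale and one IANA timezone per proxy region (illustrative choices)
REGION_PROFILES = {
    "us": {"locale": "en-US", "timezone_id": "America/New_York"},
    "uk": {"locale": "en-GB", "timezone_id": "Europe/London"},
    "de": {"locale": "de-DE", "timezone_id": "Europe/Berlin"},
    "jp": {"locale": "ja-JP", "timezone_id": "Asia/Tokyo"},
}

def accept_language(region: str) -> str:
    """Build an Accept-Language value such as 'de-DE,de;q=0.9' for plain HTTP clients."""
    locale = REGION_PROFILES[region]["locale"]
    language = locale.split("-")[0]
    return f"{locale},{language};q=0.9"

def context_options(region: str) -> dict:
    """Return locale and timezone_id keyword arguments for a Playwright browser context."""
    return dict(REGION_PROFILES[region])
```

With Playwright, `browser.new_context(**context_options("jp"))` applies both settings at once; with `requests`, `accept_language("jp")` supplies the header value.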

12. Use a Managed Scraping API

The most effective anti-blocking technique is to not build anti-detection yourself. A managed scraping API bundles all of the above techniques — and keeps them updated as anti-bot systems evolve.

SimpleCrawl handles the full anti-blocking stack:

import requests

# All anti-detection is handled automatically
response = requests.post(
    "https://api.simplecrawl.com/v1/scrape",
    headers={"Authorization": "Bearer sc_your_api_key"},
    json={
        "url": "https://heavily-protected-site.com/products",
        "format": "markdown"
    }
)

# Clean data, no blocks
data = response.json()
print(data["markdown"])

Under the hood, SimpleCrawl manages:

  • Residential proxy rotation across 195+ countries
  • TLS fingerprint matching per browser identity
  • Automatic CAPTCHA solving
  • Cookie and session management
  • JavaScript rendering with stealth browser profiles
  • Automatic retry with different proxy/fingerprint combinations

This is the approach recommended for production scrapers. Building and maintaining your own anti-detection infrastructure is a full-time job — and the anti-bot companies are updating their detection faster than you can update your evasion.

Anti-Detection Checklist

Before deploying a scraper to production, verify each layer:

  • Headers include a recent, realistic User-Agent
  • Sec-Fetch-* headers are present and correct
  • User-Agent rotation is enabled
  • Request delays are randomized (not fixed intervals)
  • Proxy rotation is configured (residential for sensitive targets)
  • TLS fingerprint matches the claimed User-Agent
  • Cookies are preserved across requests in a session
  • Referer header is set correctly
  • robots.txt has been reviewed
  • CAPTCHA retry logic is implemented
  • Geographic distribution matches target audience
  • Error handling doesn't retry infinitely (which looks like a DDoS)

Ethical Considerations

Scraping responsibly means:

  • Respect rate limits — Don't overload servers. Your scraper is a guest on their infrastructure.
  • Check Terms of Service — Some sites explicitly prohibit scraping. Know the legal landscape in your jurisdiction.
  • Don't scrape personal data — GDPR, CCPA, and similar regulations apply to scraped data too.
  • Identify yourself — For large-scale crawling, consider adding contact info to your User-Agent string.
  • Cache aggressively — Don't re-scrape pages that haven't changed. It's wasteful for you and the target.

FAQ

Why am I getting 403 Forbidden when scraping?

A 403 means the server is actively blocking your request. The most common causes are: missing or suspicious User-Agent header, IP address blacklisting, missing cookies from a prior JavaScript challenge, or your TLS fingerprint not matching the claimed browser. Start by fixing headers and adding proxy rotation.

How do I scrape Cloudflare-protected websites?

Cloudflare uses JavaScript challenges, CAPTCHAs, and browser fingerprinting. To bypass Cloudflare: use a real browser (Playwright with stealth), use residential proxies, solve the JavaScript challenge on first visit and preserve the cf_clearance cookie, or use a scraping API like SimpleCrawl that handles Cloudflare automatically.

Is web scraping legal?

In the US, the Ninth Circuit's rulings in hiQ Labs v. LinkedIn held that scraping publicly available data likely does not violate the Computer Fraud and Abuse Act. However, violating a site's Terms of Service can still expose you to breach-of-contract claims. EU law is more nuanced because of GDPR. Consult a lawyer for your specific use case.

How many requests per second can I send without getting blocked?

There's no universal answer — it depends on the target site's anti-bot sensitivity. Conservative starting points: 1 request every 2-3 seconds for a single IP, or 5-10 requests per second across a pool of 50+ rotating proxies. Monitor for 429 (Too Many Requests) and 403 responses and back off when they appear.
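Backing off can be as simple as waiting longer after each consecutive 429 or 403. The sketch below uses exponential backoff with jitter; `backoff_delay` is a hypothetical helper, the `base` and `cap` defaults are assumptions, and it only understands a Retry-After value given in seconds (not the HTTP-date form).

```python
import random

def backoff_delay(attempt: int, retry_after=None, base: float = 2.0,
                  cap: float = 120.0, rng=None) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    if rng is None:
        rng = random
    if retry_after is not None:
        # Honor a Retry-After value in seconds, but never wait longer than the cap
        return min(float(retry_after), cap)
    # Exponential ceiling: base, 2*base, 4*base, ... limited by the cap
    ceiling = min(cap, base * (2 ** attempt))
    # Jitter between half the ceiling and the ceiling so retries do not line up
    return rng.uniform(ceiling / 2, ceiling)
```

A caller would pass `response.headers.get("Retry-After")` when the header is present and stop retrying entirely after a fixed number of attempts.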

Do I need residential proxies or are datacenter proxies enough?

For sites without aggressive anti-bot protection (small businesses, blogs, government sites), datacenter proxies work fine. For large commercial sites (Amazon, LinkedIn, Google) and sites using Cloudflare, DataDome, or PerimeterX, residential proxies are essentially required. Mobile proxies are the nuclear option for the most heavily protected targets.
