Web Scraping Without Getting Blocked: 12 Techniques That Work
Stop getting 403s and CAPTCHAs. Learn 12 proven techniques for web scraping without getting blocked — from header rotation to residential proxies to behavioral mimicry.
Why Websites Block Scrapers (and How to Stop It)
Web scraping without getting blocked is the single biggest challenge developers face when extracting data at scale. You write a scraper that works perfectly in testing, deploy it, and within hours you're drowning in 403 errors, CAPTCHAs, and IP bans.
Quick answer: Websites block scrapers by analyzing request patterns, headers, IP reputation, and browser fingerprints. The 12 techniques below address each detection vector — from simple header rotation to advanced behavioral mimicry.
Understanding why you're being blocked is the first step to avoiding it. Modern anti-bot systems like Cloudflare, DataDome, PerimeterX, and Akamai Bot Manager use multiple signals simultaneously:
- IP reputation — Is this IP address known for automated traffic?
- Request patterns — Are requests coming too fast, too regularly, or in unnatural sequences?
- HTTP headers — Do the headers match a real browser, or are they missing/wrong?
- TLS fingerprint — Does the TLS handshake match the claimed browser?
- JavaScript execution — Can the client run JavaScript challenges?
- Browser fingerprint — Does the browser environment look legitimate?
- Behavioral analysis — Does the user interact like a human (mouse movements, scroll patterns)?
No single technique defeats all of these. Effective anti-detection requires a layered approach. Here are 12 techniques, ordered from simplest to most advanced.
1. Use Realistic HTTP Headers
The most basic detection method is checking HTTP headers. Many scrapers use default libraries that send obviously non-browser headers.
Bad — default Python requests headers:
```python
import requests

# requests announces itself: the default User-Agent is "python-requests/2.31.0"
response = requests.get("https://target.com")
```
Good — full browser header set:
```python
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
    "Cache-Control": "max-age=0",
}
response = requests.get("https://target.com", headers=headers)
```
The Sec-Fetch-* headers are particularly important — they were introduced specifically to distinguish navigation requests from programmatic fetches, and many anti-bot systems check for them.
2. Rotate User Agents
Using a single User-Agent for all requests is a red flag. Rotate through a realistic set:
```python
import random

USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.3 Safari/605.1.15",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:123.0) Gecko/20100101 Firefox/123.0",
]

def get_headers() -> dict:
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
    }
```
Keep your User-Agent list up to date. Chrome 90 in 2026 is a dead giveaway that the request isn't from a real browser.
3. Implement Request Rate Limiting
Scraping at machine speed is the most common reason for blocks. No human browses 100 pages per second.
```python
import random
import time

import requests

def polite_scrape(urls: list[str], min_delay: float = 1.0, max_delay: float = 3.0) -> list[requests.Response]:
    """Scrape with randomized delays between requests."""
    results = []
    for url in urls:
        response = requests.get(url, headers=get_headers())
        results.append(response)
        # Randomized delay to avoid pattern detection
        time.sleep(random.uniform(min_delay, max_delay))
    return results
```
Rate Limiting Guidelines
| Site Type | Recommended Delay | Max Concurrent |
|---|---|---|
| Small blogs/personal sites | 3-5 seconds | 1 |
| Medium business sites | 1-3 seconds | 2-3 |
| Large platforms (Amazon, etc.) | 2-5 seconds | 1-2 |
| Public APIs with rate limits | Follow their limits | As allowed |
Random delays are critical. A perfectly regular 2.0-second interval between requests looks just as automated as no delay at all.
4. Use Proxy Rotation
If you're sending all requests from a single IP address, the site only needs to block one IP to stop you. Proxy rotation distributes requests across many IPs.
Types of Proxies
Datacenter proxies — Cheap ($1-5/GB) but easily detected. IP ranges are known to belong to cloud providers.
Residential proxies — More expensive ($5-15/GB) but much harder to detect. These are real ISP IPs from real households.
Mobile proxies — Most expensive ($20-40/GB) but nearly impossible to block. Blocking a mobile carrier IP range would block real users.
```python
import itertools

import requests

PROXIES = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
]

proxy_cycle = itertools.cycle(PROXIES)

def scrape_with_proxy(url: str) -> requests.Response:
    proxy = next(proxy_cycle)
    return requests.get(
        url,
        headers=get_headers(),
        proxies={"http": proxy, "https": proxy},
        timeout=30,
    )
```
Sticky Sessions vs Rotating
Use rotating proxies (new IP per request) for scraping different pages across a site. Use sticky sessions (same IP for a period) when you need to maintain login state or navigate multi-page flows.
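The difference is easy to encode. A minimal sketch of the two modes, reusing the hypothetical proxy URLs from above (the function names are illustrative):

```python
import itertools

import requests

PROXIES = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
]

_rotating = itertools.cycle(PROXIES)

def rotating_session() -> requests.Session:
    """New IP per call: good for scraping unrelated pages across a site."""
    s = requests.Session()
    proxy = next(_rotating)
    s.proxies = {"http": proxy, "https": proxy}
    return s

def sticky_session(proxy: str = PROXIES[0]) -> requests.Session:
    """Same IP for the session's lifetime: keeps login state and
    multi-page flows on one address."""
    s = requests.Session()
    s.proxies = {"http": proxy, "https": proxy}
    return s
```

Reuse one `sticky_session()` object for an entire login-to-checkout flow; call `rotating_session()` fresh for each independent page.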
5. Respect robots.txt (Strategically)
The robots.txt file tells crawlers which pages they're allowed to access. Respecting it keeps you on the right side of terms of service and reduces the chance of blocks:
```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def check_robots(url: str, user_agent: str = "*") -> bool:
    """Check if a URL is allowed by robots.txt."""
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
    rp = RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    return rp.can_fetch(user_agent, url)
```
Also check for Crawl-delay directives in robots.txt — some sites specify a minimum delay between requests.
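The standard library exposes this directive through `RobotFileParser.crawl_delay()`. A small helper that parses robots.txt text and falls back to a default when the directive is absent (in production you would fetch the file with `set_url()`/`read()` as above):

```python
from urllib.robotparser import RobotFileParser

def crawl_delay_from_text(robots_txt: str, user_agent: str = "*", default: float = 1.0) -> float:
    """Return the Crawl-delay that robots.txt specifies for user_agent,
    or `default` seconds when no directive is present."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    delay = rp.crawl_delay(user_agent)
    return float(delay) if delay is not None else default
```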
6. Handle Cookies and Sessions Properly
Anti-bot systems set cookies to track and challenge visitors. Discarding cookies between requests looks suspicious.
```python
import requests

session = requests.Session()

# First request: get cookies
session.get("https://target.com", headers=get_headers())

# Subsequent requests: cookies are automatically included
response = session.get("https://target.com/products", headers=get_headers())
```
Some sites set JavaScript cookies — cookies that can only be created by executing a JavaScript snippet. These require a real browser environment to handle.
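One common workaround, sketched here without reference to any particular anti-bot product, is to pass the JavaScript challenge once in a real browser and then copy the minted cookies into a plain `requests.Session` for fast follow-up requests. The cookie dicts below mirror the shape returned by Playwright's `context.cookies()`:

```python
import requests

def cookies_to_session(browser_cookies: list[dict], session: requests.Session) -> requests.Session:
    """Copy cookies harvested in a real browser (e.g. Playwright's
    context.cookies()) into a requests.Session, so later plain HTTP
    requests carry the clearance cookies the challenge minted."""
    for c in browser_cookies:
        session.cookies.set(
            c["name"], c["value"],
            domain=c.get("domain", ""),
            path=c.get("path", "/"),
        )
    return session

# Usage sketch (browser side):
#   with sync_playwright() as p:
#       ... open the page, let the challenge run ...
#       cookies = context.cookies()
#   session = cookies_to_session(cookies, requests.Session())
```

Note that clearance cookies are often bound to the IP and TLS fingerprint that earned them, so keep using the same proxy and client for the session.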
7. Rotate TLS Fingerprints
Advanced anti-bot systems fingerprint the TLS handshake itself. Different HTTP libraries produce different TLS signatures, and these can be compared against the claimed User-Agent.
Python's requests library has a distinctive TLS fingerprint that doesn't match any real browser. Libraries like curl_cffi and tls-client solve this:
```python
from curl_cffi import requests as cf_requests

# Impersonates Chrome's TLS fingerprint
response = cf_requests.get(
    "https://target.com",
    impersonate="chrome",
)
```
This matches Chrome's cipher suites, extensions, and handshake order — making the request indistinguishable from real Chrome traffic at the TLS layer.
8. Handle CAPTCHAs Gracefully
When you hit a CAPTCHA, you have several options:
Retry with a different IP — Often the CAPTCHA is triggered by IP reputation, not behavior. A new IP from a residential proxy may bypass it.
CAPTCHA solving services — Services like 2Captcha and Anti-Captcha use human workers to solve CAPTCHAs. Response times are 10-30 seconds and costs are $1-3 per 1000 solves.
Avoid them entirely — The best CAPTCHA strategy is not triggering one. All the other techniques in this guide reduce CAPTCHA frequency.
```python
import random
import time

import requests

def scrape_with_captcha_retry(url: str, max_retries: int = 3) -> str:
    """Retry with a fresh proxy on CAPTCHA pages.

    Uses proxy_cycle and get_headers() defined earlier."""
    for attempt in range(max_retries):
        proxy = next(proxy_cycle)
        response = requests.get(
            url,
            headers=get_headers(),
            proxies={"http": proxy, "https": proxy},
        )
        if response.status_code == 200 and "captcha" not in response.text.lower():
            return response.text
        time.sleep(random.uniform(2, 5))
    raise RuntimeError(f"Failed to scrape {url} after {max_retries} attempts")
```
9. Mimic Real Browser Navigation Patterns
Real users don't jump directly to deep product pages. They land on the homepage, browse categories, and navigate through results. Mimicking this pattern can bypass behavioral analysis:
```python
import random
import time

import requests

def mimic_browsing_session(target_urls: list[str]) -> list[requests.Response]:
    session = requests.Session()
    results = []
    # Start with the homepage, like a real visitor
    session.get("https://target.com", headers=get_headers())
    time.sleep(random.uniform(2, 4))
    for url in target_urls:
        response = session.get(url, headers={
            **get_headers(),
            "Referer": "https://target.com",
        })
        results.append(response)
        time.sleep(random.uniform(1, 3))
    return results
```
The Referer header is important — it tells the server where the request came from. A direct request to a product page with no Referer looks like bot traffic.
10. Use Headless Browsers with Stealth Plugins
When simple HTTP requests aren't enough, headless browsers with stealth modifications can bypass JavaScript challenges:
```python
from playwright.sync_api import sync_playwright

def stealth_scrape(url: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(
            headless=True,
            args=[
                "--disable-blink-features=AutomationControlled",
                "--no-sandbox",
            ],
        )
        context = browser.new_context(
            user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
            viewport={"width": 1920, "height": 1080},
            locale="en-US",
            timezone_id="America/New_York",
        )
        page = context.new_page()
        # Override automation detection before any page script runs
        page.add_init_script("""
            Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
            Object.defineProperty(navigator, 'plugins', { get: () => [1, 2, 3] });
            window.chrome = { runtime: {} };
        """)
        page.goto(url, wait_until="networkidle")
        content = page.content()
        browser.close()
        return content
```
This bypasses basic detection, but advanced systems like DataDome and PerimeterX check dozens of signals that are hard to fake consistently.
11. Distribute Requests Geographically
Anti-bot systems track the geographic distribution of requests. Getting 1000 requests from the same city within an hour is suspicious. Using proxies in multiple regions makes traffic look more natural:
```python
import random

import requests

GEO_PROXIES = {
    "us": "http://user:pass@us-proxy.example.com:8080",
    "uk": "http://user:pass@uk-proxy.example.com:8080",
    "de": "http://user:pass@de-proxy.example.com:8080",
    "jp": "http://user:pass@jp-proxy.example.com:8080",
}

# Match Accept-Language to the proxy region
LANG_MAP = {
    "us": "en-US,en;q=0.9",
    "uk": "en-GB,en;q=0.9",
    "de": "de-DE,de;q=0.9",
    "jp": "ja-JP,ja;q=0.9",
}

def scrape_with_geo_rotation(url: str) -> requests.Response:
    region = random.choice(list(GEO_PROXIES.keys()))
    proxy = GEO_PROXIES[region]
    headers = get_headers()
    headers["Accept-Language"] = LANG_MAP.get(region, "en-US,en;q=0.9")
    return requests.get(url, headers=headers, proxies={"http": proxy, "https": proxy})
```
Match your Accept-Language header and timezone to your proxy's location. A request claiming to be from Tokyo with a US English Accept-Language raises flags.
12. Use a Managed Scraping API
The most effective anti-blocking technique is to not build anti-detection yourself. A managed scraping API bundles all of the above techniques — and keeps them updated as anti-bot systems evolve.
SimpleCrawl handles the full anti-blocking stack:
```python
import requests

# All anti-detection is handled automatically
response = requests.post(
    "https://api.simplecrawl.com/v1/scrape",
    headers={"Authorization": "Bearer sc_your_api_key"},
    json={
        "url": "https://heavily-protected-site.com/products",
        "format": "markdown",
    },
)

# Clean data, no blocks
data = response.json()
print(data["markdown"])
```
Under the hood, SimpleCrawl manages:
- Residential proxy rotation across 195+ countries
- TLS fingerprint matching per browser identity
- Automatic CAPTCHA solving
- Cookie and session management
- JavaScript rendering with stealth browser profiles
- Automatic retry with different proxy/fingerprint combinations
This is the approach recommended for production scrapers. Building and maintaining your own anti-detection infrastructure is a full-time job — and the anti-bot companies are updating their detection faster than you can update your evasion.
Anti-Detection Checklist
Before deploying a scraper to production, verify each layer:
- Headers include a recent, realistic User-Agent
- `Sec-Fetch-*` headers are present and correct
- User-Agent rotation is enabled
- Request delays are randomized (not fixed intervals)
- Proxy rotation is configured (residential for sensitive targets)
- TLS fingerprint matches the claimed User-Agent
- Cookies are preserved across requests in a session
- `Referer` header is set correctly
- `robots.txt` has been reviewed
- CAPTCHA retry logic is implemented
- Geographic distribution matches target audience
- Error handling doesn't retry infinitely (which looks like a DDoS)
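Several of these checks can be automated before deployment. A sketch of a pre-flight validator for a header set (the function name and the `min_chrome_major` cutoff are illustrative, not a standard API):

```python
def preflight_check(headers: dict, min_chrome_major: int = 120) -> list[str]:
    """Return a list of problems found in a header set; an empty
    list means the basic header checks above all pass."""
    problems = []
    ua = headers.get("User-Agent", "")
    if not ua or "python-requests" in ua:
        problems.append("missing or non-browser User-Agent")
    if "Chrome/" in ua:
        # Stale browser versions are a giveaway
        major = int(ua.split("Chrome/")[1].split(".")[0])
        if major < min_chrome_major:
            problems.append(f"stale Chrome version {major}")
    if not any(h.startswith("Sec-Fetch-") for h in headers):
        problems.append("Sec-Fetch-* headers missing")
    if "Accept-Language" not in headers:
        problems.append("Accept-Language missing")
    return problems
```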
Ethical Considerations
Scraping responsibly means:
- Respect rate limits — Don't overload servers. Your scraper is a guest on their infrastructure.
- Check Terms of Service — Some sites explicitly prohibit scraping. Know the legal landscape in your jurisdiction.
- Don't scrape personal data — GDPR, CCPA, and similar regulations apply to scraped data too.
- Identify yourself — For large-scale crawling, consider adding contact info to your User-Agent string.
- Cache aggressively — Don't re-scrape pages that haven't changed. It's wasteful for you and the target.
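The caching point can lean on HTTP itself: conditional requests with `If-None-Match` and `If-Modified-Since` let the server answer 304 Not Modified instead of resending an unchanged page. A minimal in-memory sketch (the class name is illustrative; real deployments would persist the cache):

```python
import requests

class ConditionalCache:
    """Tiny in-memory cache that revalidates with ETag /
    Last-Modified instead of re-downloading unchanged pages."""

    def __init__(self):
        self._store = {}  # url -> (etag, last_modified, body)

    def get(self, url: str, session=None, **kwargs) -> str:
        session = session or requests.Session()
        headers = dict(kwargs.pop("headers", {}))
        cached = self._store.get(url)
        if cached:
            etag, last_mod, _ = cached
            if etag:
                headers["If-None-Match"] = etag
            if last_mod:
                headers["If-Modified-Since"] = last_mod
        resp = session.get(url, headers=headers, **kwargs)
        if resp.status_code == 304 and cached:
            return cached[2]  # server says unchanged: reuse cached body
        self._store[url] = (resp.headers.get("ETag"),
                            resp.headers.get("Last-Modified"),
                            resp.text)
        return resp.text
```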
FAQ
Why am I getting 403 Forbidden when scraping?
A 403 means the server is actively blocking your request. The most common causes are: missing or suspicious User-Agent header, IP address blacklisting, missing cookies from a prior JavaScript challenge, or your TLS fingerprint not matching the claimed browser. Start by fixing headers and adding proxy rotation.
How do I scrape Cloudflare-protected websites?
Cloudflare uses JavaScript challenges, CAPTCHAs, and browser fingerprinting. To bypass Cloudflare: use a real browser (Playwright with stealth), use residential proxies, solve the JavaScript challenge on first visit and preserve the cf_clearance cookie, or use a scraping API like SimpleCrawl that handles Cloudflare automatically.
Is it legal to scrape websites?
In the US, hiQ Labs v. LinkedIn suggested that scraping publicly available data does not, by itself, violate the Computer Fraud and Abuse Act. Later rulings in the same litigation still found hiQ liable for breaching LinkedIn's User Agreement, so violating a site's Terms of Service can expose you to breach-of-contract claims. EU law is more nuanced, with GDPR implications for personal data. Consult a lawyer for your specific use case.
How many requests per second can I send without getting blocked?
There's no universal answer — it depends on the target site's anti-bot sensitivity. Conservative starting points: 1 request every 2-3 seconds for a single IP, or 5-10 requests per second across a pool of 50+ rotating proxies. Monitor for 429 (Too Many Requests) and 403 responses and back off when they appear.
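Backing off can be made mechanical with exponentially growing delays. A sketch, where `fetch` is any callable returning a response-like object (for example a wrapper around `requests.get`); the helper name is illustrative:

```python
import random
import time

def backoff_fetch(fetch, url: str, max_attempts: int = 5, base_delay: float = 2.0):
    """Call fetch(url), backing off exponentially (with jitter) on
    403/429 responses. Raises after max_attempts blocked attempts."""
    for attempt in range(max_attempts):
        response = fetch(url)
        if response.status_code not in (403, 429):
            return response
        # base, 2*base, 4*base, ... plus jitter so retries don't align
        delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
        time.sleep(delay)
    raise RuntimeError(f"{url}: still blocked after {max_attempts} attempts")
```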
Do I need residential proxies or are datacenter proxies enough?
For sites without aggressive anti-bot protection (small businesses, blogs, government sites), datacenter proxies work fine. For large commercial sites (Amazon, LinkedIn, Google) and sites using Cloudflare, DataDome, or PerimeterX, residential proxies are essentially required. Mobile proxies are the nuclear option for the most heavily protected targets.
Ready to try SimpleCrawl?
We're building the simplest web scraping API for AI. Join the waitlist and get 500 free credits at launch.