How to Scrape JavaScript-Heavy Websites in 2026
A developer's guide to scraping SPAs, React apps, and dynamically rendered pages. Covers headless browsers, rendering strategies, and modern API-based approaches.
Why Scraping JavaScript Websites Is Different
If you've tried to scrape a JavaScript website with a simple HTTP request, you've seen the problem: the response contains an empty `<div id="root"></div>` and a pile of script tags. The actual content isn't in the HTML — it's rendered by JavaScript after the page loads in a browser.
Quick answer: To scrape JavaScript-heavy websites, you need a tool that executes JavaScript — either a headless browser like Playwright/Puppeteer, or a web scraping API like SimpleCrawl that handles rendering for you.
This is the reality of the modern web. Over 70% of the top 10,000 websites use JavaScript frameworks like React, Vue, Angular, or Next.js for rendering. Single-page applications (SPAs), infinite scroll feeds, and client-side data fetching are everywhere. Traditional scraping tools that only parse the initial HTML response will miss most of the content.
This guide covers every practical approach to scraping dynamic websites in 2026, from running your own headless browser to using API-based solutions that abstract away the complexity.
Understanding How JavaScript Rendering Works
Before choosing a scraping strategy, it helps to understand what's happening under the hood when a browser loads a JavaScript-heavy page.
Server-Side Rendering (SSR) vs Client-Side Rendering (CSR)
Server-Side Rendered (SSR) pages send complete HTML from the server. The JavaScript on the page enhances interactivity but the content is already in the HTML response. Sites built with Next.js, Nuxt, or Remix often use SSR. These are relatively straightforward to scrape — a regular HTTP request will get the content.
Client-Side Rendered (CSR) pages send a minimal HTML shell and rely on JavaScript to fetch data and build the DOM. React SPAs, Angular apps, and many dashboards work this way. These require a browser environment to scrape.
How to Tell If a Site Uses CSR
The quickest test:
```bash
# Fetch with curl (no JavaScript execution)
curl -s "https://target-site.com" | head -100

# If you see something like this, it's client-side rendered:
# <div id="root"></div>
# <script src="/static/js/main.abc123.js"></script>
```
You can also check in Chrome DevTools: right-click → View Page Source. If the source is mostly empty with script tags, JavaScript rendering is required.
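When you're triaging many sites, this check can be automated. Below is a rough heuristic sketch (the `looks_client_rendered` helper and its thresholds are illustrative, not a standard test): it flags HTML whose visible text is tiny relative to its markup, or that contains an empty root container.

```python
import re

def looks_client_rendered(html: str) -> bool:
    """Heuristic: guess whether raw HTML is a CSR shell.

    A CSR shell has almost no visible text outside script tags,
    often just an empty root div. Thresholds are arbitrary.
    """
    # Drop script/style blocks, then strip all remaining tags
    stripped = re.sub(r"<(script|style)[^>]*>.*?</\1>", "", html,
                      flags=re.DOTALL | re.IGNORECASE)
    visible_text = re.sub(r"<[^>]+>", " ", stripped)
    visible_text = " ".join(visible_text.split())

    # An empty root container is a strong CSR signal
    has_empty_root = bool(re.search(
        r'<div[^>]*id=["\'](root|app)["\'][^>]*>\s*</div>', html))

    return has_empty_root or len(visible_text) < 200

shell = '<html><body><div id="root"></div><script src="/main.js"></script></body></html>'
print(looks_client_rendered(shell))  # True
```

This will misfire on genuinely sparse pages, so treat it as a first-pass filter before a manual check.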
Approach 1: Headless Browsers (Playwright / Puppeteer)
The most common approach to scraping JavaScript websites is running a real browser in headless mode — no visible window, but full JavaScript execution capability.
Playwright (Recommended)
Playwright is the current best choice for headless scraping. It supports Chromium, Firefox, and WebKit, handles modern web features well, and has a cleaner API than Puppeteer.
```python
from playwright.sync_api import sync_playwright

def scrape_with_playwright(url: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")

        # Wait for specific content to load
        page.wait_for_selector("main", timeout=10000)

        content = page.content()
        browser.close()
        return content

html = scrape_with_playwright("https://example-spa.com/products")
```
Waiting for Dynamic Content
The trickiest part of headless scraping is knowing when the page is "done" loading. JavaScript apps often make multiple API calls after initial load:
```python
from playwright.sync_api import sync_playwright

def scrape_with_smart_waiting(url: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)

        # Strategy 1: Wait for network to be idle
        page.wait_for_load_state("networkidle")

        # Strategy 2: Wait for a specific element
        page.wait_for_selector("[data-testid='product-list']", timeout=15000)

        # Strategy 3: Wait for text content to appear
        page.wait_for_function(
            "document.querySelector('main')?.textContent?.length > 100"
        )

        content = page.inner_text("main")
        browser.close()
        return content
```
Handling Infinite Scroll
Many modern sites load content as you scroll. You need to simulate scrolling to trigger additional loads:
```python
from playwright.sync_api import sync_playwright

def scrape_infinite_scroll(url: str, max_scrolls: int = 10) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")

        previous_height = 0
        for _ in range(max_scrolls):
            # Scroll to the bottom and give new content time to load
            page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            page.wait_for_timeout(2000)
            current_height = page.evaluate("document.body.scrollHeight")
            # Stop once the page height no longer grows
            if current_height == previous_height:
                break
            previous_height = current_height

        content = page.content()
        browser.close()
        return content
```
The Problems with Running Your Own Headless Browser
Headless browsers work, but they come with significant operational overhead:
- Resource-intensive — Each browser instance uses 200-500MB of RAM
- Slow — Page loads take 3-15 seconds depending on site complexity
- Fragile — Sites detect headless browsers and block them (see avoiding blocks)
- Scaling headaches — Running 50 concurrent browser instances requires serious infrastructure
- Maintenance — Browser versions, dependencies, and anti-detection measures need constant updates
Approach 2: Intercepting API Calls
Many JavaScript SPAs fetch their data from backend APIs. If you can identify and call those APIs directly, you skip the rendering step entirely.
Finding the API Endpoints
Open Chrome DevTools → Network tab → filter by XHR/Fetch, and interact with the page to see which requests return the data you want. You can also automate the discovery with Playwright:
```python
from playwright.sync_api import sync_playwright

def discover_api_calls(url: str) -> list[dict]:
    """Record all API calls made by a page during loading."""
    api_calls = []
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        def handle_response(response):
            if "api" in response.url or "graphql" in response.url:
                api_calls.append({
                    "url": response.url,
                    "status": response.status,
                    "method": response.request.method,
                    "content_type": response.headers.get("content-type", ""),
                })

        page.on("response", handle_response)
        page.goto(url, wait_until="networkidle")
        browser.close()
    return api_calls

# Discover what APIs the site uses
calls = discover_api_calls("https://target-spa.com/products")
for call in calls:
    print(f"{call['method']} {call['url']} [{call['status']}]")
```
Calling APIs Directly
Once you've identified the data API, you can call it with plain HTTP requests — much faster and lighter than a headless browser:
```python
import requests

# Call the API the SPA uses internally
response = requests.get(
    "https://target-spa.com/api/v2/products",
    params={"page": 1, "limit": 50, "category": "electronics"},
    headers={
        "Accept": "application/json",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    },
)
products = response.json()
```
This approach is the fastest and most reliable option when it works. The caveats: many internal APIs require authentication tokens or session cookies, some reject requests that lack the right Origin or Referer headers, and undocumented API structures can change without notice. (CORS, by contrast, only restricts browsers; it doesn't block server-side HTTP clients.)
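When the API does cooperate, pagination is usually the next step. Here's a minimal sketch of a generic paginator; the fetch function is injected so the loop logic stays endpoint-agnostic, and the endpoint shown in the comment is the same hypothetical one as above.

```python
def fetch_all_pages(fetch_page, max_pages: int = 100) -> list:
    """Collect items from a paginated API until a page comes back empty.

    fetch_page(page_number) must return a list of items; injecting it
    keeps the pagination logic independent of any one endpoint.
    """
    items = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page)
        if not batch:
            break  # an empty page means we've run out of results
        items.extend(batch)
    return items

# With the requests example above, fetch_page could look like:
# fetch_page = lambda page: requests.get(
#     "https://target-spa.com/api/v2/products",
#     params={"page": page, "limit": 50},
# ).json().get("products", [])
```

Capping `max_pages` guards against APIs that keep returning data forever (or echo the last page indefinitely).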
Approach 3: Web Scraping APIs (Recommended)
For production use cases, a web scraping API that handles JavaScript rendering, anti-bot bypass, and content extraction is the most practical approach.
Using SimpleCrawl
SimpleCrawl renders JavaScript automatically. No headless browser setup, no waiting strategies, no anti-detection code:
```python
import requests

def scrape_js_site(url: str) -> dict:
    """Scrape any JavaScript-heavy site with a single API call."""
    response = requests.post(
        "https://api.simplecrawl.com/v1/scrape",
        headers={"Authorization": "Bearer sc_your_api_key"},
        json={
            "url": url,
            "format": "markdown",
            "wait_for": "networkidle",
        },
    )
    return response.json()

# Works on React SPAs, Angular apps, Vue sites — anything
result = scrape_js_site("https://complex-spa.com/dashboard")
print(result["markdown"])
```
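In production you'll typically wrap calls like this in a retry, since transient failures (timeouts, rate limits) are common in scraping. A minimal sketch; the backoff constants are arbitrary choices:

```python
import time

def with_retries(call, max_attempts: int = 3, base_delay: float = 1.0):
    """Retry a callable with exponential backoff.

    call() should raise on failure; transient scraping errors
    often succeed on a later attempt.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts, surface the error
            time.sleep(base_delay * 2 ** (attempt - 1))

# Usage with the scrape function above:
# result = with_retries(lambda: scrape_js_site("https://complex-spa.com/dashboard"))
```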
Why an API Beats DIY for JavaScript Scraping
| Factor | DIY Headless Browser | SimpleCrawl API |
|---|---|---|
| Setup time | Hours to days | Minutes |
| JS rendering | Manual configuration | Automatic |
| Anti-bot bypass | Build and maintain yourself | Built-in |
| Scaling | Manage browser fleet | API handles it |
| Cost at scale | $50-500/mo server costs | Pay per request |
| Maintenance | Ongoing | Zero |
For scraping a handful of pages during development, a local Playwright script works fine. For anything in production — especially at scale — the API approach saves an enormous amount of engineering time.
Handling Common JavaScript Scraping Challenges
Shadow DOM
Web Components use Shadow DOM, which encapsulates elements away from the regular DOM tree. Playwright can pierce shadow DOM:
```python
# Playwright's CSS selectors pierce open shadow roots automatically,
# so you can target elements inside a web component directly
element = page.locator("my-component .content")
text = element.inner_text()
```
Client-Side Routing
SPAs use client-side routing — the URL changes but no new page load occurs. To scrape multiple "pages" of an SPA:
```python
from playwright.sync_api import sync_playwright

def scrape_spa_routes(base_url: str, routes: list[str]) -> list[dict]:
    results = []
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(base_url, wait_until="networkidle")
        for route in routes:
            # Navigate using the SPA's router; pass the route as an
            # argument rather than interpolating it into the script
            page.evaluate(
                "route => window.history.pushState(null, '', route)", route
            )
            page.evaluate("window.dispatchEvent(new PopStateEvent('popstate'))")
            page.wait_for_timeout(2000)
            page.wait_for_load_state("networkidle")
            results.append({
                "route": route,
                "content": page.inner_text("main"),
            })
        browser.close()
    return results
```
Lazy-Loaded Images and Content
Content that loads only when visible requires scrolling into view:
```python
# Scroll an element into view to trigger lazy loading
page.locator("#section-5").scroll_into_view_if_needed()
page.wait_for_timeout(1000)
```
WebSocket-Driven Content
Some real-time apps push data via WebSockets rather than REST APIs. You can intercept WebSocket frames in Playwright:
```python
from playwright.sync_api import sync_playwright

def capture_websocket_data(url: str) -> list:
    ws_messages = []
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        def on_websocket(ws):
            # Record every frame the server pushes to the page
            ws.on("framereceived", lambda payload: ws_messages.append(payload))

        page.on("websocket", on_websocket)
        page.goto(url, wait_until="networkidle")
        page.wait_for_timeout(5000)  # capture frames for five seconds
        browser.close()
    return ws_messages
```
Framework-Specific Tips
React Applications
React apps typically hydrate a root `<div>` element. Wait for the React tree to mount:

```python
page.wait_for_function(
    "document.querySelector('#root')?.children?.length > 0"
)
```
For Next.js apps with SSR, check if `__NEXT_DATA__` is present — you might be able to extract data directly from the embedded JSON:

```python
next_data = page.evaluate(
    "JSON.parse(document.getElementById('__NEXT_DATA__')?.textContent || '{}')"
)
props = next_data.get("props", {}).get("pageProps", {})
```
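Because `__NEXT_DATA__` is embedded in the initial HTML of SSR pages, you can often skip the browser entirely and pull it from a plain HTTP response. A stdlib-only sketch (the `extract_next_data` helper is illustrative, not a Next.js API):

```python
import json
import re

def extract_next_data(html: str) -> dict:
    """Pull the embedded __NEXT_DATA__ JSON out of raw Next.js HTML.

    On SSR pages this blob is present in the initial response,
    so a plain HTTP fetch is enough; no browser required.
    """
    match = re.search(
        r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>',
        html,
        flags=re.DOTALL,
    )
    return json.loads(match.group(1)) if match else {}

html = '<script id="__NEXT_DATA__" type="application/json">{"props": {"pageProps": {"title": "Hello"}}}</script>'
print(extract_next_data(html)["props"]["pageProps"]["title"])  # Hello
```

In practice you'd fetch the page with requests first and pass `response.text` to the helper.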
Angular Applications
Angular apps use zones for change detection. Wait for Angular stability:
```python
page.wait_for_function(
    "window.getAllAngularTestabilities?.()?.every(t => t.isStable())"
)
```
Vue Applications
Vue 3 apps expose the app instance on the root element:
```python
page.wait_for_function(
    "document.querySelector('#app')?.__vue_app__ !== undefined"
)
```
Performance Optimization Tips
- Block unnecessary resources — Skip images, fonts, and analytics scripts to speed up rendering:

```python
def block_resources(route):
    if route.request.resource_type in ["image", "font", "media"]:
        route.abort()
    else:
        route.continue_()

page.route("**/*", block_resources)
```

- Reuse browser contexts — Don't launch a new browser for every page. Create multiple contexts in one browser instance.
- Set viewport size — Some sites serve different content based on viewport. Set it explicitly:

```python
page.set_viewport_size({"width": 1920, "height": 1080})
```

- Use request interception to skip API calls that return data you don't need (analytics, tracking, recommendations).
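The resource-blocking and request-interception tips above can be combined into a single blocking policy. A sketch where the blocked types and URL substrings are illustrative choices, not a canonical list:

```python
BLOCKED_TYPES = {"image", "font", "media"}
BLOCKED_URL_PARTS = ("analytics", "tracking", "doubleclick")  # illustrative

def should_block(resource_type: str, url: str) -> bool:
    """Decide whether a request should be aborted during scraping."""
    if resource_type in BLOCKED_TYPES:
        return True
    # Block known tracking endpoints regardless of resource type
    return any(part in url.lower() for part in BLOCKED_URL_PARTS)

# Wiring it into Playwright's router:
# page.route("**/*", lambda route: route.abort()
#            if should_block(route.request.resource_type, route.request.url)
#            else route.continue_())
```

Keeping the policy as a pure function makes it easy to unit-test without launching a browser.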
FAQ
Can I scrape a React SPA without a headless browser?
If the React app uses Server-Side Rendering (SSR) via Next.js, you can often scrape the initial HTML response with a simple HTTP request. For pure client-side rendered React apps, you need JavaScript execution — either a headless browser or a scraping API that renders JavaScript for you.
How long should I wait for JavaScript to render?
It depends on the site. A networkidle wait (no network requests for 500ms) works for most sites. For heavy SPAs, combine it with element-specific waits. Most well-built sites render within 3-5 seconds. If you're waiting longer than 15 seconds, something is likely wrong.
Is headless Chrome detectable?
Yes. Sites can detect headless browsers through various signals: the navigator.webdriver flag, missing browser plugins, incorrect viewport properties, and more. Tools like Playwright Stealth and puppeteer-extra-plugin-stealth help, but determined anti-bot systems can still detect headless browsers. A scraping API like SimpleCrawl handles this for you with built-in fingerprint management.
What's the difference between Playwright and Puppeteer in 2026?
Playwright supports multiple browsers (Chromium, Firefox, WebKit), has better auto-waiting, built-in test generators, and more active development. Puppeteer is Chromium-only but has a larger ecosystem of community plugins. For new projects, Playwright is the recommended choice.
How do I scrape a website with JavaScript rendering at scale?
For large-scale JavaScript scraping (thousands of pages), running your own headless browser fleet becomes expensive and complex. A scraping API is more cost-effective: you pay per request instead of maintaining server infrastructure. SimpleCrawl handles JavaScript rendering, scaling, and anti-bot measures with a single API endpoint.
Ready to try SimpleCrawl?
We're building the simplest web scraping API for AI. Join the waitlist and get 500 free credits at launch.