SimpleCrawl
Web Scraping · JavaScript · Headless Browsers

How to Scrape JavaScript-Heavy Websites in 2026

A developer's guide to scraping SPAs, React apps, and dynamically rendered pages. Covers headless browsers, rendering strategies, and modern API-based approaches.

SimpleCrawl Team · February 25, 2026 · 10 min read

Why Scraping JavaScript Websites Is Different

If you've tried to scrape a JavaScript website with a simple HTTP request, you've seen the problem: the response contains an empty <div id="root"></div> and a pile of script tags. The actual content isn't in the HTML — it's rendered by JavaScript after the page loads in a browser.

Quick answer: To scrape JavaScript-heavy websites, you need a tool that executes JavaScript — either a headless browser like Playwright/Puppeteer, or a web scraping API like SimpleCrawl that handles rendering for you.

This is the reality of the modern web. Over 70% of the top 10,000 websites use JavaScript frameworks like React, Vue, Angular, or Next.js for rendering. Single-page applications (SPAs), infinite scroll feeds, and client-side data fetching are everywhere. Traditional scraping tools that only parse the initial HTML response will miss most of the content.

This guide covers every practical approach to scraping dynamic websites in 2026, from running your own headless browser to using API-based solutions that abstract away the complexity.

Understanding How JavaScript Rendering Works

Before choosing a scraping strategy, it helps to understand what's happening under the hood when a browser loads a JavaScript-heavy page.

Server-Side Rendering (SSR) vs Client-Side Rendering (CSR)

Server-Side Rendered (SSR) pages send complete HTML from the server. The JavaScript on the page enhances interactivity but the content is already in the HTML response. Sites built with Next.js, Nuxt, or Remix often use SSR. These are relatively straightforward to scrape — a regular HTTP request will get the content.

Client-Side Rendered (CSR) pages send a minimal HTML shell and rely on JavaScript to fetch data and build the DOM. React SPAs, Angular apps, and many dashboards work this way. These require a browser environment to scrape.

How to Tell If a Site Uses CSR

The quickest test:

# Fetch with curl (no JavaScript execution)
curl -s "https://target-site.com" | head -100

# If you see something like this, it's client-side rendered:
# <div id="root"></div>
# <script src="/static/js/main.abc123.js"></script>

You can also check in Chrome DevTools: right-click → View Page Source. If the source is mostly empty with script tags, JavaScript rendering is required.
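The curl test can also be automated when you're triaging many sites. Below is a rough heuristic sketch — the mount-point ids (`root`, `app`, `__next`) and the 200-character threshold are assumptions you'd tune per target, not a standard:

```python
import re

def looks_client_rendered(html: str) -> bool:
    """Heuristic: a near-empty body containing a typical SPA mount point
    and script tags usually means the content is rendered client-side."""
    # Drop script/style blocks, then strip all remaining tags
    stripped = re.sub(r"<(script|style)[^>]*>.*?</\1>", "", html, flags=re.S | re.I)
    text = re.sub(r"<[^>]+>", " ", stripped)
    visible = " ".join(text.split())
    # Common SPA mount-point ids (an assumption, not exhaustive)
    has_mount = bool(re.search(r'id=["\'](root|app|__next)["\']', html))
    return has_mount and len(visible) < 200
```

Feed it the body of a plain `requests.get()` response: a `True` result suggests you need JavaScript execution; a `False` result means the initial HTML may already contain the content.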

Approach 1: Headless Browsers (Playwright / Puppeteer)

The most common approach to scraping JavaScript websites is running a real browser in headless mode — no visible window, but full JavaScript execution capability.

Playwright is the current best choice for headless scraping. It supports Chromium, Firefox, and WebKit, handles modern web features well, and has a cleaner API than Puppeteer.

from playwright.sync_api import sync_playwright

def scrape_with_playwright(url: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        page.goto(url, wait_until="networkidle")

        # Wait for specific content to load
        page.wait_for_selector("main", timeout=10000)

        content = page.content()
        browser.close()

    return content

html = scrape_with_playwright("https://example-spa.com/products")

Waiting for Dynamic Content

The trickiest part of headless scraping is knowing when the page is "done" loading. JavaScript apps often make multiple API calls after initial load:

from playwright.sync_api import sync_playwright

def scrape_with_smart_waiting(url: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        page.goto(url)

        # Strategy 1: Wait for network to be idle
        page.wait_for_load_state("networkidle")

        # Strategy 2: Wait for a specific element
        page.wait_for_selector("[data-testid='product-list']", timeout=15000)

        # Strategy 3: Wait for text content to appear
        page.wait_for_function(
            "document.querySelector('main')?.textContent?.length > 100"
        )

        content = page.inner_text("main")
        browser.close()

    return content

Handling Infinite Scroll

Many modern sites load content as you scroll. You need to simulate scrolling to trigger additional loads:

from playwright.sync_api import sync_playwright

def scrape_infinite_scroll(url: str, max_scrolls: int = 10) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")

        previous_height = 0
        for _ in range(max_scrolls):
            page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            page.wait_for_timeout(2000)

            current_height = page.evaluate("document.body.scrollHeight")
            if current_height == previous_height:
                break
            previous_height = current_height

        content = page.content()
        browser.close()

    return content

The Problems with Running Your Own Headless Browser

Headless browsers work, but they come with significant operational overhead:

  • Resource-intensive — Each browser instance uses 200-500MB of RAM
  • Slow — Page loads take 3-15 seconds depending on site complexity
  • Fragile — Sites detect headless browsers and block them
  • Scaling headaches — Running 50 concurrent browser instances requires serious infrastructure
  • Maintenance — Browser versions, dependencies, and anti-detection measures need constant updates

Approach 2: Intercepting API Calls

Many JavaScript SPAs fetch their data from backend APIs. If you can identify and call those APIs directly, you skip the rendering step entirely.

Finding the API Endpoints

Open Chrome DevTools → Network tab → filter by XHR/Fetch, then interact with the page. You can also automate the discovery with Playwright:

from playwright.sync_api import sync_playwright

def discover_api_calls(url: str) -> list[dict]:
    """Record all API calls made by a page during loading."""
    api_calls = []

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        def handle_response(response):
            if "api" in response.url or "graphql" in response.url:
                api_calls.append({
                    "url": response.url,
                    "status": response.status,
                    "method": response.request.method,
                    "content_type": response.headers.get("content-type", ""),
                })

        page.on("response", handle_response)
        page.goto(url, wait_until="networkidle")
        browser.close()

    return api_calls

# Discover what APIs the site uses
calls = discover_api_calls("https://target-spa.com/products")
for call in calls:
    print(f"{call['method']} {call['url']} [{call['status']}]")

Calling APIs Directly

Once you've identified the data API, you can call it with plain HTTP requests — much faster and lighter than a headless browser:

import requests

# Call the API the SPA uses internally
response = requests.get(
    "https://target-spa.com/api/v2/products",
    params={"page": 1, "limit": 50, "category": "electronics"},
    headers={
        "Accept": "application/json",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    }
)

products = response.json()

This approach is the fastest and most reliable option when it works. The catch: many internal APIs require authentication tokens or signed request headers, some reject requests missing the expected Origin or Referer values, and API structures change without notice.

For production use cases, a web scraping API that handles JavaScript rendering, anti-bot bypass, and content extraction is the most practical approach.

Using SimpleCrawl

SimpleCrawl renders JavaScript automatically. No headless browser setup, no waiting strategies, no anti-detection code:

import requests

def scrape_js_site(url: str) -> dict:
    """Scrape any JavaScript-heavy site with a single API call."""
    response = requests.post(
        "https://api.simplecrawl.com/v1/scrape",
        headers={"Authorization": "Bearer sc_your_api_key"},
        json={
            "url": url,
            "format": "markdown",
            "wait_for": "networkidle"
        }
    )
    return response.json()

# Works on React SPAs, Angular apps, Vue sites — anything
result = scrape_js_site("https://complex-spa.com/dashboard")
print(result["markdown"])

Why an API Beats DIY for JavaScript Scraping

Factor | DIY Headless Browser | SimpleCrawl API
Setup time | Hours to days | Minutes
JS rendering | Manual configuration | Automatic
Anti-bot bypass | Build and maintain yourself | Built-in
Scaling | Manage browser fleet | API handles it
Cost at scale | $50-500/mo server costs | Pay per request
Maintenance | Ongoing | Zero

For scraping a handful of pages during development, a local Playwright script works fine. For anything in production — especially at scale — the API approach saves an enormous amount of engineering time.

Handling Common JavaScript Scraping Challenges

Shadow DOM

Web Components use Shadow DOM, which encapsulates elements away from the regular DOM tree. Playwright's CSS selectors pierce open shadow roots automatically (closed shadow roots remain inaccessible):

# Playwright CSS selectors reach into open shadow roots by default
element = page.locator("my-component .content")
text = element.inner_text()

Client-Side Routing

SPAs use client-side routing — the URL changes but no new page load occurs. To scrape multiple "pages" of an SPA:

from playwright.sync_api import sync_playwright

def scrape_spa_routes(base_url: str, routes: list[str]) -> list[dict]:
    results = []

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(base_url, wait_until="networkidle")

        for route in routes:
            # Navigate using the SPA's router
            page.evaluate(f"window.history.pushState(null, '', '{route}')")
            page.evaluate("window.dispatchEvent(new PopStateEvent('popstate'))")
            page.wait_for_timeout(2000)
            page.wait_for_load_state("networkidle")

            results.append({
                "route": route,
                "content": page.inner_text("main"),
            })

        browser.close()

    return results

Lazy-Loaded Images and Content

Content that loads only when visible requires scrolling into view:

# Scroll an element into view to trigger lazy loading
page.locator("#section-5").scroll_into_view_if_needed()
page.wait_for_timeout(1000)

WebSocket-Driven Content

Some real-time apps push data via WebSockets rather than REST APIs. You can intercept WebSocket frames in Playwright:

from playwright.sync_api import sync_playwright

def capture_websocket_data(url: str) -> list:
    ws_messages = []

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        def on_websocket(ws):
            ws.on("framereceived", lambda payload: ws_messages.append(payload))

        page.on("websocket", on_websocket)
        page.goto(url, wait_until="networkidle")
        page.wait_for_timeout(5000)
        browser.close()

    return ws_messages

Framework-Specific Tips

React Applications

React apps typically mount into a root <div> element. Wait for the React tree to render:

page.wait_for_function(
    "document.querySelector('#root')?.children?.length > 0"
)

For Next.js apps with SSR, check if __NEXT_DATA__ is present — you might be able to extract data directly from the embedded JSON:

next_data = page.evaluate("JSON.parse(document.getElementById('__NEXT_DATA__')?.textContent || '{}')")
props = next_data.get("props", {}).get("pageProps", {})

Angular Applications

Angular apps use zones for change detection. Wait for Angular stability:

page.wait_for_function(
    "window.getAllAngularTestabilities?.()?.every(t => t.isStable())"
)

Vue Applications

Vue 3 apps expose the app instance on the root element:

page.wait_for_function(
    "document.querySelector('#app')?.__vue_app__ !== undefined"
)

Performance Optimization Tips

  1. Block unnecessary resources — Skip images, fonts, and analytics scripts to speed up rendering:

def block_resources(route):
    if route.request.resource_type in ["image", "font", "media"]:
        route.abort()
    else:
        route.continue_()

page.route("**/*", block_resources)

  2. Reuse browser contexts — Don't launch a new browser for every page. Create multiple contexts in one browser instance.

  3. Set viewport size — Some sites serve different content based on viewport. Set it explicitly:

page.set_viewport_size({"width": 1920, "height": 1080})

  4. Use request interception to skip API calls that return data you don't need (analytics, tracking, recommendations).

FAQ

Can I scrape a React SPA without a headless browser?

If the React app uses Server-Side Rendering (SSR) via Next.js, you can often scrape the initial HTML response with a simple HTTP request. For pure client-side rendered React apps, you need JavaScript execution — either a headless browser or a scraping API that renders JavaScript for you.
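For the SSR case, a minimal sketch of pulling the embedded JSON out of the raw HTML — assuming the standard `__NEXT_DATA__` script tag that Next.js embeds; `extract_next_data` is an illustrative helper, not a library function:

```python
import json
import re

def extract_next_data(html: str) -> dict:
    """Parse the __NEXT_DATA__ JSON blob from a server-rendered Next.js page.

    Returns an empty dict when the tag is absent (i.e., not a Next.js page).
    """
    m = re.search(
        r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>', html, flags=re.S
    )
    if not m:
        return {}
    return json.loads(m.group(1))
```

Pair this with a plain `requests.get()` and you get structured page data without ever starting a browser.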

How long should I wait for JavaScript to render?

It depends on the site. A networkidle wait (no network requests for 500ms) works for most sites. For heavy SPAs, combine it with element-specific waits. Most well-built sites render within 3-5 seconds. If you're waiting longer than 15 seconds, something is likely wrong.

Is headless Chrome detectable?

Yes. Sites can detect headless browsers through various signals: the navigator.webdriver flag, missing browser plugins, incorrect viewport properties, and more. Tools like Playwright Stealth and puppeteer-extra-plugin-stealth help, but determined anti-bot systems can still detect headless browsers. A scraping API like SimpleCrawl handles this for you with built-in fingerprint management.

What's the difference between Playwright and Puppeteer in 2026?

Playwright supports multiple browsers (Chromium, Firefox, WebKit), has better auto-waiting, built-in test generators, and more active development. Puppeteer is Chromium-only but has a larger ecosystem of community plugins. For new projects, Playwright is the recommended choice.

How do I scrape a website with JavaScript rendering at scale?

For large-scale JavaScript scraping (thousands of pages), running your own headless browser fleet becomes expensive and complex. A scraping API is more cost-effective: you pay per request instead of maintaining server infrastructure. SimpleCrawl handles JavaScript rendering, scaling, and anti-bot measures with a single API endpoint.

Ready to try SimpleCrawl?

We're building the simplest web scraping API for AI. Join the waitlist and get 500 free credits at launch.
