SimpleCrawl

Web Scraping with JavaScript: Node.js Guide (2026)

Master web scraping with JavaScript using fetch, cheerio, and Puppeteer. Learn practical data extraction techniques for Node.js, plus how SimpleCrawl makes it effortless.

8 min read
javascript · node.js · tutorial · web scraping · puppeteer · cheerio

Web scraping with JavaScript leverages the same language that powers the websites you're scraping. With Node.js, you get access to powerful HTTP clients, DOM parsers, and headless browser automation tools that make extracting web data straightforward. This guide covers every major JavaScript scraping approach — from lightweight fetch + cheerio to full browser automation with Puppeteer — with production-ready code examples.

Why JavaScript for Web Scraping?

JavaScript brings unique advantages to web scraping:

  • Same language as the target — you already understand the DOM, CSS selectors, and how web pages work
  • Async by default — Node.js's event loop makes concurrent HTTP requests natural
  • Puppeteer integration — Chrome's official headless browser API is a first-class JavaScript tool
  • NPM ecosystem — thousands of packages for parsing, crawling, and data processing
  • Full-stack workflows — scrape data and serve it through an Express/Next.js API in the same language

For Python alternatives, see our Python scraping guide. For type-safe scraping, check the TypeScript guide.

Setting Up Your Environment

mkdir my-scraper && cd my-scraper
npm init -y
npm install cheerio puppeteer

If you're using ES modules (recommended), add this to your package.json:

{
  "type": "module"
}

Method 1: Fetch + Cheerio (Static Pages)

For server-rendered HTML pages that don't require JavaScript execution, fetch (built into Node.js 18+) and cheerio (jQuery-like DOM parser) are the fastest approach.

Basic Scraping

import * as cheerio from "cheerio";

async function scrapeHackerNews() {
  const response = await fetch("https://news.ycombinator.com/");
  const html = await response.text();
  const $ = cheerio.load(html);

  const stories = [];

  $(".athing").each((i, element) => {
    const title = $(element).find(".titleline a").first().text();
    const url = $(element).find(".titleline a").first().attr("href");
    const rank = $(element).find(".rank").text().replace(".", "");

    stories.push({ rank: parseInt(rank, 10), title, url });
  });

  return stories;
}

const stories = await scrapeHackerNews();
stories.slice(0, 10).forEach((s) => {
  console.log(`${s.rank}. ${s.title}`);
  console.log(`   ${s.url}\n`);
});

Handling Headers and Cookies

async function fetchWithHeaders(url) {
  const response = await fetch(url, {
    headers: {
      "User-Agent":
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) " +
        "AppleWebKit/537.36 Chrome/122.0.0.0 Safari/537.36",
      Accept: "text/html,application/xhtml+xml,application/xml;q=0.9",
      "Accept-Language": "en-US,en;q=0.9",
    },
  });

  if (!response.ok) {
    throw new Error(`HTTP ${response.status}: ${response.statusText}`);
  }

  return response.text();
}
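The heading promises cookies as well as headers, and they need special handling: Node's built-in fetch does not persist cookies across requests, so a session cookie returned via Set-Cookie must be forwarded manually on the next request. A minimal sketch (`buildCookieHeader` is a hypothetical helper; `response.headers.getSetCookie()` is available in recent Node releases):

```javascript
// Node's built-in fetch does not persist cookies between requests, so
// session cookies must be carried over by hand. This helper turns the
// Set-Cookie values from one response into a Cookie header for the next
// request, keeping only "name=value" and dropping attributes.
function buildCookieHeader(setCookieValues) {
  return setCookieValues
    .map((c) => c.split(";")[0].trim())
    .filter(Boolean)
    .join("; ");
}

// Example input, shaped like response.headers.getSetCookie() output:
const cookieHeader = buildCookieHeader([
  "session=abc123; Path=/; HttpOnly",
  "theme=dark; Max-Age=3600",
]);
// cookieHeader === "session=abc123; theme=dark"
// Pass it along: fetch(url, { headers: { Cookie: cookieHeader } })
```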

Pagination

async function scrapeAllPages(baseUrl, maxPages = 5) {
  const allItems = [];

  for (let page = 1; page <= maxPages; page++) {
    const url = `${baseUrl}?page=${page}`;
    const html = await fetchWithHeaders(url);
    const $ = cheerio.load(html);

    const items = [];
    $(".product-card").each((_, el) => {
      items.push({
        name: $(el).find("h3").text().trim(),
        price: $(el).find(".price").text().trim(),
        link: $(el).find("a").attr("href"),
      });
    });

    if (items.length === 0) break;
    allItems.push(...items);

    // Respect rate limits
    await new Promise((r) => setTimeout(r, 1500));
  }

  return allItems;
}
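One caveat with the `${baseUrl}?page=${page}` concatenation above: it produces a malformed URL whenever the base URL already carries a query string. The WHATWG URL API handles both cases; `pageUrl` here is a hypothetical helper, not part of any library:

```javascript
// Appending "?page=N" by string concatenation breaks when the base URL
// already has a query string. URL/URLSearchParams handles both cases.
function pageUrl(baseUrl, page) {
  const u = new URL(baseUrl);
  u.searchParams.set("page", String(page));
  return u.toString();
}

console.log(pageUrl("https://example.com/products", 2));
// https://example.com/products?page=2
console.log(pageUrl("https://example.com/products?sort=price", 3));
// https://example.com/products?sort=price&page=3
```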

Concurrent Requests

async function scrapeUrls(urls, concurrency = 5) {
  const results = [];
  const chunks = [];

  for (let i = 0; i < urls.length; i += concurrency) {
    chunks.push(urls.slice(i, i + concurrency));
  }

  for (const chunk of chunks) {
    const promises = chunk.map(async (url) => {
      try {
        const html = await fetchWithHeaders(url);
        const $ = cheerio.load(html);
        return {
          url,
          title: $("title").text(),
          h1: $("h1").first().text(),
          paragraphs: $("p").length,
        };
      } catch (error) {
        return { url, error: error.message };
      }
    });

    const batchResults = await Promise.all(promises);
    results.push(...batchResults);

    await new Promise((r) => setTimeout(r, 1000));
  }

  return results;
}
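The chunked approach above has one drawback: each batch waits for its slowest request before starting the next. A small worker pool keeps `concurrency` requests in flight at all times instead. This is a sketch (`mapWithConcurrency` is a hypothetical helper); results come back in input order:

```javascript
// A worker pool: `concurrency` workers pull items off a shared queue, so
// a slow request never stalls the whole batch. `worker` is any async
// function; results are stored by index to preserve input order.
async function mapWithConcurrency(items, worker, concurrency = 5) {
  const results = new Array(items.length);
  let next = 0;

  async function run() {
    while (next < items.length) {
      const i = next++; // safe: no await between read and increment
      results[i] = await worker(items[i]);
    }
  }

  await Promise.all(
    Array.from({ length: Math.min(concurrency, items.length) }, run)
  );
  return results;
}

// With the fetch-based scraper it might be used like:
// const pages = await mapWithConcurrency(urls, fetchWithHeaders, 5);
```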

Method 2: Puppeteer (JavaScript-Heavy Sites)

For single-page applications (React, Vue, Angular) that require JavaScript execution, Puppeteer automates a real Chrome browser.

Basic Puppeteer Scraping

import puppeteer from "puppeteer";

async function scrapeSPA(url) {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  await page.setUserAgent(
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) " +
      "AppleWebKit/537.36 Chrome/122.0.0.0 Safari/537.36"
  );

  await page.goto(url, { waitUntil: "networkidle0" });

  const data = await page.evaluate(() => {
    const products = [];
    document.querySelectorAll(".product-card").forEach((card) => {
      products.push({
        name: card.querySelector("h3")?.textContent?.trim(),
        price: card.querySelector(".price")?.textContent?.trim(),
        image: card.querySelector("img")?.src,
      });
    });
    return products;
  });

  await browser.close();
  return data;
}

Waiting for Dynamic Content

import puppeteer from "puppeteer";

async function scrapeWithWaits(url) {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto(url);

  // Wait for a specific selector
  await page.waitForSelector("[data-loaded='true']", { timeout: 10000 });

  // Wait for network requests to finish
  await page.waitForNetworkIdle({ idleTime: 500 });

  // Wait for specific text to appear
  await page.waitForFunction(
    () => document.body.innerText.includes("Results loaded"),
    { timeout: 10000 }
  );

  const content = await page.content();
  await browser.close();
  return content;
}

Handling Infinite Scroll

import puppeteer from "puppeteer";

async function scrapeInfiniteScroll(url, maxScrolls = 10) {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: "networkidle0" });

  let previousHeight = 0;

  for (let i = 0; i < maxScrolls; i++) {
    await page.evaluate("window.scrollTo(0, document.body.scrollHeight)");
    await new Promise((r) => setTimeout(r, 2000));

    const currentHeight = await page.evaluate("document.body.scrollHeight");
    if (currentHeight === previousHeight) break;
    previousHeight = currentHeight;
  }

  const items = await page.evaluate(() => {
    return Array.from(document.querySelectorAll(".item")).map((el) => ({
      title: el.querySelector("h3")?.textContent,
      description: el.querySelector("p")?.textContent,
    }));
  });

  await browser.close();
  return items;
}

Intercepting Network Requests

Sometimes the data you need is in API responses, not the DOM:

import puppeteer from "puppeteer";

async function interceptApiData(url) {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  const apiData = [];

  page.on("response", async (response) => {
    const reqUrl = response.url();
    if (reqUrl.includes("/api/products")) {
      try {
        const json = await response.json();
        apiData.push(json);
      } catch {
        // Not JSON, skip
      }
    }
  });

  await page.goto(url, { waitUntil: "networkidle0" });
  await browser.close();

  return apiData;
}

Method 3: SimpleCrawl API (Easiest)

The methods above leave you managing headers, retries, and (for Puppeteer) an entire browser, plus proxies and anti-bot measures. SimpleCrawl reduces this to a single fetch call:

const SIMPLECRAWL_API_KEY = "sc_your_api_key";

async function scrape(url, format = "markdown") {
  const response = await fetch("https://api.simplecrawl.com/v1/scrape", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${SIMPLECRAWL_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ url, format }),
  });

  return response.json();
}

const result = await scrape("https://example.com/products");
console.log(result.markdown);

Structured Extraction

async function extractStructured(url, schema) {
  const response = await fetch("https://api.simplecrawl.com/v1/scrape", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${SIMPLECRAWL_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ url, format: "extract", schema }),
  });

  return response.json();
}

const products = await extractStructured("https://example.com/products", {
  products: [
    {
      name: "string",
      price: "number",
      rating: "number",
      in_stock: "boolean",
    },
  ],
});

Batch Scraping with Promise.all

async function batchScrape(urls) {
  const results = await Promise.all(
    urls.map((url) => scrape(url, "markdown"))
  );
  return results;
}
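Note that `Promise.all` rejects as soon as any single scrape fails, discarding the results that already succeeded. `Promise.allSettled` returns an outcome per URL instead, so one bad page doesn't sink the batch. A sketch (`batchScrapeSettled` is a hypothetical variant; pass in any per-URL scraper, such as the `scrape` helper above):

```javascript
// Promise.allSettled never rejects: each entry is either
// { status: "fulfilled", value } or { status: "rejected", reason },
// which we normalize into a per-URL result object.
async function batchScrapeSettled(urls, scrapeFn) {
  const settled = await Promise.allSettled(urls.map((url) => scrapeFn(url)));
  return settled.map((result, i) => ({
    url: urls[i],
    ok: result.status === "fulfilled",
    data: result.status === "fulfilled" ? result.value : null,
    error: result.status === "rejected" ? String(result.reason) : null,
  }));
}
```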

See the pricing page for API credits and rate limits.

Saving Scraped Data

To JSON

import { writeFile } from "fs/promises";

await writeFile("data.json", JSON.stringify(data, null, 2));

To CSV

import { writeFile } from "fs/promises";

function toCSV(data) {
  if (!data.length) return "";
  const headers = Object.keys(data[0]).join(",");
  const rows = data.map((row) =>
    Object.values(row)
      .map((v) => `"${String(v).replace(/"/g, '""')}"`)
      .join(",")
  );
  return [headers, ...rows].join("\n");
}

await writeFile("data.csv", toCSV(products));
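A quick sanity check of the escaping: the `replace(/"/g, '""')` step doubles embedded quotes and the surrounding quotes protect embedded commas, per the usual CSV convention. One assumption worth noting: every row must have the same keys in the same order, since the header comes from the first row only. (`toCSV` is repeated below so the example runs standalone.)

```javascript
// Same toCSV as above, repeated here so this snippet is self-contained.
function toCSV(data) {
  if (!data.length) return "";
  const headers = Object.keys(data[0]).join(",");
  const rows = data.map((row) =>
    Object.values(row)
      .map((v) => `"${String(v).replace(/"/g, '""')}"`)
      .join(",")
  );
  return [headers, ...rows].join("\n");
}

const csv = toCSV([{ name: 'Widget, "Deluxe"', price: 9.99 }]);
console.log(csv);
// name,price
// "Widget, ""Deluxe""","9.99"
```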

Avoiding Common Anti-Bot Measures

Rotating User Agents

const USER_AGENTS = [
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 Chrome/122.0.0.0 Safari/537.36",
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/122.0.0.0 Safari/537.36",
  "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/122.0.0.0 Safari/537.36",
];

function randomUserAgent() {
  return USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)];
}

Rate Limiting with Backoff

async function fetchWithRetry(url, maxRetries = 3) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const response = await fetch(url, {
      headers: { "User-Agent": randomUserAgent() },
    });

    if (response.status === 429) {
      const wait = Math.pow(2, attempt) * 1000 + Math.random() * 1000;
      console.log(`Rate limited. Waiting ${(wait / 1000).toFixed(1)}s...`);
      await new Promise((r) => setTimeout(r, wait));
      continue;
    }

    return response;
  }

  throw new Error(`Failed after ${maxRetries} retries`);
}
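Many servers that return 429 also send a Retry-After header (either seconds or an HTTP date). Honoring it beats guessing; fall back to the exponential schedule (2^attempt seconds, matching `fetchWithRetry` above) only when it's absent. `retryDelayMs` is a hypothetical helper for illustration:

```javascript
// Compute how long to wait before retrying a 429. Retry-After may be a
// number of seconds ("5") or an HTTP date; otherwise fall back to
// exponential backoff: 1s, 2s, 4s, ... for attempts 0, 1, 2, ...
function retryDelayMs(retryAfter, attempt) {
  if (retryAfter) {
    const seconds = Number(retryAfter);
    if (!Number.isNaN(seconds)) return seconds * 1000;
    const date = Date.parse(retryAfter);
    if (!Number.isNaN(date)) return Math.max(0, date - Date.now());
  }
  return Math.pow(2, attempt) * 1000;
}

console.log(retryDelayMs("5", 0)); // 5000 — server asked for 5 seconds
console.log(retryDelayMs(null, 2)); // 4000 — exponential fallback
```

Inside `fetchWithRetry`, this would replace the hardcoded wait: `const wait = retryDelayMs(response.headers.get("Retry-After"), attempt);`.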

Choosing the Right Approach

| Approach        | Best For            | JS Rendering | Speed | Complexity |
| --------------- | ------------------- | ------------ | ----- | ---------- |
| fetch + cheerio | Static HTML         | No           | Fast  | Low        |
| Puppeteer       | SPAs, dynamic sites | Yes          | Slow  | Medium     |
| SimpleCrawl API | Everything          | Yes          | Fast  | Very low   |

For scraping specific sites, check our domain guides: Amazon, Google, Reddit, LinkedIn. For comparisons, see our best web scraping APIs page.

FAQ

Should I use Puppeteer or Playwright for JavaScript scraping?

Both are excellent. Puppeteer is Chrome-only and maintained by the Chrome team. Playwright supports multiple browsers and has slightly better API ergonomics. For most scraping tasks, either works. See our Node.js guide for Playwright examples.

Can I scrape websites with JavaScript without a headless browser?

Yes, for server-rendered pages. Use fetch + cheerio for sites that serve HTML without requiring JavaScript execution. For SPAs (React, Vue, Angular), you need a headless browser or SimpleCrawl.

How do I handle CAPTCHAs in JavaScript scraping?

CAPTCHAs require either manual solving, third-party services (2Captcha, Anti-Captcha), or using SimpleCrawl which includes built-in CAPTCHA solving.

Is JavaScript faster than Python for web scraping?

Node.js's event-driven architecture handles concurrent HTTP requests efficiently. For I/O-bound scraping (many HTTP requests), JavaScript is often faster than synchronous Python, though Python's async libraries and the Scrapy framework can match Node.js performance at scale.

How do I scrape a React or Next.js website?

React/Next.js apps require JavaScript rendering. Use Puppeteer to load the page, wait for React to mount, then extract data from the rendered DOM. Or use SimpleCrawl, which handles JS rendering automatically.

Ready to try SimpleCrawl?

We're building the simplest web scraping API for AI. Join the waitlist and get 500 free credits at launch.
