SimpleCrawl

Web Scraping with Node.js: Complete Tutorial (2026)

Build powerful web scrapers with Node.js using Playwright, Crawlee, and async patterns. Learn advanced techniques for data extraction at scale with practical code examples.

8 min read
node.js · tutorial · web scraping · playwright · crawlee

Web scraping with Node.js combines the power of server-side JavaScript with battle-tested scraping libraries like Playwright, Crawlee, and Cheerio. Node.js excels at concurrent I/O operations, making it ideal for scraping hundreds of pages simultaneously. This tutorial covers advanced Node.js scraping techniques — from building resilient crawlers to handling anti-bot measures — with production-grade code you can use immediately.

Why Node.js for Web Scraping?

Node.js offers distinct advantages for scraping at scale:

  • Non-blocking I/O — handle hundreds of concurrent HTTP requests without threading complexity
  • Playwright native support — Microsoft's browser automation library was built for Node.js first
  • Crawlee framework — the most sophisticated open-source crawling framework runs on Node.js
  • TypeScript support — catch scraping logic errors at compile time (see our TypeScript guide)
  • Streaming — process large responses without loading everything into memory

For basic JavaScript scraping patterns, see our JavaScript guide. For Python alternatives, check our Python guide.

Project Setup

mkdir node-scraper && cd node-scraper
npm init -y
npm install playwright cheerio crawlee
npx playwright install chromium

Update package.json:

{
  "type": "module",
  "scripts": {
    "scrape": "node src/index.js"
  }
}

Method 1: Playwright (Browser Automation)

Playwright is the gold standard for scraping JavaScript-heavy sites from Node.js.

Basic Page Scraping

import { chromium } from "playwright";

async function scrapeWithPlaywright(url) {
  const browser = await chromium.launch({ headless: true });
  const context = await browser.newContext({
    userAgent:
      "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) " +
      "AppleWebKit/537.36 Chrome/122.0.0.0 Safari/537.36",
    viewport: { width: 1280, height: 720 },
  });

  const page = await context.newPage();
  await page.goto(url, { waitUntil: "networkidle" });

  const data = await page.evaluate(() => {
    return {
      title: document.title,
      h1: document.querySelector("h1")?.textContent?.trim(),
      meta: document.querySelector('meta[name="description"]')?.content,
      links: Array.from(document.querySelectorAll("a[href]")).map((a) => ({
        text: a.textContent?.trim(),
        href: a.href,
      })),
    };
  });

  await browser.close();
  return data;
}

const result = await scrapeWithPlaywright("https://example.com");
console.log(result);

Multi-Page Crawling with Context Reuse

import { chromium } from "playwright";

async function crawlSite(startUrl, maxPages = 20) {
  const browser = await chromium.launch({ headless: true });
  const context = await browser.newContext();

  const visited = new Set();
  const queue = [startUrl];
  const results = [];

  while (queue.length > 0 && visited.size < maxPages) {
    const url = queue.shift();
    if (visited.has(url)) continue;
    visited.add(url);

    const page = await context.newPage();

    try {
      await page.goto(url, { waitUntil: "domcontentloaded", timeout: 15000 });
      await page.waitForTimeout(1000);

      const data = await page.evaluate((baseUrl) => {
        const links = Array.from(document.querySelectorAll("a[href]"))
          .map((a) => a.href)
          .filter((href) => href.startsWith(baseUrl));

        return {
          url: window.location.href,
          title: document.title,
          text: document.body?.innerText?.substring(0, 1000),
          outLinks: [...new Set(links)],
        };
      }, new URL(startUrl).origin);

      results.push(data);

      for (const link of data.outLinks) {
        if (!visited.has(link) && !queue.includes(link)) {
          queue.push(link);
        }
      }
    } catch (error) {
      console.error(`Error on ${url}: ${error.message}`);
    } finally {
      await page.close();
    }
  }

  await browser.close();
  return results;
}
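One gap in the crawler above: it treats `https://example.com/page`, `https://example.com/page/`, and `https://example.com/page#section` as three distinct URLs, so the same document can be fetched repeatedly. A small normalization helper keeps the visited set tight (the `normalizeUrl` name and the exact rules are illustrative, not part of the crawler above):

```javascript
// Hypothetical helper: canonicalize a URL before checking/adding it to `visited`.
// Strips the fragment and a trailing slash so equivalent URLs compare equal.
function normalizeUrl(rawUrl) {
  const u = new URL(rawUrl);
  u.hash = ""; // #section fragments point at the same document
  if (u.pathname.endsWith("/") && u.pathname !== "/") {
    u.pathname = u.pathname.slice(0, -1); // treat /page/ and /page as one URL
  }
  return u.toString();
}
```

In `crawlSite`, run each discovered link through `normalizeUrl` before the `visited.has(link)` check.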

Handling Authentication

import { chromium } from "playwright";

async function scrapeWithAuth(loginUrl, targetUrl, credentials) {
  const browser = await chromium.launch({ headless: true });
  const context = await browser.newContext();
  const page = await context.newPage();

  await page.goto(loginUrl);
  await page.fill('input[name="email"]', credentials.email);
  await page.fill('input[name="password"]', credentials.password);
  // Start waiting for navigation before clicking, to avoid a race
  // where a fast redirect completes before waitForNavigation is called
  await Promise.all([
    page.waitForNavigation(),
    page.click('button[type="submit"]'),
  ]);

  // Session cookies persist across pages in the same context
  await page.goto(targetUrl, { waitUntil: "networkidle" });

  const data = await page.evaluate(() => {
    // Extract authenticated content
    return {
      profile: document.querySelector(".profile-name")?.textContent,
      dashboard: document.querySelector(".dashboard-data")?.textContent,
    };
  });

  // Save session for later use
  const storageState = await context.storageState();

  await browser.close();
  return { data, storageState };
}
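The `storageState` returned above is plain JSON, so it can be written to disk and passed to `browser.newContext({ storageState })` on the next run to skip the login flow entirely. A minimal sketch (the helper names and file path are assumptions, not part of the function above):

```javascript
import { writeFile, readFile } from "fs/promises";

// Persist a Playwright storage state (cookies + localStorage) as JSON.
async function saveSession(storageState, path = "session.json") {
  await writeFile(path, JSON.stringify(storageState, null, 2));
}

// Load a previously saved state; returns null if the file is missing or invalid.
async function loadSession(path = "session.json") {
  try {
    return JSON.parse(await readFile(path, "utf8"));
  } catch {
    return null;
  }
}

// Later:
// const state = await loadSession();
// const context = await browser.newContext(state ? { storageState: state } : {});
```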

Stealth Mode

To avoid bot detection, configure the browser context carefully:

import { chromium } from "playwright";

async function stealthBrowser() {
  const browser = await chromium.launch({
    headless: true,
    args: [
      "--disable-blink-features=AutomationControlled",
      "--no-sandbox",
      "--disable-dev-shm-usage",
    ],
  });

  const context = await browser.newContext({
    userAgent:
      "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) " +
      "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
    viewport: { width: 1280, height: 720 },
    locale: "en-US",
    timezoneId: "America/New_York",
    geolocation: { longitude: -73.935242, latitude: 40.73061 },
    permissions: ["geolocation"],
  });

  // Override navigator properties
  await context.addInitScript(() => {
    Object.defineProperty(navigator, "webdriver", { get: () => undefined });
  });

  return { browser, context };
}

Method 2: Crawlee (Production Crawling Framework)

Crawlee (from the Apify team) is the most feature-rich crawling framework for Node.js, with built-in request queuing, auto-scaling, proxy rotation, and session management.

Basic Crawlee Spider

import { PlaywrightCrawler, Dataset } from "crawlee";

const crawler = new PlaywrightCrawler({
  maxConcurrency: 5,
  maxRequestsPerMinute: 60,

  async requestHandler({ page, request, enqueueLinks }) {
    const title = await page.title();
    const h1 = await page.textContent("h1");

    const products = await page.evaluate(() =>
      Array.from(document.querySelectorAll(".product")).map((el) => ({
        name: el.querySelector("h3")?.textContent,
        price: el.querySelector(".price")?.textContent,
      }))
    );

    await Dataset.pushData({
      url: request.url,
      title,
      h1,
      products,
    });

    // Automatically discover and enqueue links
    await enqueueLinks({
      globs: ["https://example.com/products/**"],
    });
  },

  failedRequestHandler({ request }, error) {
    console.error(`Failed: ${request.url} — ${error.message}`);
  },
});

await crawler.run(["https://example.com/products"]);

const dataset = await Dataset.open();
const data = await dataset.getData();
console.log(`Scraped ${data.items.length} pages`);

Crawlee with Proxy Rotation

import { PlaywrightCrawler, ProxyConfiguration } from "crawlee";

const proxyConfiguration = new ProxyConfiguration({
  proxyUrls: [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
  ],
});

const crawler = new PlaywrightCrawler({
  proxyConfiguration,
  maxConcurrency: 3,
  sessionPoolOptions: {
    maxPoolSize: 10,
    sessionOptions: {
      maxUsageCount: 50,
    },
  },

  async requestHandler({ page, request }) {
    const data = await page.evaluate(() => ({
      title: document.title,
      content: document.body.innerText.substring(0, 500),
    }));

    console.log(`${request.url}: ${data.title}`);
  },
});

await crawler.run(["https://example.com"]);

Crawlee with Cheerio (Lightweight)

For static pages, Crawlee's CheerioCrawler skips browser overhead:

import { CheerioCrawler, Dataset } from "crawlee";

const crawler = new CheerioCrawler({
  maxConcurrency: 10,
  maxRequestsPerMinute: 120,

  async requestHandler({ $, request, enqueueLinks }) {
    const title = $("title").text();
    const articles = [];

    $("article").each((_, el) => {
      articles.push({
        title: $(el).find("h2").text().trim(),
        excerpt: $(el).find("p").first().text().trim(),
        link: $(el).find("a").attr("href"),
      });
    });

    await Dataset.pushData({ url: request.url, title, articles });

    await enqueueLinks({
      globs: ["https://example.com/blog/**"],
    });
  },
});

await crawler.run(["https://example.com/blog"]);

Method 3: SimpleCrawl API (Zero Infrastructure)

Skip browser management, proxy rotation, and CAPTCHA handling entirely:

const API_KEY = "sc_your_api_key";

async function simpleCrawl(url, format = "markdown") {
  const response = await fetch("https://api.simplecrawl.com/v1/scrape", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ url, format }),
  });

  if (!response.ok) {
    throw new Error(`SimpleCrawl error: ${response.status}`);
  }

  return response.json();
}

// Get markdown from any page
const page = await simpleCrawl("https://example.com");
console.log(page.markdown);

// Scrape a product listing page (format defaults to markdown)
const products = await simpleCrawl("https://example.com/products");

Batch Scraping with Concurrency Control

async function batchScrape(urls, concurrency = 5) {
  const results = [];

  for (let i = 0; i < urls.length; i += concurrency) {
    const batch = urls.slice(i, i + concurrency);
    const batchResults = await Promise.all(
      batch.map((url) => simpleCrawl(url).catch((e) => ({ url, error: e.message })))
    );
    results.push(...batchResults);
  }

  return results;
}
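One caveat with the chunked `Promise.all` approach above: each batch waits for its slowest request before the next batch starts. A worker-pool sketch keeps all slots busy instead (the `mapPool` name is illustrative):

```javascript
// Run `fn` over `items` with at most `limit` calls in flight at once.
// Unlike chunked Promise.all, a finished worker immediately claims the next item.
async function mapPool(items, fn, limit = 5) {
  const results = new Array(items.length);
  let next = 0;

  async function worker() {
    while (next < items.length) {
      const i = next++; // index claim is synchronous, so workers never collide
      results[i] = await fn(items[i], i).catch((e) => ({ error: e.message }));
    }
  }

  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, worker));
  return results; // same order as `items`
}

// const results = await mapPool(urls, (url) => simpleCrawl(url), 5);
```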

Check the pricing page for credits and rate limits.

Advanced Patterns

Streaming Large Responses

import { createWriteStream } from "fs";
import { Readable } from "stream";
import { pipeline } from "stream/promises";

async function streamToFile(url, outputPath) {
  const response = await fetch(url);
  if (!response.ok) throw new Error(`HTTP ${response.status}`);

  // fetch() returns a web ReadableStream; convert it to a Node stream for pipeline
  await pipeline(Readable.fromWeb(response.body), createWriteStream(outputPath));
}

Worker Threads for CPU-Intensive Parsing

import { Worker, isMainThread, parentPort, workerData } from "worker_threads";
import * as cheerio from "cheerio";

if (isMainThread) {
  async function parseInWorker(html) {
    return new Promise((resolve, reject) => {
      const worker = new Worker(new URL(import.meta.url), {
        workerData: { html },
      });
      worker.on("message", resolve);
      worker.on("error", reject);
    });
  }

  const html = await (await fetch("https://example.com")).text();
  const data = await parseInWorker(html);
  console.log(data);
} else {
  const $ = cheerio.load(workerData.html);
  const links = [];
  $("a[href]").each((_, el) => links.push($(el).attr("href")));
  parentPort.postMessage({ linkCount: links.length, links: links.slice(0, 20) });
}

Graceful Shutdown

import { chromium } from "playwright";

let browser;

process.on("SIGINT", async () => {
  console.log("\nShutting down gracefully...");
  if (browser) await browser.close();
  process.exit(0);
});

process.on("SIGTERM", async () => {
  if (browser) await browser.close();
  process.exit(0);
});

browser = await chromium.launch({ headless: true });
// ... scraping logic

Choosing the Right Approach

Approach        | Best For          | Scale     | JS Rendering | Anti-Bot
fetch + cheerio | Static pages      | High      | No           | Manual
Playwright      | SPAs, auth flows  | Medium    | Yes          | Some
Crawlee         | Site-wide crawls  | Very High | Yes          | Built-in
SimpleCrawl API | Everything        | Unlimited | Yes          | Built-in

For scraping specific sites, check our domain guides: Amazon, LinkedIn, Google, Reddit. Compare all approaches on our best web scraping APIs page.

FAQ

Should I use Puppeteer or Playwright with Node.js?

Playwright is recommended in 2026. It supports Chromium, Firefox, and WebKit, has better auto-waiting, and the API is more ergonomic. Puppeteer is Chrome-only but remains well-maintained.

How many pages can Node.js scrape concurrently?

With fetch + cheerio, Node.js handles 50-100 concurrent requests efficiently. Playwright browser instances require more memory — typically 5-10 concurrent pages per GB of RAM. SimpleCrawl scales independently of your infrastructure.

How do I handle memory leaks in long-running scrapers?

Close pages and contexts after use, limit concurrent browser instances, and monitor memory with process.memoryUsage(). Crawlee handles resource management automatically.
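A small helper makes `process.memoryUsage()` readable for periodic logging in a long-running crawler (the helper name and threshold are illustrative):

```javascript
// Snapshot current memory usage in whole megabytes.
function memorySnapshot() {
  const { rss, heapUsed, heapTotal } = process.memoryUsage();
  const mb = (bytes) => Math.round(bytes / 1024 / 1024);
  return { rssMB: mb(rss), heapUsedMB: mb(heapUsed), heapTotalMB: mb(heapTotal) };
}

// Log every 30s and warn when the heap crosses a soft limit:
// setInterval(() => {
//   const m = memorySnapshot();
//   if (m.heapUsedMB > 1024) console.warn("High heap usage:", m);
// }, 30_000);
```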

Is Crawlee better than Scrapy?

Crawlee and Scrapy serve similar purposes in their respective ecosystems. Crawlee has better built-in browser automation and session management. Scrapy has a larger community and more middleware. Both are excellent choices — pick based on your language preference.

How do I deploy a Node.js scraper?

For periodic jobs, use cron on a VPS or cloud functions (AWS Lambda, Vercel). For always-on crawlers, deploy to a container service (Docker + ECS/GKE). Or use SimpleCrawl and skip deployment entirely.
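For the cron route, a crontab entry along these lines runs the scraper every six hours (the install path and log file are assumptions based on the project layout above):

```
# m h dom mon dow  command
0 */6 * * * cd /opt/node-scraper && /usr/bin/node src/index.js >> /var/log/scraper.log 2>&1
```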

Ready to try SimpleCrawl?

We're building the simplest web scraping API for AI. Join the waitlist and get 500 free credits at launch.
