Web Scraping with Node.js: Complete Tutorial (2026)
Build powerful web scrapers with Node.js using Playwright, Crawlee, and async patterns. Learn advanced techniques for data extraction at scale with practical code examples.
Web scraping with Node.js combines the power of server-side JavaScript with battle-tested scraping libraries like Playwright, Crawlee, and Cheerio. Node.js excels at concurrent I/O operations, making it ideal for scraping hundreds of pages simultaneously. This tutorial covers advanced Node.js scraping techniques — from building resilient crawlers to handling anti-bot measures — with production-grade code you can use immediately.
Why Node.js for Web Scraping?
Node.js offers distinct advantages for scraping at scale:
- Non-blocking I/O — handle hundreds of concurrent HTTP requests without threading complexity
- Playwright native support — Microsoft's browser automation library was built for Node.js first
- Crawlee framework — the most sophisticated open-source crawling framework runs on Node.js
- TypeScript support — catch scraping logic errors at compile time (see our TypeScript guide)
- Streaming — process large responses without loading everything into memory
For basic JavaScript scraping patterns, see our JavaScript guide. For Python alternatives, check our Python guide.
Project Setup
```bash
mkdir node-scraper && cd node-scraper
npm init -y
npm install playwright cheerio crawlee
npx playwright install chromium
```
Update package.json:
```json
{
  "type": "module",
  "scripts": {
    "scrape": "node src/index.js"
  }
}
```
Method 1: Playwright (Browser Automation)
Playwright is the gold standard for scraping JavaScript-heavy sites from Node.js.
Basic Page Scraping
```javascript
import { chromium } from "playwright";

async function scrapeWithPlaywright(url) {
  const browser = await chromium.launch({ headless: true });
  const context = await browser.newContext({
    userAgent:
      "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) " +
      "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
    viewport: { width: 1280, height: 720 },
  });
  const page = await context.newPage();
  await page.goto(url, { waitUntil: "networkidle" });

  const data = await page.evaluate(() => {
    return {
      title: document.title,
      h1: document.querySelector("h1")?.textContent?.trim(),
      meta: document.querySelector('meta[name="description"]')?.content,
      links: Array.from(document.querySelectorAll("a[href]")).map((a) => ({
        text: a.textContent?.trim(),
        href: a.href,
      })),
    };
  });

  await browser.close();
  return data;
}

const result = await scrapeWithPlaywright("https://example.com");
console.log(result);
```
Multi-Page Crawling with Context Reuse
```javascript
import { chromium } from "playwright";

async function crawlSite(startUrl, maxPages = 20) {
  const browser = await chromium.launch({ headless: true });
  const context = await browser.newContext();
  const visited = new Set();
  const queue = [startUrl];
  const results = [];

  while (queue.length > 0 && visited.size < maxPages) {
    const url = queue.shift();
    if (visited.has(url)) continue;
    visited.add(url);

    const page = await context.newPage();
    try {
      await page.goto(url, { waitUntil: "domcontentloaded", timeout: 15000 });
      await page.waitForTimeout(1000);

      const data = await page.evaluate((baseUrl) => {
        const links = Array.from(document.querySelectorAll("a[href]"))
          .map((a) => a.href)
          .filter((href) => href.startsWith(baseUrl));
        return {
          url: window.location.href,
          title: document.title,
          text: document.body?.innerText?.substring(0, 1000),
          outLinks: [...new Set(links)],
        };
      }, new URL(startUrl).origin);

      results.push(data);
      for (const link of data.outLinks) {
        if (!visited.has(link) && !queue.includes(link)) {
          queue.push(link);
        }
      }
    } catch (error) {
      console.error(`Error on ${url}: ${error.message}`);
    } finally {
      await page.close();
    }
  }

  await browser.close();
  return results;
}
```
Handling Authentication
```javascript
import { chromium } from "playwright";

async function scrapeWithAuth(loginUrl, targetUrl, credentials) {
  const browser = await chromium.launch({ headless: true });
  const context = await browser.newContext();
  const page = await context.newPage();

  await page.goto(loginUrl);
  await page.fill('input[name="email"]', credentials.email);
  await page.fill('input[name="password"]', credentials.password);
  // Start waiting for the navigation before clicking, to avoid a race
  await Promise.all([
    page.waitForNavigation(),
    page.click('button[type="submit"]'),
  ]);

  // Session cookies persist across pages in the same context
  await page.goto(targetUrl, { waitUntil: "networkidle" });

  const data = await page.evaluate(() => {
    // Extract authenticated content
    return {
      profile: document.querySelector(".profile-name")?.textContent,
      dashboard: document.querySelector(".dashboard-data")?.textContent,
    };
  });

  // Save session for later use
  const storageState = await context.storageState();
  await browser.close();
  return { data, storageState };
}
```
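The storageState returned above can be written to disk and fed back into a fresh context on later runs, skipping the login flow entirely. A minimal sketch, assuming a `session.json` path of your choosing (the file name and helper names are illustrative):

```javascript
import { readFile, writeFile } from "fs/promises";

// Persist a Playwright storageState (cookies + localStorage) to disk
async function saveSession(storageState, path = "session.json") {
  await writeFile(path, JSON.stringify(storageState, null, 2));
}

async function loadSession(path = "session.json") {
  return JSON.parse(await readFile(path, "utf8"));
}

// Later runs restore the saved session instead of logging in again
async function scrapeWithSavedSession(targetUrl, sessionPath = "session.json") {
  const { chromium } = await import("playwright");
  const browser = await chromium.launch({ headless: true });
  // newContext accepts the saved state directly and restores the session
  const context = await browser.newContext({
    storageState: await loadSession(sessionPath),
  });
  const page = await context.newPage();
  await page.goto(targetUrl, { waitUntil: "networkidle" });
  const title = await page.title();
  await browser.close();
  return title;
}
```

Saved sessions eventually expire server-side, so treat a redirect back to the login page as a signal to re-run the full auth flow.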
Stealth Mode
To avoid bot detection, configure the browser context carefully:
```javascript
import { chromium } from "playwright";

async function stealthBrowser() {
  const browser = await chromium.launch({
    headless: true,
    args: [
      "--disable-blink-features=AutomationControlled",
      "--no-sandbox",
      "--disable-dev-shm-usage",
    ],
  });
  const context = await browser.newContext({
    userAgent:
      "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) " +
      "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
    viewport: { width: 1280, height: 720 },
    locale: "en-US",
    timezoneId: "America/New_York",
    geolocation: { longitude: -73.935242, latitude: 40.73061 },
    permissions: ["geolocation"],
  });

  // Override navigator properties before any page script runs
  await context.addInitScript(() => {
    Object.defineProperty(navigator, "webdriver", { get: () => undefined });
  });

  return { browser, context };
}
```
Method 2: Crawlee (Production Crawling Framework)
Crawlee (from the Apify team) is the most feature-rich crawling framework for Node.js, with built-in request queuing, auto-scaling, proxy rotation, and session management.
Basic Crawlee Spider
```javascript
import { PlaywrightCrawler, Dataset } from "crawlee";

const crawler = new PlaywrightCrawler({
  maxConcurrency: 5,
  maxRequestsPerMinute: 60,
  async requestHandler({ page, request, enqueueLinks }) {
    const title = await page.title();
    const h1 = await page.textContent("h1");
    const products = await page.evaluate(() =>
      Array.from(document.querySelectorAll(".product")).map((el) => ({
        name: el.querySelector("h3")?.textContent,
        price: el.querySelector(".price")?.textContent,
      }))
    );

    await Dataset.pushData({
      url: request.url,
      title,
      h1,
      products,
    });

    // Automatically discover and enqueue links
    await enqueueLinks({
      globs: ["https://example.com/products/**"],
    });
  },
  // Crawlee passes the error as a second argument, not on the context
  failedRequestHandler({ request }, error) {
    console.error(`Failed: ${request.url} — ${error.message}`);
  },
});

await crawler.run(["https://example.com/products"]);

const dataset = await Dataset.open();
const data = await dataset.getData();
console.log(`Scraped ${data.items.length} pages`);
```
Crawlee with Proxy Rotation
```javascript
import { PlaywrightCrawler, ProxyConfiguration } from "crawlee";

const proxyConfiguration = new ProxyConfiguration({
  proxyUrls: [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
  ],
});

const crawler = new PlaywrightCrawler({
  proxyConfiguration,
  maxConcurrency: 3,
  sessionPoolOptions: {
    maxPoolSize: 10,
    sessionOptions: {
      maxUsageCount: 50,
    },
  },
  async requestHandler({ page, request }) {
    const data = await page.evaluate(() => ({
      title: document.title,
      content: document.body.innerText.substring(0, 500),
    }));
    console.log(`${request.url}: ${data.title}`);
  },
});

await crawler.run(["https://example.com"]);
```
Crawlee with Cheerio (Lightweight)
For static pages, Crawlee's CheerioCrawler skips browser overhead:
```javascript
import { CheerioCrawler, Dataset } from "crawlee";

const crawler = new CheerioCrawler({
  maxConcurrency: 10,
  maxRequestsPerMinute: 120,
  async requestHandler({ $, request, enqueueLinks }) {
    const title = $("title").text();
    const articles = [];
    $("article").each((_, el) => {
      articles.push({
        title: $(el).find("h2").text().trim(),
        excerpt: $(el).find("p").first().text().trim(),
        link: $(el).find("a").attr("href"),
      });
    });

    await Dataset.pushData({ url: request.url, title, articles });

    await enqueueLinks({
      globs: ["https://example.com/blog/**"],
    });
  },
});

await crawler.run(["https://example.com/blog"]);
```
Method 3: SimpleCrawl API (Zero Infrastructure)
Skip browser management, proxy rotation, and CAPTCHA handling entirely:
```javascript
const API_KEY = "sc_your_api_key";

async function simpleCrawl(url, format = "markdown") {
  const response = await fetch("https://api.simplecrawl.com/v1/scrape", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ url, format }),
  });
  if (!response.ok) {
    throw new Error(`SimpleCrawl error: ${response.status}`);
  }
  return response.json();
}

// Get markdown from any page
const page = await simpleCrawl("https://example.com");
console.log(page.markdown);

// Extract structured data
const products = await simpleCrawl("https://example.com/products");
```
Batch Scraping with Concurrency Control
```javascript
async function batchScrape(urls, concurrency = 5) {
  const results = [];
  for (let i = 0; i < urls.length; i += concurrency) {
    const batch = urls.slice(i, i + concurrency);
    const batchResults = await Promise.all(
      batch.map((url) =>
        simpleCrawl(url).catch((e) => ({ url, error: e.message }))
      )
    );
    results.push(...batchResults);
  }
  return results;
}
```
Check the pricing page for credits and rate limits.
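When a request does hit a rate limit, back off and retry rather than failing the whole batch. A generic wrapper you can put around any API call; the retry count and delays here are arbitrary defaults, so tune them to your plan's limits:

```javascript
// Retry an async function with exponential backoff and jitter.
// Delays grow as baseDelayMs, 2x, 4x, ... plus a small random offset.
async function withRetry(fn, { retries = 3, baseDelayMs = 500 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      if (attempt === retries) break;
      const delay = baseDelayMs * 2 ** attempt + Math.random() * 100;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}

// Usage: const page = await withRetry(() => simpleCrawl(url));
```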
Advanced Patterns
Streaming Large Responses
```javascript
import { createWriteStream } from "fs";
import { Readable } from "stream";
import { pipeline } from "stream/promises";

async function streamToFile(url, outputPath) {
  const response = await fetch(url);
  if (!response.ok) throw new Error(`HTTP ${response.status}`);
  // fetch returns a web ReadableStream; convert it to a Node stream
  // so pipeline handles backpressure and cleanup
  await pipeline(Readable.fromWeb(response.body), createWriteStream(outputPath));
}
```
Worker Threads for CPU-Intensive Parsing
```javascript
import { Worker, isMainThread, parentPort, workerData } from "worker_threads";
import * as cheerio from "cheerio";

if (isMainThread) {
  async function parseInWorker(html) {
    return new Promise((resolve, reject) => {
      // Re-run this same module as a worker, passing the HTML in
      const worker = new Worker(new URL(import.meta.url), {
        workerData: { html },
      });
      worker.on("message", resolve);
      worker.on("error", reject);
    });
  }

  const html = await (await fetch("https://example.com")).text();
  const data = await parseInWorker(html);
  console.log(data);
} else {
  // Worker thread: parse off the main thread, then exit
  const $ = cheerio.load(workerData.html);
  const links = [];
  $("a[href]").each((_, el) => links.push($(el).attr("href")));
  parentPort.postMessage({ linkCount: links.length, links: links.slice(0, 20) });
}
```
Graceful Shutdown
```javascript
import { chromium } from "playwright";

let browser;

// Close the browser cleanly on Ctrl+C or a container stop signal
process.on("SIGINT", async () => {
  console.log("\nShutting down gracefully...");
  if (browser) await browser.close();
  process.exit(0);
});
process.on("SIGTERM", async () => {
  if (browser) await browser.close();
  process.exit(0);
});

browser = await chromium.launch({ headless: true });
// ... scraping logic
```
Choosing the Right Approach
| Approach | Best For | Scale | JS Rendering | Anti-Bot |
|---|---|---|---|---|
| fetch + cheerio | Static pages | High | No | Manual |
| Playwright | SPAs, auth flows | Medium | Yes | Some |
| Crawlee | Site-wide crawls | Very High | Yes | Built-in |
| SimpleCrawl API | Everything | Unlimited | Yes | Built-in |
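The fetch + cheerio row is the one approach the methods above don't show. A minimal sketch for static pages; the selectors are illustrative, and cheerio is loaded inside the function (it's installed in the setup step):

```javascript
// Resolve a possibly-relative href against the page URL (returns
// null for values the URL parser rejects)
function absolutize(href, base) {
  try {
    return new URL(href, base).href;
  } catch {
    return null;
  }
}

// Fetch HTML and parse it without a browser — fast, but no JS rendering
async function scrapeStatic(url) {
  const cheerio = await import("cheerio");
  const response = await fetch(url);
  if (!response.ok) throw new Error(`HTTP ${response.status}`);
  const $ = cheerio.load(await response.text());
  return {
    title: $("title").text().trim(),
    headings: $("h1, h2")
      .map((_, el) => $(el).text().trim())
      .get(),
    links: $("a[href]")
      .map((_, el) => absolutize($(el).attr("href"), url))
      .get()
      .filter(Boolean),
  };
}
```

If the fields you need come back empty, the page is likely rendered client-side, which is the signal to move up a row to Playwright or Crawlee.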
For scraping specific sites, check our domain guides: Amazon, LinkedIn, Google, Reddit. Compare all approaches on our best web scraping APIs page.
FAQ
Should I use Puppeteer or Playwright with Node.js?
Playwright is recommended in 2026. It supports Chromium, Firefox, and WebKit, has better auto-waiting, and a more ergonomic API. Puppeteer is Chrome-focused but remains well-maintained.
How many pages can Node.js scrape concurrently?
With fetch + cheerio, Node.js handles 50-100 concurrent requests efficiently. Playwright browser instances require more memory — typically 5-10 concurrent pages per GB of RAM. SimpleCrawl scales independently of your infrastructure.
How do I handle memory leaks in long-running scrapers?
Close pages and contexts after use, limit concurrent browser instances, and monitor memory with process.memoryUsage(). Crawlee handles resource management automatically.
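To make the process.memoryUsage() advice concrete, here is a small monitor you can drop into a long-running scraper; the interval and threshold are arbitrary defaults:

```javascript
// Snapshot current memory usage in whole megabytes
function memorySnapshot() {
  const { rss, heapUsed, heapTotal } = process.memoryUsage();
  const mb = (bytes) => Math.round(bytes / 1024 / 1024);
  return { rssMB: mb(rss), heapUsedMB: mb(heapUsed), heapTotalMB: mb(heapTotal) };
}

// Log memory periodically; warn when RSS crosses a threshold
function startMemoryMonitor(intervalMs = 30000, thresholdMB = 1024) {
  const timer = setInterval(() => {
    const snap = memorySnapshot();
    console.log(`[mem] rss=${snap.rssMB}MB heap=${snap.heapUsedMB}/${snap.heapTotalMB}MB`);
    if (snap.rssMB > thresholdMB) {
      console.warn("[mem] threshold exceeded — consider recycling the browser");
    }
  }, intervalMs);
  timer.unref(); // don't keep the process alive just for monitoring
  return timer;
}
```

A steadily climbing heap across crawl iterations usually means pages or contexts are not being closed.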
Is Crawlee better than Scrapy?
Crawlee and Scrapy serve similar purposes in their respective ecosystems. Crawlee has better built-in browser automation and session management. Scrapy has a larger community and more middleware. Both are excellent choices — pick based on your language preference.
How do I deploy a Node.js scraper?
For periodic jobs, use cron on a VPS or cloud functions (AWS Lambda, Vercel). For always-on crawlers, deploy to a container service (Docker + ECS/GKE). Or use SimpleCrawl and skip deployment entirely.
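For the cron route, an illustrative crontab entry that runs the scraper every six hours; the install path and log location are assumptions for your own setup:

```shell
# m h dom mon dow  command — run the scraper every 6 hours, append output to a log
0 */6 * * * cd /opt/node-scraper && /usr/bin/node src/index.js >> /var/log/scraper.log 2>&1
```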
Ready to try SimpleCrawl?
We're building the simplest web scraping API for AI. Join the waitlist and get 500 free credits at launch.
More guides
Web Scraping with Go: Colly, Goquery, and Beyond (2026)
Build fast, concurrent web scrapers with Go using Colly and Goquery. Learn high-performance data extraction patterns for production systems.
Web Scraping with JavaScript: Node.js Guide (2026)
Master web scraping with JavaScript using fetch, cheerio, and Puppeteer. Learn practical data extraction techniques for Node.js, plus how SimpleCrawl makes it effortless.
Web Scraping with Python: The Complete Guide (2026)
Master web scraping with Python using requests, BeautifulSoup, Playwright, and Scrapy. Learn practical techniques for extracting data from any website, plus how SimpleCrawl simplifies everything.
Web Scraping with TypeScript: Type-Safe Data Extraction (2026)
Build reliable web scrapers with TypeScript using typed schemas, Zod validation, Playwright, and Cheerio. Catch scraping bugs at compile time, not in production.