Web Scraping with JavaScript: Node.js Guide (2026)
Master web scraping with JavaScript using fetch, cheerio, and Puppeteer. Learn practical data extraction techniques for Node.js, plus how SimpleCrawl makes it effortless.
Web scraping with JavaScript leverages the same language that powers the websites you're scraping. With Node.js, you get access to powerful HTTP clients, DOM parsers, and headless browser automation tools that make extracting web data straightforward. This guide covers every major JavaScript scraping approach — from lightweight fetch + cheerio to full browser automation with Puppeteer — with production-ready code examples.
Why JavaScript for Web Scraping?
JavaScript brings unique advantages to web scraping:
- Same language as the target — you already understand the DOM, CSS selectors, and how web pages work
- Async by default — Node.js's event loop makes concurrent HTTP requests natural
- Puppeteer integration — Chrome's official headless browser API is a first-class JavaScript tool
- NPM ecosystem — thousands of packages for parsing, crawling, and data processing
- Full-stack workflows — scrape data and serve it through an Express/Next.js API in the same language
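That async advantage is concrete: with `Promise.all`, slow operations overlap instead of queuing. A minimal sketch with simulated requests — the `delay` helper here stands in for real `fetch` calls:

```javascript
// Simulated "request": resolves with `value` after `ms` milliseconds.
const delay = (ms, value) =>
  new Promise((resolve) => setTimeout(() => resolve(value), ms));

// Three 50 ms tasks started together finish in ~50 ms, not ~150 ms.
const start = Date.now();
const results = await Promise.all([
  delay(50, "a"),
  delay(50, "b"),
  delay(50, "c"),
]);
const elapsed = Date.now() - start;

console.log(results); // [ 'a', 'b', 'c' ]
console.log(elapsed < 150); // true — the waits overlapped
```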
For Python alternatives, see our Python scraping guide. For type-safe scraping, check the TypeScript guide.
Setting Up Your Environment
```bash
mkdir my-scraper && cd my-scraper
npm init -y
npm install cheerio puppeteer
```
If you're using ES modules (recommended), add this to your `package.json`:
```json
{
  "type": "module"
}
```
Method 1: Fetch + Cheerio (Static Pages)
For server-rendered HTML pages that don't require JavaScript execution, fetch (built into Node.js 18+) and cheerio (jQuery-like DOM parser) are the fastest approach.
Basic Scraping
```javascript
import * as cheerio from "cheerio";

async function scrapeHackerNews() {
  const response = await fetch("https://news.ycombinator.com/");
  const html = await response.text();
  const $ = cheerio.load(html);

  const stories = [];
  $(".athing").each((i, element) => {
    const title = $(element).find(".titleline a").first().text();
    const url = $(element).find(".titleline a").first().attr("href");
    const rank = $(element).find(".rank").text().replace(".", "");
    stories.push({ rank: parseInt(rank, 10), title, url });
  });

  return stories;
}

const stories = await scrapeHackerNews();
stories.slice(0, 10).forEach((s) => {
  console.log(`${s.rank}. ${s.title}`);
  console.log(`   ${s.url}\n`);
});
```
Handling Headers and Cookies
```javascript
async function fetchWithHeaders(url) {
  const response = await fetch(url, {
    headers: {
      "User-Agent":
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) " +
        "AppleWebKit/537.36 Chrome/122.0.0.0 Safari/537.36",
      Accept: "text/html,application/xhtml+xml,application/xml;q=0.9",
      "Accept-Language": "en-US,en;q=0.9",
    },
  });

  if (!response.ok) {
    throw new Error(`HTTP ${response.status}: ${response.statusText}`);
  }
  return response.text();
}
```
Pagination
```javascript
async function scrapeAllPages(baseUrl, maxPages = 5) {
  const allItems = [];

  for (let page = 1; page <= maxPages; page++) {
    const url = `${baseUrl}?page=${page}`;
    const html = await fetchWithHeaders(url);
    const $ = cheerio.load(html);

    const items = [];
    $(".product-card").each((_, el) => {
      items.push({
        name: $(el).find("h3").text().trim(),
        price: $(el).find(".price").text().trim(),
        link: $(el).find("a").attr("href"),
      });
    });

    if (items.length === 0) break;
    allItems.push(...items);

    // Respect rate limits
    await new Promise((r) => setTimeout(r, 1500));
  }

  return allItems;
}
```
Concurrent Requests
```javascript
async function scrapeUrls(urls, concurrency = 5) {
  const results = [];
  const chunks = [];

  for (let i = 0; i < urls.length; i += concurrency) {
    chunks.push(urls.slice(i, i + concurrency));
  }

  for (const chunk of chunks) {
    const promises = chunk.map(async (url) => {
      try {
        const html = await fetchWithHeaders(url);
        const $ = cheerio.load(html);
        return {
          url,
          title: $("title").text(),
          h1: $("h1").first().text(),
          paragraphs: $("p").length,
        };
      } catch (error) {
        return { url, error: error.message };
      }
    });

    const batchResults = await Promise.all(promises);
    results.push(...batchResults);
    await new Promise((r) => setTimeout(r, 1000));
  }

  return results;
}
```
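One caveat with the chunked approach: each batch waits for its slowest request before the next batch starts. A small worker pool keeps a fixed number of tasks in flight the whole time instead. This is a generic sketch (no library involved); `tasks` is assumed to be an array of zero-argument async functions, and results come back in input order:

```javascript
// Worker pool: `concurrency` workers each pull the next task index
// until none remain, so finishing one request immediately starts another.
async function runPool(tasks, concurrency = 5) {
  const results = new Array(tasks.length);
  let next = 0;

  async function worker() {
    while (next < tasks.length) {
      const i = next++; // safe: index handout happens synchronously
      results[i] = await tasks[i]();
    }
  }

  const workers = Array.from(
    { length: Math.min(concurrency, tasks.length) },
    () => worker()
  );
  await Promise.all(workers);
  return results;
}

// Usage sketch with fake tasks; for real scraping, pass closures
// like () => fetchWithHeaders(url) instead.
const fake = (n) => () => new Promise((r) => setTimeout(() => r(n * 2), 10));
const doubled = await runPool([1, 2, 3, 4, 5].map(fake), 2);
console.log(doubled); // [ 2, 4, 6, 8, 10 ]
```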
Method 2: Puppeteer (JavaScript-Heavy Sites)
For single-page applications (React, Vue, Angular) that require JavaScript execution, Puppeteer automates a real Chrome browser.
Basic Puppeteer Scraping
```javascript
import puppeteer from "puppeteer";

async function scrapeSPA(url) {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  await page.setUserAgent(
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) " +
      "AppleWebKit/537.36 Chrome/122.0.0.0 Safari/537.36"
  );

  await page.goto(url, { waitUntil: "networkidle0" });

  const data = await page.evaluate(() => {
    const products = [];
    document.querySelectorAll(".product-card").forEach((card) => {
      products.push({
        name: card.querySelector("h3")?.textContent?.trim(),
        price: card.querySelector(".price")?.textContent?.trim(),
        image: card.querySelector("img")?.src,
      });
    });
    return products;
  });

  await browser.close();
  return data;
}
```
Waiting for Dynamic Content
```javascript
import puppeteer from "puppeteer";

async function scrapeWithWaits(url) {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto(url);

  // Wait for a specific selector
  await page.waitForSelector("[data-loaded='true']", { timeout: 10000 });

  // Wait for network requests to finish
  await page.waitForNetworkIdle({ idleTime: 500 });

  // Wait for specific text to appear
  await page.waitForFunction(
    () => document.body.innerText.includes("Results loaded"),
    { timeout: 10000 }
  );

  const content = await page.content();
  await browser.close();
  return content;
}
```
Handling Infinite Scroll
```javascript
async function scrapeInfiniteScroll(url, maxScrolls = 10) {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: "networkidle0" });

  let previousHeight = 0;
  for (let i = 0; i < maxScrolls; i++) {
    await page.evaluate("window.scrollTo(0, document.body.scrollHeight)");
    await new Promise((r) => setTimeout(r, 2000));

    const currentHeight = await page.evaluate("document.body.scrollHeight");
    if (currentHeight === previousHeight) break;
    previousHeight = currentHeight;
  }

  const items = await page.evaluate(() => {
    return Array.from(document.querySelectorAll(".item")).map((el) => ({
      title: el.querySelector("h3")?.textContent,
      description: el.querySelector("p")?.textContent,
    }));
  });

  await browser.close();
  return items;
}
```
Intercepting Network Requests
Sometimes the data you need is in API responses, not the DOM:
```javascript
async function interceptApiData(url) {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  const apiData = [];

  page.on("response", async (response) => {
    const reqUrl = response.url();
    if (reqUrl.includes("/api/products")) {
      try {
        const json = await response.json();
        apiData.push(json);
      } catch {
        // Not JSON, skip
      }
    }
  });

  await page.goto(url, { waitUntil: "networkidle0" });
  await browser.close();
  return apiData;
}
```
Method 3: SimpleCrawl API (Easiest)
All the above methods require managing browsers, proxies, and anti-bot measures. SimpleCrawl reduces this to a single fetch call:
```javascript
const SIMPLECRAWL_API_KEY = "sc_your_api_key";

async function scrape(url, format = "markdown") {
  const response = await fetch("https://api.simplecrawl.com/v1/scrape", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${SIMPLECRAWL_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ url, format }),
  });
  return response.json();
}

const result = await scrape("https://example.com/products");
console.log(result.markdown);
```
Structured Extraction
```javascript
async function extractStructured(url, schema) {
  const response = await fetch("https://api.simplecrawl.com/v1/scrape", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${SIMPLECRAWL_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ url, format: "extract", schema }),
  });
  return response.json();
}

const products = await extractStructured("https://example.com/products", {
  products: [
    {
      name: "string",
      price: "number",
      rating: "number",
      in_stock: "boolean",
    },
  ],
});
```
Batch Scraping with Promise.all
```javascript
async function batchScrape(urls) {
  const results = await Promise.all(
    urls.map((url) => scrape(url, "markdown"))
  );
  return results;
}
```
See the pricing page for API credits and rate limits.
Saving Scraped Data
To JSON
```javascript
import { writeFile } from "fs/promises";

await writeFile("data.json", JSON.stringify(data, null, 2));
```
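For long-running crawls, rewriting one big JSON file after every page gets expensive, and a crash loses everything since the last write. Appending one JSON object per line (NDJSON) is a common alternative; the file path below is illustrative:

```javascript
import { appendFile, readFile } from "fs/promises";
import { tmpdir } from "os";
import { join } from "path";

// One JSON object per line: appends are cheap, and a crash mid-crawl
// loses at most the item currently being written.
const outPath = join(tmpdir(), `items-${Date.now()}.ndjson`);

async function appendItem(item) {
  await appendFile(outPath, JSON.stringify(item) + "\n");
}

await appendItem({ name: "Widget", price: 9.99 });
await appendItem({ name: "Gadget", price: 19.99 });

// Read back: one JSON.parse per non-empty line.
const lines = (await readFile(outPath, "utf8")).trim().split("\n");
const items = lines.map((line) => JSON.parse(line));
console.log(items.length); // 2
```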
To CSV
```javascript
import { writeFile } from "fs/promises";

function toCSV(data) {
  if (!data.length) return "";
  const headers = Object.keys(data[0]).join(",");
  const rows = data.map((row) =>
    Object.values(row)
      .map((v) => `"${String(v).replace(/"/g, '""')}"`)
      .join(",")
  );
  return [headers, ...rows].join("\n");
}

await writeFile("data.csv", toCSV(products));
```
Avoiding Common Anti-Bot Measures
Rotating User Agents
```javascript
const USER_AGENTS = [
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 Chrome/122.0.0.0 Safari/537.36",
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/122.0.0.0 Safari/537.36",
  "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/122.0.0.0 Safari/537.36",
];

function randomUserAgent() {
  return USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)];
}
```
Rate Limiting with Backoff
```javascript
async function fetchWithRetry(url, maxRetries = 3) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const response = await fetch(url, {
      headers: { "User-Agent": randomUserAgent() },
    });

    if (response.status === 429) {
      // Exponential backoff with random jitter
      const wait = Math.pow(2, attempt) * 1000 + Math.random() * 1000;
      console.log(`Rate limited. Waiting ${(wait / 1000).toFixed(1)}s...`);
      await new Promise((r) => setTimeout(r, wait));
      continue;
    }

    return response;
  }
  throw new Error(`Failed after ${maxRetries} retries`);
}
```
Choosing the Right Approach
| Approach | Best For | JS Rendering | Speed | Complexity |
|---|---|---|---|---|
| fetch + cheerio | Static HTML | No | Fast | Low |
| Puppeteer | SPAs, dynamic sites | Yes | Slow | Medium |
| SimpleCrawl API | Everything | Yes | Fast | Very Low |
For scraping specific sites, check our domain guides: Amazon, Google, Reddit, LinkedIn. For comparisons, see our best web scraping APIs page.
FAQ
Should I use Puppeteer or Playwright for JavaScript scraping?
Both are excellent. Puppeteer is Chrome-only and maintained by the Chrome team. Playwright supports multiple browsers and has slightly better API ergonomics. For most scraping tasks, either works. See our Node.js guide for Playwright examples.
Can I scrape websites with JavaScript without a headless browser?
Yes, for server-rendered pages. Use fetch + cheerio for sites that serve HTML without requiring JavaScript execution. For SPAs (React, Vue, Angular), you need a headless browser or SimpleCrawl.
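A quick way to tell which camp a page falls into: fetch the raw HTML and check whether the content you want is already present. The substring check below is a deliberately crude sketch; in practice you'd load the HTML into cheerio and test your actual selector:

```javascript
// If the marker text appears in the raw HTML, the page is server-rendered
// enough for fetch + cheerio; an empty app shell means you need a browser.
function looksServerRendered(html, marker) {
  return html.includes(marker);
}

const ssrHtml =
  "<html><body><h1>Product list</h1><p>42 items</p></body></html>";
const spaShell =
  '<html><body><div id="root"></div><script src="/app.js"></script></body></html>';

console.log(looksServerRendered(ssrHtml, "Product list")); // true
console.log(looksServerRendered(spaShell, "Product list")); // false
```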
How do I handle CAPTCHAs in JavaScript scraping?
CAPTCHAs require either manual solving, third-party services (2Captcha, Anti-Captcha), or using SimpleCrawl which includes built-in CAPTCHA solving.
Is JavaScript faster than Python for web scraping?
Node.js's event-driven architecture handles concurrent HTTP requests efficiently. For I/O-bound scraping (many HTTP requests), JavaScript can be faster than synchronous Python. That said, Python's Scrapy framework is asynchronous under the hood and can match Node.js performance at scale.
How do I scrape a React or Next.js website?
React/Next.js apps require JavaScript rendering. Use Puppeteer to load the page, wait for React to mount, then extract data from the rendered DOM. Or use SimpleCrawl, which handles JS rendering automatically.
Ready to try SimpleCrawl?
We're building the simplest web scraping API for AI. Join the waitlist and get 500 free credits at launch.