Web Scraping with TypeScript: Type-Safe Data Extraction (2026)
Build reliable web scrapers with TypeScript using typed schemas, Zod validation, Playwright, and Cheerio. Catch scraping bugs at compile time, not in production.
Web scraping with TypeScript adds type safety to data extraction, catching structural errors at compile time rather than in production. When a website changes its layout and your selectors break, TypeScript's type system forces you to handle missing data explicitly instead of silently passing undefined through your pipeline. This guide covers type-safe scraping patterns, runtime validation with Zod, and production-ready TypeScript scraper architectures.
Why TypeScript for Web Scraping?
TypeScript transforms web scraping from fragile scripts into maintainable software:
- Typed schemas — define the shape of extracted data and catch structural mismatches early
- Refactoring safety — rename a field and the compiler shows every location that needs updating
- Runtime validation — combine TypeScript types with Zod for compile-time AND runtime safety
- IDE support — autocomplete for extracted data properties, inline documentation, error highlighting
- Team collaboration — shared types make scraper output predictable across services
For JavaScript-only approaches, see our JavaScript guide or Node.js guide. For Python, check our Python guide.
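To make the first point concrete, here is a minimal sketch in plain TypeScript (no scraping library; the `Quote` type and `toQuote` helper are hypothetical): the declared return type forces every extraction path to account for missing data.

```typescript
// Hypothetical example: a typed record and a parser that must satisfy it.
interface Quote {
  text: string;
  author: string | null;
}

// Raw DOM text often arrives as string | undefined; the return type makes
// the compiler reject any branch that forgets the missing case.
function toQuote(rawText?: string, rawAuthor?: string): Quote {
  return {
    text: (rawText ?? "").trim(),
    author: rawAuthor?.trim() || null,
  };
}

console.log(toQuote(" Stay hungry. ")); // { text: 'Stay hungry.', author: null }
```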
Project Setup
mkdir ts-scraper && cd ts-scraper
npm init -y
npm install typescript tsx cheerio playwright zod
npm install -D @types/node
npx playwright install chromium
npx tsc --init
Configure tsconfig.json:
{
"compilerOptions": {
"target": "ES2022",
"module": "ESNext",
"moduleResolution": "bundler",
"strict": true,
"esModuleInterop": true,
"outDir": "dist",
"rootDir": "src",
"declaration": true
},
"include": ["src"]
}
Run scripts with: npx tsx src/index.ts
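Optionally, add npm scripts so the scraper can be run with `npm run scrape` (a sketch; assumes your entry point is `src/index.ts`):

```json
{
  "type": "module",
  "scripts": {
    "scrape": "tsx src/index.ts",
    "build": "tsc",
    "typecheck": "tsc --noEmit"
  }
}
```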
Defining Typed Schemas
The foundation of type-safe scraping is defining your data types upfront.
Basic Type Definitions
interface Product {
name: string;
price: number;
currency: string;
rating: number | null;
reviewCount: number;
url: string;
inStock: boolean;
}
interface ScrapeResult<T> {
data: T[];
url: string;
scrapedAt: string;
duration: number;
errors: string[];
}
Zod Schemas for Runtime Validation
TypeScript types disappear at runtime. Zod provides runtime validation that mirrors your TypeScript types:
import { z } from "zod";
const ProductSchema = z.object({
name: z.string().min(1),
price: z.number().positive(),
currency: z.string().length(3),
rating: z.number().min(0).max(5).nullable(),
reviewCount: z.number().int().nonnegative(),
url: z.string().url(),
inStock: z.boolean(),
});
type Product = z.infer<typeof ProductSchema>;
const ProductListSchema = z.array(ProductSchema);
function validateProducts(raw: unknown[]): Product[] {
const validated: Product[] = [];
const errors: string[] = [];
for (const item of raw) {
const result = ProductSchema.safeParse(item);
if (result.success) {
validated.push(result.data);
} else {
errors.push(
`Validation failed: ${result.error.issues.map((i) => i.message).join(", ")}`
);
}
}
if (errors.length > 0) {
console.warn(`${errors.length} items failed validation:`, errors);
}
return validated;
}
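Zod generates checks like these for you. For comparison, here is roughly what an equivalent hand-written type guard looks like, sketched in plain TypeScript against a hypothetical trimmed-down `MiniProduct` type:

```typescript
// Hand-rolled equivalent of a small slice of ProductSchema: a type guard
// that narrows `unknown` to a typed record at runtime.
interface MiniProduct {
  name: string;
  price: number;
  inStock: boolean;
}

function isMiniProduct(value: unknown): value is MiniProduct {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.name === "string" && v.name.length > 0 &&
    typeof v.price === "number" && v.price > 0 &&
    typeof v.inStock === "boolean"
  );
}

const raw: unknown = { name: "Widget", price: 9.99, inStock: true };
if (isMiniProduct(raw)) {
  // Inside this branch, `raw` is typed as MiniProduct.
  console.log(raw.name.toUpperCase()); // logs "WIDGET"
}
```

Multiply this by every field and every schema, and the appeal of `z.infer` is clear: one declaration drives both the compile-time type and the runtime check.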
Method 1: Cheerio with Type Safety
Typed Scraper Function
import * as cheerio from "cheerio";
interface ScraperConfig {
url: string;
headers?: Record<string, string>;
timeout?: number;
}
async function fetchAndParse(config: ScraperConfig): Promise<cheerio.CheerioAPI> {
const { url, headers = {}, timeout = 10000 } = config;
const controller = new AbortController();
const timer = setTimeout(() => controller.abort(), timeout);
try {
const response = await fetch(url, {
headers: {
"User-Agent":
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) " +
"AppleWebKit/537.36 Chrome/122.0.0.0 Safari/537.36",
...headers,
},
signal: controller.signal,
});
if (!response.ok) {
throw new Error(`HTTP ${response.status}: ${response.statusText}`);
}
const html = await response.text();
return cheerio.load(html);
} finally {
clearTimeout(timer);
}
}
async function scrapeProducts(url: string): Promise<Product[]> {
const $ = await fetchAndParse({ url });
const rawProducts: unknown[] = [];
$(".product-card").each((_, element) => {
const $el = $(element);
const priceText = $el.find(".price").text().replace(/[^0-9.]/g, "");
const ratingText = $el.find(".rating").text();
rawProducts.push({
name: $el.find("h3").text().trim(),
price: parseFloat(priceText) || 0,
currency: "USD",
rating: ratingText ? parseFloat(ratingText) : null,
reviewCount: parseInt($el.find(".reviews").text(), 10) || 0,
url: $el.find("a").attr("href") || "",
inStock: !$el.hasClass("out-of-stock"),
});
});
return validateProducts(rawProducts);
}
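The inline `.replace(/[^0-9.]/g, "")` price cleanup above can be factored into a small typed helper. A sketch (the `parsePrice` name is hypothetical; assumes dot-decimal, comma-thousands prices): returning `null` rather than `0` for unparseable text lets a `positive()` schema check flag the item instead of silently storing a fake price.

```typescript
// Hypothetical helper: parse price strings like "$1,299.99" into a number,
// returning null (not 0) when nothing numeric is found so validation can
// reject the item instead of accepting a bogus zero price.
function parsePrice(text: string): number | null {
  const cleaned = text.replace(/[^0-9.]/g, "");
  if (cleaned === "") return null;
  const value = Number.parseFloat(cleaned);
  return Number.isFinite(value) ? value : null;
}

console.log(parsePrice("$1,299.99")); // 1299.99
console.log(parsePrice("Call for price")); // null
```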
Generic Scraper Pattern
type Extractor<T> = ($: cheerio.CheerioAPI) => T[];
function createScraper<T extends z.ZodTypeAny>(schema: T) {
type OutputType = z.infer<T>;
return async function scrape(
url: string,
extractor: Extractor<unknown>
): Promise<ScrapeResult<OutputType>> {
const start = Date.now();
const errors: string[] = [];
const $ = await fetchAndParse({ url });
const rawItems = extractor($);
const data: OutputType[] = [];
for (const item of rawItems) {
const result = schema.safeParse(item);
if (result.success) {
data.push(result.data);
} else {
errors.push(result.error.message);
}
}
return {
data,
url,
scrapedAt: new Date().toISOString(),
duration: Date.now() - start,
errors,
};
};
}
// Usage
const scrapeProducts = createScraper(ProductSchema);
const result = await scrapeProducts(
"https://example.com/products",
($) => {
const items: unknown[] = [];
$(".product").each((_, el) => {
items.push({
name: $(el).find("h3").text().trim(),
price: parseFloat($(el).find(".price").text().replace("$", "")),
currency: "USD",
rating: null,
reviewCount: 0,
url: $(el).find("a").attr("href") ?? "",
inStock: true,
});
});
return items;
}
);
console.log(`Found ${result.data.length} valid products`);
console.log(`Errors: ${result.errors.length}`);
Method 2: Playwright with TypeScript
Type-Safe Page Evaluation
import type { Page } from "playwright";
interface PageData {
title: string;
description: string | null;
headings: string[];
links: Array<{ text: string; href: string }>;
}
async function extractPageData(page: Page): Promise<PageData> {
return page.evaluate((): PageData => {
return {
title: document.title,
description:
document.querySelector<HTMLMetaElement>('meta[name="description"]')
?.content ?? null,
headings: Array.from(document.querySelectorAll("h1, h2, h3")).map(
(el) => el.textContent?.trim() ?? ""
),
links: Array.from(document.querySelectorAll<HTMLAnchorElement>("a[href]"))
.slice(0, 50)
.map((a) => ({
text: a.textContent?.trim() ?? "",
href: a.href,
})),
};
});
}
Scraper Class Pattern
import { chromium, Browser, BrowserContext, Page } from "playwright";
import { z } from "zod";
abstract class BaseScraper<T extends z.ZodTypeAny> {
protected browser: Browser | null = null;
protected context: BrowserContext | null = null;
constructor(
protected schema: T,
protected options: {
headless?: boolean;
timeout?: number;
userAgent?: string;
} = {}
) {}
async init(): Promise<void> {
this.browser = await chromium.launch({
headless: this.options.headless ?? true,
});
this.context = await this.browser.newContext({
userAgent:
this.options.userAgent ??
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) " +
"AppleWebKit/537.36 Chrome/122.0.0.0 Safari/537.36",
viewport: { width: 1280, height: 720 },
});
}
abstract extract(page: Page): Promise<unknown[]>;
async scrape(url: string): Promise<z.infer<T>[]> {
if (!this.context) await this.init();
const page = await this.context!.newPage();
try {
await page.goto(url, {
waitUntil: "networkidle",
timeout: this.options.timeout ?? 15000,
});
const raw = await this.extract(page);
const validated: z.infer<T>[] = [];
for (const item of raw) {
const result = this.schema.safeParse(item);
if (result.success) validated.push(result.data);
}
return validated;
} finally {
await page.close();
}
}
async close(): Promise<void> {
if (this.browser) await this.browser.close();
}
}
// Concrete implementation
const JobSchema = z.object({
title: z.string(),
company: z.string(),
location: z.string(),
salary: z.string().nullable(),
});
class JobScraper extends BaseScraper<typeof JobSchema> {
constructor() {
super(JobSchema);
}
async extract(page: Page): Promise<unknown[]> {
return page.evaluate(() =>
Array.from(document.querySelectorAll(".job-card")).map((card) => ({
title: card.querySelector("h3")?.textContent?.trim(),
company: card.querySelector(".company")?.textContent?.trim(),
location: card.querySelector(".location")?.textContent?.trim(),
salary: card.querySelector(".salary")?.textContent?.trim() ?? null,
}))
);
}
}
// Usage
const scraper = new JobScraper();
const jobs = await scraper.scrape("https://example.com/jobs");
// jobs is typed as { title: string; company: string; location: string; salary: string | null }[]
console.log(jobs);
await scraper.close();
Method 3: SimpleCrawl with TypeScript
SimpleCrawl pairs perfectly with TypeScript for type-safe API-based scraping:
import { z } from "zod";
const SIMPLECRAWL_API_KEY = process.env.SIMPLECRAWL_API_KEY ?? "sc_your_api_key";
interface SimpleCrawlRequest {
url: string;
format: "markdown" | "extract";
schema?: Record<string, unknown>;
render_js?: boolean;
}
interface SimpleCrawlResponse {
url: string;
title: string;
markdown?: string;
data?: Record<string, unknown>;
credits_used: number;
}
async function simpleCrawl(request: SimpleCrawlRequest): Promise<SimpleCrawlResponse> {
const response = await fetch("https://api.simplecrawl.com/v1/scrape", {
method: "POST",
headers: {
Authorization: `Bearer ${SIMPLECRAWL_API_KEY}`,
"Content-Type": "application/json",
},
body: JSON.stringify(request),
});
if (!response.ok) {
throw new Error(`SimpleCrawl API error: ${response.status}`);
}
return response.json() as Promise<SimpleCrawlResponse>;
}
// Type-safe extraction with validation
async function extractAndValidate<T extends z.ZodTypeAny>(
url: string,
schema: T,
extractionSchema: Record<string, unknown>
): Promise<z.infer<T>> {
const response = await simpleCrawl({
url,
format: "extract",
schema: extractionSchema,
});
const result = schema.safeParse(response.data);
if (!result.success) {
throw new Error(`Validation failed: ${result.error.message}`);
}
return result.data;
}
// Usage
const CompanySchema = z.object({
name: z.string(),
employees: z.number(),
industry: z.string(),
headquarters: z.string(),
});
const company = await extractAndValidate(
"https://example.com/about",
CompanySchema,
{
name: "string",
employees: "number",
industry: "string",
headquarters: "string",
}
);
// company is fully typed: { name: string; employees: number; industry: string; headquarters: string }
console.log(`${company.name} — ${company.employees} employees`);
See pricing for API credits and limits.
Data Storage Patterns
Type-Safe JSON Output
import { writeFile } from "fs/promises";
async function saveResults<T>(data: T[], filename: string): Promise<void> {
await writeFile(filename, JSON.stringify(data, null, 2), "utf-8");
console.log(`Saved ${data.length} items to ${filename}`);
}
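For large scrapes, a newline-delimited JSON (NDJSON) variant avoids holding the whole result set in memory and loses at most one batch on a crash. A sketch (the helper names are hypothetical):

```typescript
import { appendFile, readFile } from "fs/promises";

// Append items as newline-delimited JSON: each line is one complete record,
// so results can be streamed to disk batch by batch.
async function appendResults<T>(items: T[], filename: string): Promise<void> {
  const lines = items.map((item) => JSON.stringify(item)).join("\n") + "\n";
  await appendFile(filename, lines, "utf-8");
}

// Read an NDJSON file back into typed records.
async function loadResults<T>(filename: string): Promise<T[]> {
  const text = await readFile(filename, "utf-8");
  return text
    .split("\n")
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as T);
}
```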
Typed Database Inserts
interface DBRecord {
id?: number;
data: Product;
scrapedAt: Date;
sourceUrl: string;
}
async function upsertProducts(products: Product[], sourceUrl: string): Promise<void> {
const records: DBRecord[] = products.map((product) => ({
data: product,
scrapedAt: new Date(),
sourceUrl,
}));
// Insert into your database of choice
console.log(`Upserting ${records.length} products`);
}
Error Handling Patterns
class ScrapingError extends Error {
constructor(
message: string,
public readonly url: string,
public readonly statusCode?: number,
public readonly cause?: Error
) {
super(message);
this.name = "ScrapingError";
}
}
type Result<T, E = ScrapingError> =
| { success: true; data: T }
| { success: false; error: E };
async function safeScrape<T>(
fn: () => Promise<T>
): Promise<Result<T>> {
try {
const data = await fn();
return { success: true, data };
} catch (error) {
if (error instanceof ScrapingError) {
return { success: false, error };
}
return {
success: false,
error: new ScrapingError(
error instanceof Error ? error.message : "Unknown error",
"unknown"
),
};
}
}
// Usage
const result = await safeScrape(() => scrapeProducts("https://example.com"));
if (result.success) {
console.log(`Got ${result.data.length} products`);
} else {
console.error(`Failed: ${result.error.message}`);
}
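The `Result` type also composes well with retries, which most production scrapers need for transient network failures. A sketch of a retry wrapper with linear backoff (the attempt count and delay are arbitrary defaults; the `Result` type is repeated here so the snippet stands alone):

```typescript
type Result<T, E = Error> =
  | { success: true; data: T }
  | { success: false; error: E };

// Retry an async operation with linear backoff, returning a Result instead
// of throwing, so callers keep a single error-handling path.
async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 3,
  delayMs = 1000
): Promise<Result<T>> {
  let lastError: Error = new Error("No attempts made");
  for (let i = 0; i < attempts; i++) {
    try {
      return { success: true, data: await fn() };
    } catch (error) {
      lastError = error instanceof Error ? error : new Error(String(error));
      if (i < attempts - 1) {
        // Wait 1x, 2x, 3x... the base delay between attempts.
        await new Promise<void>((r) => setTimeout(r, delayMs * (i + 1)));
      }
    }
  }
  return { success: false, error: lastError };
}
```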
Choosing the Right Approach
| Approach | Type Safety | JS Rendering | Scale | Complexity |
|---|---|---|---|---|
| Cheerio + Zod | Full | No | High | Medium |
| Playwright + Types | Full | Yes | Medium | Medium-High |
| SimpleCrawl + Zod | Full | Yes | Unlimited | Low |
For scraping specific sites, check our domain guides: Amazon, LinkedIn, Google. Compare options on our best web scraping APIs page.
FAQ
Is TypeScript slower than JavaScript for web scraping?
No. TypeScript compiles to JavaScript, so runtime performance is identical. The compile step adds a few seconds to development, but tsx makes this negligible.
Should I use Zod or io-ts for schema validation?
Zod is more popular, has a simpler API, and better TypeScript inference. io-ts is more theoretically rigorous. For web scraping, Zod's developer experience wins.
Can I use TypeScript with Scrapy?
Scrapy is Python-only. The closest TypeScript equivalent is Crawlee, which provides similar functionality with full TypeScript support. See our Node.js guide for Crawlee examples.
How do I handle websites that change their HTML structure?
TypeScript's type system catches breakage at compile time if your selectors return unexpected types. Zod's runtime validation catches breakage in production. Together, they provide the most resilient scraping architecture. SimpleCrawl's AI extraction adapts to layout changes automatically.
Is TypeScript overkill for simple scraping scripts?
For one-off scripts, plain JavaScript is fine. For anything you'll maintain, share with a team, or run in production, TypeScript pays for itself quickly through fewer runtime errors and better refactoring support.
Ready to try SimpleCrawl?
We're building the simplest web scraping API for AI. Join the waitlist and get 500 free credits at launch.
More guides
Web Scraping with Go: Colly, Goquery, and Beyond (2026)
Build fast, concurrent web scrapers with Go using Colly and Goquery. Learn high-performance data extraction patterns for production systems.
Web Scraping with JavaScript: Node.js Guide (2026)
Master web scraping with JavaScript using fetch, cheerio, and Puppeteer. Learn practical data extraction techniques for Node.js, plus how SimpleCrawl makes it effortless.
Web Scraping with Node.js: Complete Tutorial (2026)
Build powerful web scrapers with Node.js using Playwright, Crawlee, and async patterns. Learn advanced techniques for data extraction at scale with practical code examples.
Web Scraping with Python: The Complete Guide (2026)
Master web scraping with Python using requests, BeautifulSoup, Playwright, and Scrapy. Learn practical techniques for extracting data from any website, plus how SimpleCrawl simplifies everything.