SimpleCrawl

Web Scraping with TypeScript: Type-Safe Data Extraction (2026)

Build reliable web scrapers with TypeScript using typed schemas, Zod validation, Playwright, and Cheerio. Catch scraping bugs at compile time, not in production.

9 min read
typescript · tutorial · web scraping · type safety · zod

Web scraping with TypeScript adds type safety to data extraction, catching structural errors at compile time rather than in production. When a website changes its layout and your selectors break, TypeScript's type system ensures you handle missing data gracefully instead of silently passing undefined through your pipeline. This guide covers type-safe scraping patterns, runtime validation with Zod, and production-ready TypeScript scraper architectures.

Why TypeScript for Web Scraping?

TypeScript transforms web scraping from fragile scripts into maintainable software:

  • Typed schemas — define the shape of extracted data and catch structural mismatches early
  • Refactoring safety — rename a field and the compiler shows every location that needs updating
  • Runtime validation — combine TypeScript types with Zod for compile-time AND runtime safety
  • IDE support — autocomplete for extracted data properties, inline documentation, error highlighting
  • Team collaboration — shared types make scraper output predictable across services
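The first two points can be shown in a few lines of plain TypeScript. The `parsePrice` helper below is a hypothetical example, not part of any library: because its return type is `number | null`, the compiler refuses to let callers treat the result as a number until the null case is handled.

```typescript
// A minimal sketch: the `number | null` return type forces every caller
// to handle the case where a selector matched nothing or the text
// wasn't numeric -- the compiler rejects code that assumes a number.
function parsePrice(text: string | undefined): number | null {
  if (!text) return null;
  const cleaned = text.replace(/[^0-9.]/g, "");
  const value = Number.parseFloat(cleaned);
  return Number.isNaN(value) ? null : value;
}

const price = parsePrice("$19.99"); // typed as number | null
// `price.toFixed(2)` would be a compile error until null is ruled out:
const display = price !== null ? price.toFixed(2) : "unavailable";
console.log(display); // "19.99"
```

Rename `price` to `priceUsd` in the interface of a larger project and the compiler flags every call site: that is the refactoring safety the second bullet describes.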

For JavaScript-only approaches, see our JavaScript guide or Node.js guide. For Python, check our Python guide.

Project Setup

mkdir ts-scraper && cd ts-scraper
npm init -y
npm install typescript tsx cheerio playwright zod
npm install -D @types/node
npx playwright install chromium
npx tsc --init

Configure tsconfig.json:

{
  "compilerOptions": {
    "target": "ES2022",
    "module": "ESNext",
    "moduleResolution": "bundler",
    "strict": true,
    "esModuleInterop": true,
    "outDir": "dist",
    "rootDir": "src",
    "declaration": true
  },
  "include": ["src"]
}

Run scripts with: npx tsx src/index.ts

Defining Typed Schemas

The foundation of type-safe scraping is defining your data types upfront.

Basic Type Definitions

interface Product {
  name: string;
  price: number;
  currency: string;
  rating: number | null;
  reviewCount: number;
  url: string;
  inStock: boolean;
}

interface ScrapeResult<T> {
  data: T[];
  url: string;
  scrapedAt: string;
  duration: number;
  errors: string[];
}

Zod Schemas for Runtime Validation

TypeScript types disappear at runtime. Zod provides runtime validation that mirrors your TypeScript types:

import { z } from "zod";

const ProductSchema = z.object({
  name: z.string().min(1),
  price: z.number().positive(),
  currency: z.string().length(3),
  rating: z.number().min(0).max(5).nullable(),
  reviewCount: z.number().int().nonnegative(),
  url: z.string().url(),
  inStock: z.boolean(),
});

type Product = z.infer<typeof ProductSchema>;

const ProductListSchema = z.array(ProductSchema);

function validateProducts(raw: unknown[]): Product[] {
  const validated: Product[] = [];
  const errors: string[] = [];

  for (const item of raw) {
    const result = ProductSchema.safeParse(item);
    if (result.success) {
      validated.push(result.data);
    } else {
      errors.push(
        `Validation failed: ${result.error.issues.map((i) => i.message).join(", ")}`
      );
    }
  }

  if (errors.length > 0) {
    console.warn(`${errors.length} items failed validation:`, errors);
  }

  return validated;
}
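To see why the runtime check matters, here is the failure mode in plain TypeScript (a deliberately minimal sketch, with its own local `Product` shape): a type assertion compiles even when the scraped JSON is missing a field, and the bad value only surfaces downstream.

```typescript
// Why compile-time types alone aren't enough: an `as` assertion
// satisfies the compiler but performs no runtime check, so malformed
// scraped data flows through silently.
interface Product {
  name: string;
  price: number;
}

const raw = JSON.parse('{"name": "Widget"}'); // price is missing
const product = raw as Product;               // compiles without complaint

console.log(product.price); // undefined at runtime -- no error thrown
```

`ProductSchema.safeParse` rejects this object instead of letting `undefined` masquerade as a `number`.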

Method 1: Cheerio with Type Safety

Typed Scraper Function

import * as cheerio from "cheerio";

interface ScraperConfig {
  url: string;
  headers?: Record<string, string>;
  timeout?: number;
}

async function fetchAndParse(config: ScraperConfig): Promise<cheerio.CheerioAPI> {
  const { url, headers = {}, timeout = 10000 } = config;

  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeout);

  try {
    const response = await fetch(url, {
      headers: {
        "User-Agent":
          "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) " +
          "AppleWebKit/537.36 Chrome/122.0.0.0 Safari/537.36",
        ...headers,
      },
      signal: controller.signal,
    });

    if (!response.ok) {
      throw new Error(`HTTP ${response.status}: ${response.statusText}`);
    }

    const html = await response.text();
    return cheerio.load(html);
  } finally {
    clearTimeout(timer);
  }
}

async function scrapeProducts(url: string): Promise<Product[]> {
  const $ = await fetchAndParse({ url });
  const rawProducts: unknown[] = [];

  $(".product-card").each((_, element) => {
    const $el = $(element);
    const priceText = $el.find(".price").text().replace(/[^0-9.]/g, "");
    const ratingText = $el.find(".rating").text();

    rawProducts.push({
      name: $el.find("h3").text().trim(),
      price: parseFloat(priceText) || 0,
      currency: "USD",
      rating: ratingText ? parseFloat(ratingText) : null,
      reviewCount: parseInt($el.find(".reviews").text()) || 0,
      url: $el.find("a").attr("href") || "",
      inStock: !$el.hasClass("out-of-stock"),
    });
  });

  return validateProducts(rawProducts);
}

Generic Scraper Pattern

type Extractor<T> = ($: cheerio.CheerioAPI) => T[];

function createScraper<T extends z.ZodTypeAny>(schema: T) {
  type OutputType = z.infer<T>;

  return async function scrape(
    url: string,
    extractor: Extractor<unknown>
  ): Promise<ScrapeResult<OutputType>> {
    const start = Date.now();
    const errors: string[] = [];

    const $ = await fetchAndParse({ url });
    const rawItems = extractor($);

    const data: OutputType[] = [];
    for (const item of rawItems) {
      const result = schema.safeParse(item);
      if (result.success) {
        data.push(result.data);
      } else {
        errors.push(result.error.message);
      }
    }

    return {
      data,
      url,
      scrapedAt: new Date().toISOString(),
      duration: Date.now() - start,
      errors,
    };
  };
}

// Usage
const scrapeProducts = createScraper(ProductSchema);

const result = await scrapeProducts(
  "https://example.com/products",
  ($) => {
    const items: unknown[] = [];
    $(".product").each((_, el) => {
      items.push({
        name: $(el).find("h3").text().trim(),
        price: parseFloat($(el).find(".price").text().replace("$", "")),
        currency: "USD",
        rating: null,
        reviewCount: 0,
        url: $(el).find("a").attr("href") ?? "",
        inStock: true,
      });
    });
    return items;
  }
);

console.log(`Found ${result.data.length} valid products`);
console.log(`Errors: ${result.errors.length}`);

Method 2: Playwright with TypeScript

Type-Safe Page Evaluation

import { Page } from "playwright";

interface PageData {
  title: string;
  description: string | null;
  headings: string[];
  links: Array<{ text: string; href: string }>;
}

async function extractPageData(page: Page): Promise<PageData> {
  return page.evaluate((): PageData => {
    return {
      title: document.title,
      description:
        document.querySelector<HTMLMetaElement>('meta[name="description"]')
          ?.content ?? null,
      headings: Array.from(document.querySelectorAll("h1, h2, h3")).map(
        (el) => el.textContent?.trim() ?? ""
      ),
      links: Array.from(document.querySelectorAll<HTMLAnchorElement>("a[href]"))
        .slice(0, 50)
        .map((a) => ({
          text: a.textContent?.trim() ?? "",
          href: a.href,
        })),
    };
  });
}

Scraper Class Pattern

import { chromium, Browser, BrowserContext, Page } from "playwright";
import { z } from "zod";

abstract class BaseScraper<T extends z.ZodTypeAny> {
  protected browser: Browser | null = null;
  protected context: BrowserContext | null = null;

  constructor(
    protected schema: T,
    protected options: {
      headless?: boolean;
      timeout?: number;
      userAgent?: string;
    } = {}
  ) {}

  async init(): Promise<void> {
    this.browser = await chromium.launch({
      headless: this.options.headless ?? true,
    });
    this.context = await this.browser.newContext({
      userAgent:
        this.options.userAgent ??
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) " +
          "AppleWebKit/537.36 Chrome/122.0.0.0 Safari/537.36",
      viewport: { width: 1280, height: 720 },
    });
  }

  abstract extract(page: Page): Promise<unknown[]>;

  async scrape(url: string): Promise<z.infer<T>[]> {
    if (!this.context) await this.init();

    const page = await this.context!.newPage();
    try {
      await page.goto(url, {
        waitUntil: "networkidle",
        timeout: this.options.timeout ?? 15000,
      });

      const raw = await this.extract(page);
      const validated: z.infer<T>[] = [];

      for (const item of raw) {
        const result = this.schema.safeParse(item);
        if (result.success) validated.push(result.data);
      }

      return validated;
    } finally {
      await page.close();
    }
  }

  async close(): Promise<void> {
    if (this.browser) await this.browser.close();
  }
}

// Concrete implementation
const JobSchema = z.object({
  title: z.string(),
  company: z.string(),
  location: z.string(),
  salary: z.string().nullable(),
});

class JobScraper extends BaseScraper<typeof JobSchema> {
  constructor() {
    super(JobSchema);
  }

  async extract(page: Page): Promise<unknown[]> {
    return page.evaluate(() =>
      Array.from(document.querySelectorAll(".job-card")).map((card) => ({
        title: card.querySelector("h3")?.textContent?.trim(),
        company: card.querySelector(".company")?.textContent?.trim(),
        location: card.querySelector(".location")?.textContent?.trim(),
        salary: card.querySelector(".salary")?.textContent?.trim() ?? null,
      }))
    );
  }
}

// Usage
const scraper = new JobScraper();
const jobs = await scraper.scrape("https://example.com/jobs");
// jobs is typed as { title: string; company: string; location: string; salary: string | null }[]
console.log(jobs);
await scraper.close();

Method 3: SimpleCrawl with TypeScript

SimpleCrawl pairs perfectly with TypeScript for type-safe API-based scraping:

import { z } from "zod";

const SIMPLECRAWL_API_KEY = "sc_your_api_key";

interface SimpleCrawlRequest {
  url: string;
  format: "markdown" | "extract";
  schema?: Record<string, unknown>;
  render_js?: boolean;
}

interface SimpleCrawlResponse {
  url: string;
  title: string;
  markdown?: string;
  data?: Record<string, unknown>;
  credits_used: number;
}

async function simpleCrawl(request: SimpleCrawlRequest): Promise<SimpleCrawlResponse> {
  const response = await fetch("https://api.simplecrawl.com/v1/scrape", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${SIMPLECRAWL_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(request),
  });

  if (!response.ok) {
    throw new Error(`SimpleCrawl API error: ${response.status}`);
  }

  return response.json() as Promise<SimpleCrawlResponse>;
}

// Type-safe extraction with validation
async function extractAndValidate<T extends z.ZodTypeAny>(
  url: string,
  schema: T,
  extractionSchema: Record<string, unknown>
): Promise<z.infer<T>> {
  const response = await simpleCrawl({
    url,
    format: "extract",
    schema: extractionSchema,
  });

  const result = schema.safeParse(response.data);
  if (!result.success) {
    throw new Error(`Validation failed: ${result.error.message}`);
  }

  return result.data;
}

// Usage
const CompanySchema = z.object({
  name: z.string(),
  employees: z.number(),
  industry: z.string(),
  headquarters: z.string(),
});

const company = await extractAndValidate(
  "https://example.com/about",
  CompanySchema,
  {
    name: "string",
    employees: "number",
    industry: "string",
    headquarters: "string",
  }
);

// company is fully typed: { name: string; employees: number; industry: string; headquarters: string }
console.log(`${company.name} — ${company.employees} employees`);

See pricing for API credits and limits.

Data Storage Patterns

Type-Safe JSON Output

import { writeFile } from "fs/promises";

async function saveResults<T>(data: T[], filename: string): Promise<void> {
  await writeFile(filename, JSON.stringify(data, null, 2), "utf-8");
  console.log(`Saved ${data.length} items to ${filename}`);
}

Typed Database Inserts

interface DBRecord {
  id?: number;
  data: Product;
  scrapedAt: Date;
  sourceUrl: string;
}

async function upsertProducts(products: Product[], sourceUrl: string): Promise<void> {
  const records: DBRecord[] = products.map((product) => ({
    data: product,
    scrapedAt: new Date(),
    sourceUrl,
  }));

  // Insert into your database of choice
  console.log(`Upserting ${records.length} products`);
}

Error Handling Patterns

class ScrapingError extends Error {
  constructor(
    message: string,
    public readonly url: string,
    public readonly statusCode?: number,
    public readonly cause?: Error
  ) {
    super(message);
    this.name = "ScrapingError";
  }
}

type Result<T, E = ScrapingError> =
  | { success: true; data: T }
  | { success: false; error: E };

async function safeScrape<T>(
  fn: () => Promise<T>
): Promise<Result<T>> {
  try {
    const data = await fn();
    return { success: true, data };
  } catch (error) {
    if (error instanceof ScrapingError) {
      return { success: false, error };
    }
    return {
      success: false,
      error: new ScrapingError(
        error instanceof Error ? error.message : "Unknown error",
        "unknown"
      ),
    };
  }
}

// Usage
const result = await safeScrape(() => scrapeProducts("https://example.com"));
if (result.success) {
  console.log(`Got ${result.data.length} products`);
} else {
  console.error(`Failed: ${result.error.message}`);
}

Choosing the Right Approach

Approach             Type Safety   JS Rendering   Scale       Complexity
Cheerio + Zod        Full          No             High        Medium
Playwright + Types   Full          Yes            Medium      Medium-High
SimpleCrawl + Zod    Full          Yes            Unlimited   Low

For scraping specific sites, check our domain guides: Amazon, LinkedIn, Google. Compare options on our best web scraping APIs page.

FAQ

Is TypeScript slower than JavaScript for web scraping?

No. TypeScript compiles to JavaScript, so runtime performance is identical. The compile step adds a few seconds to development, but tsx makes this negligible.

Should I use Zod or io-ts for schema validation?

Zod is more popular, has a simpler API, and better TypeScript inference. io-ts is more theoretically rigorous. For web scraping, Zod's developer experience wins.

Can I use TypeScript with Scrapy?

Scrapy is Python-only. The closest TypeScript equivalent is Crawlee, which provides similar functionality with full TypeScript support. See our Node.js guide for Crawlee examples.

How do I handle websites that change their HTML structure?

TypeScript's type system catches breakage at compile time if your selectors return unexpected types. Zod's runtime validation catches breakage in production. Together, they provide the most resilient scraping architecture. SimpleCrawl's AI extraction adapts to layout changes automatically.
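The runtime half of that defence doesn't strictly require Zod. A hand-rolled type guard (a sketch with a hypothetical `Job` shape, not code from this guide's scrapers) shows the same failure being caught the moment a redesign breaks your selectors:

```typescript
interface Job {
  title: string;
  company: string;
}

// A hand-rolled type guard: returns false as soon as a scraped object
// no longer matches the expected shape, e.g. after a site redesign
// renames the elements your selectors target.
function isJob(value: unknown): value is Job {
  return (
    typeof value === "object" &&
    value !== null &&
    typeof (value as Record<string, unknown>).title === "string" &&
    typeof (value as Record<string, unknown>).company === "string"
  );
}

const beforeRedesign = { title: "Engineer", company: "Acme" };
const afterRedesign = { title: undefined, company: "Acme" }; // selector broke

console.log(isJob(beforeRedesign)); // true
console.log(isJob(afterRedesign));  // false -- rejected instead of passed along
```

Zod schemas do the same job with less boilerplate and far better error messages, which is why this guide prefers them.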

Is TypeScript overkill for simple scraping scripts?

For one-off scripts, plain JavaScript is fine. For anything you'll maintain, share with a team, or run in production, TypeScript pays for itself quickly through fewer runtime errors and better refactoring support.

Ready to try SimpleCrawl?

We're building the simplest web scraping API for AI. Join the waitlist and get 500 free credits at launch.
