SimpleCrawl

Web Scraping with Python: The Complete Guide (2026)

Master web scraping with Python using requests, BeautifulSoup, Playwright, and Scrapy. Learn practical techniques for extracting data from any website, plus how SimpleCrawl simplifies everything.

9 min read
python, tutorial, web scraping, beautifulsoup, playwright

Web scraping with Python is the most popular approach to extracting data from websites. Python's rich ecosystem of HTTP clients, HTML parsers, and browser automation libraries makes it the go-to language for data extraction — whether you're building a price tracker, training an AI model, or monitoring competitors. This guide covers every major Python scraping technique, from basic HTTP requests to full browser automation, with real working code examples.

Why Python for Web Scraping?

Python dominates web scraping for several practical reasons:

  • Low barrier to entry — readable syntax and a gentle learning curve make it accessible to non-developers
  • Mature ecosystem — libraries like requests, BeautifulSoup, Scrapy, and Playwright have years of community support
  • Data processing built in — pandas, json, csv modules make it easy to clean and store extracted data
  • Async support — asyncio and aiohttp enable concurrent scraping for high-throughput use cases
  • AI/ML integration — scraped data feeds directly into scikit-learn, PyTorch, and LLM pipelines

If you're coming from another language, check our JavaScript, Node.js, TypeScript, or Go guides.

Setting Up Your Environment

Install the core libraries you'll need:

pip install requests beautifulsoup4 lxml playwright scrapy
playwright install chromium

To keep dependencies isolated per project, create a virtual environment:

python -m venv scraper-env
source scraper-env/bin/activate  # macOS/Linux
pip install requests beautifulsoup4 lxml

Method 1: Requests + BeautifulSoup (Static Pages)

The simplest approach works for pages that serve HTML without JavaScript rendering.

Basic Request and Parse

import requests
from bs4 import BeautifulSoup

url = "https://news.ycombinator.com/"
response = requests.get(url)
response.raise_for_status()

soup = BeautifulSoup(response.text, "lxml")

stories = []
for item in soup.select(".athing"):
    title_el = item.select_one(".titleline a")
    rank_el = item.select_one(".rank")
    if title_el:
        stories.append({
            "rank": rank_el.text.strip(".") if rank_el else None,
            "title": title_el.text,
            "url": title_el.get("href"),
        })

for story in stories[:10]:
    print(f"{story['rank']}. {story['title']}")
    print(f"   {story['url']}\n")

Handling Headers and Sessions

Many sites block requests without proper headers. Use a Session for cookie persistence:

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/122.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
})

response = session.get("https://example.com")

Pagination

Most sites paginate results. Here's a pattern for handling sequential pages:

import time

def scrape_paginated(base_url: str, pages: int = 5) -> list:
    all_items = []

    for page in range(1, pages + 1):
        url = f"{base_url}?page={page}"
        response = session.get(url)

        if response.status_code != 200:
            print(f"Failed on page {page}: {response.status_code}")
            break

        soup = BeautifulSoup(response.text, "lxml")
        items = soup.select(".item")

        if not items:
            break

        for item in items:
            all_items.append({
                "title": item.select_one("h2").text.strip(),
                "link": item.select_one("a")["href"],
            })

        time.sleep(1.5)  # respect rate limits

    return all_items

Error Handling and Retries

Production scrapers need robust error handling:

from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def create_session() -> requests.Session:
    session = requests.Session()

    retries = Retry(
        total=3,
        backoff_factor=1,
        status_forcelist=[429, 500, 502, 503, 504],
    )
    adapter = HTTPAdapter(max_retries=retries)
    session.mount("http://", adapter)
    session.mount("https://", adapter)

    session.headers.update({
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                      "AppleWebKit/537.36 Chrome/122.0.0.0 Safari/537.36",
    })

    return session

session = create_session()

try:
    response = session.get("https://example.com", timeout=10)
    response.raise_for_status()
except requests.exceptions.Timeout:
    print("Request timed out")
except requests.exceptions.HTTPError as e:
    print(f"HTTP error: {e}")
except requests.exceptions.ConnectionError:
    print("Connection failed")

Method 2: Playwright (JavaScript-Heavy Sites)

Modern websites render content with JavaScript. BeautifulSoup can't execute JS — you need a headless browser.

Basic Playwright Scraping

from playwright.sync_api import sync_playwright

def scrape_spa(url: str) -> dict:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")

        title = page.title()
        content = page.text_content("main")
        links = page.eval_on_selector_all(
            "a[href]",
            "elements => elements.map(e => ({text: e.textContent, href: e.href}))"
        )

        browser.close()
        return {"title": title, "content": content, "links": links}

data = scrape_spa("https://example.com")

Waiting for Dynamic Content

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/products")

    # Wait for specific element
    page.wait_for_selector("[data-testid='product-card']", timeout=10000)

    # Wait for network to settle
    page.wait_for_load_state("networkidle")

    # Wait for specific text to appear
    page.wait_for_selector("text=Showing results")

    products = page.query_selector_all("[data-testid='product-card']")
    for product in products:
        name = product.query_selector("h3")
        price = product.query_selector(".price")
        if name and price:  # skip cards missing either field
            print(f"{name.text_content()} — {price.text_content()}")

    browser.close()

Handling Infinite Scroll

from playwright.sync_api import sync_playwright
import time

def scrape_infinite_scroll(url: str, max_scrolls: int = 10) -> list:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")

        items = set()
        for _ in range(max_scrolls):
            current = page.query_selector_all(".item")
            for el in current:
                text = el.text_content()
                if text:
                    items.add(text.strip())

            page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            time.sleep(2)

            new_items = page.query_selector_all(".item")
            if len(new_items) == len(current):
                break

        browser.close()
        return list(items)

Async Playwright for Concurrency

import asyncio
from playwright.async_api import async_playwright

async def scrape_urls(urls: list[str]) -> list[dict]:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)

        async def scrape_one(url: str) -> dict:
            page = await browser.new_page()
            try:
                await page.goto(url, wait_until="networkidle", timeout=15000)
                title = await page.title()
                text = await page.text_content("body")
                return {"url": url, "title": title, "length": len(text or "")}
            except Exception as e:
                return {"url": url, "error": str(e)}
            finally:
                await page.close()

        results = await asyncio.gather(*[scrape_one(u) for u in urls])
        await browser.close()
        return results

urls = [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3",
]
results = asyncio.run(scrape_urls(urls))

Method 3: Scrapy (Large-Scale Crawling)

For crawling entire sites with thousands of pages, Scrapy provides a framework with built-in concurrency, caching, and middleware:

import scrapy

class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/products"]

    custom_settings = {
        "CONCURRENT_REQUESTS": 8,
        "DOWNLOAD_DELAY": 1,
        "ROBOTSTXT_OBEY": True,
        "USER_AGENT": "MyScraperBot/1.0",
    }

    def parse(self, response):
        for product in response.css("div.product-card"):
            yield {
                "name": product.css("h3::text").get(),
                "price": product.css(".price::text").get(),
                "url": response.urljoin(product.css("a::attr(href)").get()),
            }

        next_page = response.css("a.next-page::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)

Run it:

scrapy runspider product_spider.py -o products.json

Scrapy excels at crawling but doesn't handle JavaScript rendering natively. Pair it with scrapy-playwright for dynamic sites.
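
Pairing the two means pointing Scrapy's download handlers at scrapy-playwright and switching to the asyncio Twisted reactor. A minimal sketch of that wiring, shown here as plain dicts rather than a settings.py module; the dotted paths follow the scrapy-playwright README, so verify them against the version you install:

```python
# Settings scrapy-playwright needs, sketched as a plain dict; in a real
# project these keys would live in settings.py (or custom_settings).
PLAYWRIGHT_SETTINGS = {
    # Route both schemes through scrapy-playwright's download handler
    "DOWNLOAD_HANDLERS": {
        "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    },
    # scrapy-playwright requires the asyncio-based Twisted reactor
    "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
}

# Individual requests then opt in to browser rendering via request meta:
#   yield scrapy.Request(url, meta={"playwright": True})
```

Requests without `meta={"playwright": True}` still go through Scrapy's normal HTTP downloader, so you only pay the browser cost on pages that need it.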

Method 4: SimpleCrawl API (Production-Ready)

All the methods above require managing proxies, headers, CAPTCHAs, and browser instances. SimpleCrawl handles this in one API call:

import requests

SIMPLECRAWL_API_KEY = "sc_your_api_key"

def scrape(url: str, output_format: str = "markdown") -> dict:
    response = requests.post(
        "https://api.simplecrawl.com/v1/scrape",
        headers={
            "Authorization": f"Bearer {SIMPLECRAWL_API_KEY}",
            "Content-Type": "application/json",
        },
        json={"url": url, "format": output_format},
    )
    return response.json()

# Get clean markdown from any page
result = scrape("https://example.com/products")
print(result["markdown"])

Structured Data Extraction

def extract_structured(url: str, schema: dict) -> dict:
    response = requests.post(
        "https://api.simplecrawl.com/v1/scrape",
        headers={
            "Authorization": f"Bearer {SIMPLECRAWL_API_KEY}",
            "Content-Type": "application/json",
        },
        json={
            "url": url,
            "format": "extract",
            "schema": schema,
        },
    )
    return response.json()

products = extract_structured(
    "https://example.com/products",
    schema={
        "products": [{
            "name": "string",
            "price": "number",
            "rating": "number",
            "in_stock": "boolean",
        }]
    },
)

Batch Scraping

import asyncio
import aiohttp

async def batch_scrape(urls: list[str]) -> list[dict]:
    async with aiohttp.ClientSession() as session:
        async def fetch(url: str) -> dict:
            async with session.post(
                "https://api.simplecrawl.com/v1/scrape",
                headers={
                    "Authorization": f"Bearer {SIMPLECRAWL_API_KEY}",
                    "Content-Type": "application/json",
                },
                json={"url": url, "format": "markdown"},
            ) as resp:
                return await resp.json()

        return await asyncio.gather(*[fetch(u) for u in urls])

Check the pricing page for credit costs and rate limits.

Storing Scraped Data

Saving to CSV

import csv

def save_to_csv(data: list[dict], filename: str):
    if not data:
        return
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=data[0].keys())
        writer.writeheader()
        writer.writerows(data)

save_to_csv(stories, "hackernews.csv")

Saving to JSON

import json

with open("data.json", "w") as f:
    json.dump(stories, f, indent=2, ensure_ascii=False)

Saving to SQLite

import sqlite3

def save_to_sqlite(data: list[dict], db_name: str, table: str):
    # Table and column names are interpolated into the SQL string, so they
    # must come from trusted code, never from scraped or user input.
    conn = sqlite3.connect(db_name)
    cursor = conn.cursor()

    if data:
        columns = ", ".join(f"{k} TEXT" for k in data[0].keys())
        cursor.execute(f"CREATE TABLE IF NOT EXISTS {table} ({columns})")

        # executemany is faster than calling execute once per row
        placeholders = ", ".join("?" * len(data[0]))
        cursor.executemany(
            f"INSERT INTO {table} VALUES ({placeholders})",
            [list(row.values()) for row in data],
        )

    conn.commit()
    conn.close()
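
You can sanity-check the round trip with an in-memory database; the table name `items` and the sample rows below are illustrative, but the insert pattern matches the function above:

```python
import sqlite3

# Round-trip check: insert two sample rows the same way the helper does,
# then read them back from an in-memory database.
rows = [
    {"title": "Post A", "link": "https://example.com/a"},
    {"title": "Post B", "link": "https://example.com/b"},
]

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
columns = ", ".join(f"{k} TEXT" for k in rows[0])
cur.execute(f"CREATE TABLE items ({columns})")
cur.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [tuple(r.values()) for r in rows],
)
conn.commit()

cur.execute("SELECT title FROM items ORDER BY title")
titles = [r[0] for r in cur.fetchall()]
print(titles)  # ['Post A', 'Post B']
```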

Handling Common Anti-Bot Measures

Rotating User Agents

import random

USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 Chrome/122.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/122.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/122.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 Safari/605.1.15",
]

session.headers["User-Agent"] = random.choice(USER_AGENTS)

Using Proxies

proxies = {
    "http": "http://user:pass@proxy.example.com:8080",
    "https": "http://user:pass@proxy.example.com:8080",
}

response = session.get(url, proxies=proxies)

Rate Limiting with Backoff

import time
import random

def polite_request(session, url, max_retries=3):
    for attempt in range(max_retries):
        response = session.get(url)

        if response.status_code == 429:
            wait = 2 ** attempt + random.uniform(0, 1)
            print(f"Rate limited. Waiting {wait:.1f}s...")
            time.sleep(wait)
            continue

        return response

    return None

Choosing the Right Approach

| Approach | Best For | JS Support | Scale | Difficulty |
| --- | --- | --- | --- | --- |
| requests + BS4 | Static HTML pages | No | Medium | Easy |
| Playwright | JavaScript SPAs | Yes | Low-Medium | Medium |
| Scrapy | Large site crawls | With plugin | High | Medium-High |
| SimpleCrawl API | Everything | Yes | Unlimited | Easy |

For most production use cases, SimpleCrawl eliminates the need to manage infrastructure. Start scraping specific sites with our domain guides or compare options on our comparison page.

Best Practices for Ethical Scraping

  • Check robots.txt — use our robots.txt checker to review crawling permissions
  • Respect rate limits — add delays between requests (1-2 seconds minimum)
  • Don't scrape personal data — avoid PII unless you have a legal basis (see glossary: web scraping)
  • Cache results — don't re-scrape data you already have
  • Identify your bot — set a descriptive User-Agent for ethical scrapers
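
The robots.txt check can also be automated from Python itself with the standard library's urllib.robotparser. A minimal sketch using inline rules for illustration; against a live site you would call `rp.set_url("https://example.com/robots.txt")` followed by `rp.read()` instead of `parse`:

```python
from urllib.robotparser import RobotFileParser

# Parse a small robots.txt inline; a real scraper would fetch the live file
# with set_url(...) + read().
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /admin/",
    "Allow: /",
])

print(rp.can_fetch("MyScraperBot/1.0", "https://example.com/products"))     # True
print(rp.can_fetch("MyScraperBot/1.0", "https://example.com/admin/users"))  # False
```

Call `can_fetch` before every request to a new path, and skip (or log) URLs the rules disallow.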

FAQ

What's the best Python library for web scraping in 2026?

For static pages, requests + BeautifulSoup remains the gold standard. For JavaScript-rendered sites, Playwright is the best option. For the easiest path, SimpleCrawl's API requires zero infrastructure.

Is web scraping legal?

Scraping publicly accessible data is generally legal (see hiQ v. LinkedIn). However, always respect Terms of Service, avoid scraping personal data without consent, and consult legal counsel for commercial applications.

How do I scrape a site that requires JavaScript?

Use Playwright (playwright package) to automate a headless browser, or use SimpleCrawl which handles JS rendering automatically. BeautifulSoup alone cannot execute JavaScript.

How fast can Python scrape websites?

A single-threaded requests scraper processes about 1-2 pages/second. Async approaches with aiohttp can reach 50+ pages/second. Scrapy handles 100+ pages/second. SimpleCrawl's infrastructure scales independently of your code.
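
Those async numbers come from running many requests concurrently while capping how many are in flight at once. A sketch of that bounded-concurrency pattern with asyncio.Semaphore; the fetch here is a stand-in sleep rather than a real HTTP call, which in practice would use aiohttp:

```python
import asyncio

async def bounded_gather(urls: list[str], limit: int = 10) -> list[dict]:
    # The semaphore caps concurrent fetches, the key to high throughput
    # without overwhelming the target site (or your own machine).
    sem = asyncio.Semaphore(limit)

    async def fetch_one(url: str) -> dict:
        # Placeholder fetch: a real scraper would do an aiohttp GET here.
        async with sem:
            await asyncio.sleep(0.01)  # simulate network latency
            return {"url": url, "status": 200}

    return await asyncio.gather(*[fetch_one(u) for u in urls])

results = asyncio.run(bounded_gather([f"https://example.com/{i}" for i in range(25)]))
print(len(results))  # 25
```

With `limit=10` and ~10 ms per "request", 25 URLs finish in roughly three batches instead of 25 sequential round trips, which is the same mechanism that lets real aiohttp scrapers reach tens of pages per second.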

Should I use Selenium or Playwright for Python scraping?

Playwright is the recommended choice in 2026. It's faster, supports multiple browsers, has better async support, and provides more reliable wait mechanisms than Selenium.

Ready to try SimpleCrawl?

We're building the simplest web scraping API for AI. Join the waitlist and get 500 free credits at launch.
