Web Scraping with Python: The Complete Guide (2026)
Master web scraping with Python using requests, BeautifulSoup, Playwright, and Scrapy. Learn practical techniques for extracting data from any website, plus how SimpleCrawl simplifies everything.
Web scraping with Python is the most popular approach to extracting data from websites. Python's rich ecosystem of HTTP clients, HTML parsers, and browser automation libraries makes it the go-to language for data extraction — whether you're building a price tracker, training an AI model, or monitoring competitors. This guide covers every major Python scraping technique, from basic HTTP requests to full browser automation, with real working code examples.
Why Python for Web Scraping?
Python dominates web scraping for several practical reasons:
- Low barrier to entry — readable syntax and a gentle learning curve make it accessible to non-developers
- Mature ecosystem — libraries like requests, BeautifulSoup, Scrapy, and Playwright have years of community support
- Data processing built in — pandas, json, csv modules make it easy to clean and store extracted data
- Async support — asyncio and aiohttp enable concurrent scraping for high-throughput use cases
- AI/ML integration — scraped data feeds directly into scikit-learn, PyTorch, and LLM pipelines
If you're coming from another language, check our JavaScript, Node.js, TypeScript, or Go guides.
Setting Up Your Environment
Install the core libraries you'll need:
```bash
pip install requests beautifulsoup4 lxml playwright scrapy
playwright install chromium
```
For a clean project setup, create a virtual environment first:

```bash
python -m venv scraper-env
source scraper-env/bin/activate  # macOS/Linux; use scraper-env\Scripts\activate on Windows
pip install requests beautifulsoup4 lxml
```
Method 1: Requests + BeautifulSoup (Static Pages)
The simplest approach works for pages that serve HTML without JavaScript rendering.
Basic Request and Parse
```python
import requests
from bs4 import BeautifulSoup

url = "https://news.ycombinator.com/"
response = requests.get(url)
response.raise_for_status()

soup = BeautifulSoup(response.text, "lxml")
stories = []

for item in soup.select(".athing"):
    title_el = item.select_one(".titleline a")
    rank_el = item.select_one(".rank")
    if title_el:
        stories.append({
            "rank": rank_el.text.strip(".") if rank_el else None,
            "title": title_el.text,
            "url": title_el.get("href"),
        })

for story in stories[:10]:
    print(f"{story['rank']}. {story['title']}")
    print(f"   {story['url']}\n")
```
Handling Headers and Sessions
Many sites block requests without proper headers. Use a Session for cookie persistence:
```python
session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/122.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
})

response = session.get("https://example.com")
```
Pagination
Most sites paginate results. Here's a pattern for handling sequential pages:
```python
import time

def scrape_paginated(base_url: str, pages: int = 5) -> list:
    all_items = []
    for page in range(1, pages + 1):
        url = f"{base_url}?page={page}"
        response = session.get(url)  # session from the previous example
        if response.status_code != 200:
            print(f"Failed on page {page}: {response.status_code}")
            break
        soup = BeautifulSoup(response.text, "lxml")
        items = soup.select(".item")
        if not items:
            break
        for item in items:
            all_items.append({
                "title": item.select_one("h2").text.strip(),
                "link": item.select_one("a")["href"],
            })
        time.sleep(1.5)  # respect rate limits
    return all_items
```
Error Handling and Retries
Production scrapers need robust error handling:
```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def create_session() -> requests.Session:
    session = requests.Session()
    retries = Retry(
        total=3,
        backoff_factor=1,
        status_forcelist=[429, 500, 502, 503, 504],
    )
    adapter = HTTPAdapter(max_retries=retries)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    session.headers.update({
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                      "AppleWebKit/537.36 Chrome/122.0.0.0 Safari/537.36",
    })
    return session

session = create_session()
try:
    response = session.get("https://example.com", timeout=10)
    response.raise_for_status()
except requests.exceptions.Timeout:
    print("Request timed out")
except requests.exceptions.HTTPError as e:
    print(f"HTTP error: {e}")
except requests.exceptions.ConnectionError:
    print("Connection failed")
```
Method 2: Playwright (JavaScript-Heavy Sites)
Modern websites render content with JavaScript. BeautifulSoup can't execute JS — you need a headless browser.
Basic Playwright Scraping
```python
from playwright.sync_api import sync_playwright

def scrape_spa(url: str) -> dict:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")

        title = page.title()
        content = page.text_content("main")
        links = page.eval_on_selector_all(
            "a[href]",
            "elements => elements.map(e => ({text: e.textContent, href: e.href}))"
        )

        browser.close()
        return {"title": title, "content": content, "links": links}

data = scrape_spa("https://example.com")
```
Waiting for Dynamic Content
```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/products")

    # Wait for a specific element
    page.wait_for_selector("[data-testid='product-card']", timeout=10000)

    # Wait for the network to settle
    page.wait_for_load_state("networkidle")

    # Wait for specific text to appear
    page.wait_for_selector("text=Showing results")

    products = page.query_selector_all("[data-testid='product-card']")
    for product in products:
        name = product.query_selector("h3")
        price = product.query_selector(".price")
        if name and price:  # query_selector returns None on no match
            print(f"{name.text_content()} — {price.text_content()}")

    browser.close()
```
Handling Infinite Scroll
```python
import time
from playwright.sync_api import sync_playwright

def scrape_infinite_scroll(url: str, max_scrolls: int = 10) -> list:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")

        items = set()
        for _ in range(max_scrolls):
            current = page.query_selector_all(".item")
            for el in current:
                text = el.text_content()
                if text:
                    items.add(text.strip())

            page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            time.sleep(2)  # give new content time to load

            new_items = page.query_selector_all(".item")
            if len(new_items) == len(current):
                break  # nothing new loaded; we've reached the end

        browser.close()
        return list(items)
```
Async Playwright for Concurrency
```python
import asyncio
from playwright.async_api import async_playwright

async def scrape_urls(urls: list[str]) -> list[dict]:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)

        async def scrape_one(url: str) -> dict:
            page = await browser.new_page()
            try:
                await page.goto(url, wait_until="networkidle", timeout=15000)
                title = await page.title()
                text = await page.text_content("body")
                return {"url": url, "title": title, "length": len(text or "")}
            except Exception as e:
                return {"url": url, "error": str(e)}
            finally:
                await page.close()

        results = await asyncio.gather(*[scrape_one(u) for u in urls])
        await browser.close()
        return results

urls = [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3",
]
results = asyncio.run(scrape_urls(urls))
```
Method 3: Scrapy (Large-Scale Crawling)
For crawling entire sites with thousands of pages, Scrapy provides a framework with built-in concurrency, caching, and middleware:
```python
import scrapy

class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/products"]

    custom_settings = {
        "CONCURRENT_REQUESTS": 8,
        "DOWNLOAD_DELAY": 1,
        "ROBOTSTXT_OBEY": True,
        "USER_AGENT": "MyScraperBot/1.0",
    }

    def parse(self, response):
        for product in response.css("div.product-card"):
            yield {
                "name": product.css("h3::text").get(),
                "price": product.css(".price::text").get(),
                "url": response.urljoin(product.css("a::attr(href)").get()),
            }

        next_page = response.css("a.next-page::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)
```
Run it:
```bash
scrapy runspider product_spider.py -o products.json
```
Scrapy excels at crawling but doesn't handle JavaScript rendering natively. Pair it with scrapy-playwright for dynamic sites.
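As a sketch, wiring scrapy-playwright in comes down to two settings; the handler and reactor names below are the ones the plugin documents, and the rest of your spider stays unchanged:

```python
# settings.py fragment for scrapy-playwright (pip install scrapy-playwright)
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
# The plugin requires Twisted's asyncio reactor
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
```

Individual requests then opt in to browser rendering with `meta={"playwright": True}`.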
Method 4: SimpleCrawl API (Production-Ready)
All the methods above require managing proxies, headers, CAPTCHAs, and browser instances. SimpleCrawl handles this in one API call:
```python
import requests

SIMPLECRAWL_API_KEY = "sc_your_api_key"

def scrape(url: str, output_format: str = "markdown") -> dict:
    response = requests.post(
        "https://api.simplecrawl.com/v1/scrape",
        headers={
            "Authorization": f"Bearer {SIMPLECRAWL_API_KEY}",
            "Content-Type": "application/json",
        },
        json={"url": url, "format": output_format},
    )
    return response.json()

# Get clean markdown from any page
result = scrape("https://example.com/products")
print(result["markdown"])
```
Structured Data Extraction
```python
def extract_structured(url: str, schema: dict) -> dict:
    response = requests.post(
        "https://api.simplecrawl.com/v1/scrape",
        headers={
            "Authorization": f"Bearer {SIMPLECRAWL_API_KEY}",
            "Content-Type": "application/json",
        },
        json={
            "url": url,
            "format": "extract",
            "schema": schema,
        },
    )
    return response.json()

products = extract_structured(
    "https://example.com/products",
    schema={
        "products": [{
            "name": "string",
            "price": "number",
            "rating": "number",
            "in_stock": "boolean",
        }]
    },
)
```
Batch Scraping
```python
import asyncio
import aiohttp

async def batch_scrape(urls: list[str]) -> list[dict]:
    async with aiohttp.ClientSession() as session:
        async def fetch(url: str) -> dict:
            async with session.post(
                "https://api.simplecrawl.com/v1/scrape",
                headers={
                    "Authorization": f"Bearer {SIMPLECRAWL_API_KEY}",
                    "Content-Type": "application/json",
                },
                json={"url": url, "format": "markdown"},
            ) as resp:
                return await resp.json()

        return await asyncio.gather(*[fetch(u) for u in urls])
```
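Firing every URL at once is an easy way to get rate limited at scale. A minimal sketch of capping in-flight requests with an `asyncio.Semaphore` — the dummy `fetch` here stands in for a real HTTP call so the example runs offline:

```python
import asyncio

async def bounded_gather(coros, limit: int = 10):
    # Allow at most `limit` coroutines to run concurrently
    sem = asyncio.Semaphore(limit)

    async def run(coro):
        async with sem:
            return await coro

    return await asyncio.gather(*(run(c) for c in coros))

async def fetch(url: str) -> str:
    # Stand-in for a real HTTP call (e.g. an aiohttp POST)
    await asyncio.sleep(0.01)
    return f"fetched {url}"

results = asyncio.run(
    bounded_gather([fetch(f"https://example.com/{i}") for i in range(25)], limit=5)
)
```

`asyncio.gather` preserves input order, so `results[i]` always corresponds to `urls[i]` even though completion order varies.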
Check the pricing page for credit costs and rate limits.
Storing Scraped Data
Saving to CSV
```python
import csv

def save_to_csv(data: list[dict], filename: str):
    if not data:
        return
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=data[0].keys())
        writer.writeheader()
        writer.writerows(data)

save_to_csv(stories, "hackernews.csv")
```
Saving to JSON
```python
import json

with open("data.json", "w", encoding="utf-8") as f:
    json.dump(stories, f, indent=2, ensure_ascii=False)
```
Saving to SQLite
```python
import sqlite3

def save_to_sqlite(data: list[dict], db_name: str, table: str):
    conn = sqlite3.connect(db_name)
    cursor = conn.cursor()
    if data:
        columns = ", ".join(f"{k} TEXT" for k in data[0].keys())
        cursor.execute(f"CREATE TABLE IF NOT EXISTS {table} ({columns})")
        placeholders = ", ".join("?" * len(data[0]))
        for row in data:
            cursor.execute(f"INSERT INTO {table} VALUES ({placeholders})", list(row.values()))
    conn.commit()
    conn.close()
```
Handling Common Anti-Bot Measures
Rotating User Agents
```python
import random

USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 Chrome/122.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/122.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/122.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 Safari/605.1.15",
]

session.headers["User-Agent"] = random.choice(USER_AGENTS)
```
Using Proxies
```python
proxies = {
    "http": "http://user:pass@proxy.example.com:8080",
    "https": "http://user:pass@proxy.example.com:8080",
}
response = session.get(url, proxies=proxies)
```
Rate Limiting with Backoff
```python
import time
import random

def polite_request(session, url, max_retries=3):
    for attempt in range(max_retries):
        response = session.get(url)
        if response.status_code == 429:
            wait = 2 ** attempt + random.uniform(0, 1)
            print(f"Rate limited. Waiting {wait:.1f}s...")
            time.sleep(wait)
            continue
        return response
    return None
```
Choosing the Right Approach
| Approach | Best For | JS Support | Scale | Difficulty |
|---|---|---|---|---|
| requests + BS4 | Static HTML pages | No | Medium | Easy |
| Playwright | JavaScript SPAs | Yes | Low-Medium | Medium |
| Scrapy | Large site crawls | With plugin | High | Medium-High |
| SimpleCrawl API | Everything | Yes | Unlimited | Easy |
For most production use cases, SimpleCrawl eliminates the need to manage infrastructure. Start scraping specific sites with our domain guides or compare options on our comparison page.
Legal and Ethical Best Practices
- Check robots.txt — use our robots.txt checker to review crawling permissions
- Respect rate limits — add delays between requests (1-2 seconds minimum)
- Don't scrape personal data — avoid PII unless you have a legal basis (see glossary: web scraping)
- Cache results — don't re-scrape data you already have
- Identify your bot — set a descriptive User-Agent for ethical scrapers
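The robots.txt check in the first bullet can be automated with the standard library's `urllib.robotparser`. A minimal sketch using made-up inline rules; a real scraper would point `set_url()` at the live file and call `read()`:

```python
from urllib.robotparser import RobotFileParser

# Parse inline rules so the example runs offline; for a live site use:
#   rp.set_url("https://example.com/robots.txt"); rp.read()
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("MyScraperBot/1.0", "/products"))   # True
print(rp.can_fetch("MyScraperBot/1.0", "/private/x"))  # False
```

Call `can_fetch()` before each request, and skip any URL the rules disallow.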
FAQ
What's the best Python library for web scraping in 2026?
For static pages, requests + BeautifulSoup remains the gold standard. For JavaScript-rendered sites, Playwright is the best option. For the easiest path, SimpleCrawl's API requires zero infrastructure.
Is web scraping with Python legal?
Scraping publicly accessible data is generally legal (see hiQ v. LinkedIn). However, always respect Terms of Service, avoid scraping personal data without consent, and consult legal counsel for commercial applications.
How do I scrape a site that requires JavaScript?
Use Playwright (playwright package) to automate a headless browser, or use SimpleCrawl which handles JS rendering automatically. BeautifulSoup alone cannot execute JavaScript.
How fast can Python scrape websites?
A single-threaded requests scraper processes about 1-2 pages/second. Async approaches with aiohttp can reach 50+ pages/second. Scrapy handles 100+ pages/second. SimpleCrawl's infrastructure scales independently of your code.
Should I use Selenium or Playwright for Python scraping?
Playwright is the recommended choice in 2026. It's faster, supports multiple browsers, has better async support, and provides more reliable wait mechanisms than Selenium.
Ready to try SimpleCrawl?
We're building the simplest web scraping API for AI. Join the waitlist and get 500 free credits at launch.
More guides
Web Scraping with Go: Colly, Goquery, and Beyond (2026)
Build fast, concurrent web scrapers with Go using Colly and Goquery. Learn high-performance data extraction patterns for production systems.
Web Scraping with JavaScript: Node.js Guide (2026)
Master web scraping with JavaScript using fetch, cheerio, and Puppeteer. Learn practical data extraction techniques for Node.js, plus how SimpleCrawl makes it effortless.
Web Scraping with Node.js: Complete Tutorial (2026)
Build powerful web scrapers with Node.js using Playwright, Crawlee, and async patterns. Learn advanced techniques for data extraction at scale with practical code examples.
Web Scraping with TypeScript: Type-Safe Data Extraction (2026)
Build reliable web scrapers with TypeScript using typed schemas, Zod validation, Playwright, and Cheerio. Catch scraping bugs at compile time, not in production.