
How to Scrape Reddit — Complete Guide (2026)

Learn how to scrape Reddit posts, comments, and subreddit data. Compare Reddit's API, Python scrapers, and SimpleCrawl for reliable Reddit data extraction.


Reddit is one of the internet's richest sources of user-generated content, with over 100,000 active communities covering every topic imaginable. Scraping Reddit data powers sentiment analysis, market research, content discovery, trend detection, and AI training datasets. This guide covers every practical method for extracting Reddit data — from posts and comments to subreddit metadata and user profiles.

What Data Can You Extract from Reddit?

Reddit's structure provides several layers of extractable data:

  • Posts — title, body text, URL, score (upvotes - downvotes), upvote ratio, awards, flair, post type (text/link/image/video)
  • Comments — text, score, author, timestamp, parent comment (full thread hierarchy), depth level
  • Subreddits — name, description, subscriber count, rules, moderator list, sidebar content, wiki pages
  • User profiles — username, karma (post + comment), account age, post history, comment history
  • Search results — posts matching keywords across all subreddits or within specific communities
  • Trending/popular data — hot posts, rising posts, top posts by timeframe

This data feeds content aggregation pipelines, brand monitoring tools, AI training datasets, and competitive research platforms.

Challenges When Scraping Reddit

While Reddit is more accessible than many platforms, scaling presents real challenges:

API Rate Limits (Post-2023)

Reddit drastically restricted free API access in 2023, forcing many third-party apps to shut down. The official API now requires authentication and enforces strict rate limits: 100 requests per minute for OAuth clients, with additional restrictions on bulk access.

Authentication Complexity

Reddit's API requires OAuth2 authentication with client IDs and refresh tokens. The old .json URL trick (appending .json to any Reddit URL) still works but is aggressively rate-limited.
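For the OAuth2 flow, you first exchange your app's client credentials for a bearer token at Reddit's token endpoint. A minimal sketch of the application-only grant (the client ID and secret are placeholders you obtain from reddit.com/prefs/apps; helper names are ours):

```python
import base64

import requests

TOKEN_URL = "https://www.reddit.com/api/v1/access_token"


def basic_auth_header(client_id: str, client_secret: str) -> str:
    """Build the HTTP Basic auth header Reddit expects on token requests."""
    raw = f"{client_id}:{client_secret}".encode()
    return "Basic " + base64.b64encode(raw).decode()


def get_app_token(client_id: str, client_secret: str) -> str:
    """Fetch an application-only OAuth2 token (no user context)."""
    response = requests.post(
        TOKEN_URL,
        headers={
            "Authorization": basic_auth_header(client_id, client_secret),
            "User-Agent": "MyApp/1.0 (contact: dev@example.com)",
        },
        data={"grant_type": "client_credentials"},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()["access_token"]
```

With the returned token, you call endpoints on oauth.reddit.com using an `Authorization: bearer <token>` header instead of the public www.reddit.com URLs.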

Anti-Bot Measures

Reddit uses Cloudflare protection on its web frontend. Automated requests without proper headers and browser fingerprints get challenged or blocked.

Nested Comment Threads

Reddit's comment system is deeply nested. The API paginates "more comments" links, requiring multiple requests to capture full threads. Long threads can require dozens of follow-up requests.
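The nesting shows up directly in the JSON: each comment's `replies` field is itself a listing, and nodes of kind `"more"` mark collapsed branches that need follow-up requests. A sketch of walking a thread's comment tree, which skips the `"more"` stubs rather than resolving them:

```python
def walk_comments(listing: dict, depth: int = 0):
    """Yield (depth, author, body) for every loaded comment in a thread.

    `listing` is a Reddit JSON Listing object. Nodes of kind "more" are
    collapsed branches that need extra requests; we skip them here.
    """
    for child in listing.get("data", {}).get("children", []):
        if child["kind"] == "more":
            continue  # unresolved branch: would need /api/morechildren to expand
        comment = child["data"]
        yield depth, comment.get("author"), comment.get("body")
        replies = comment.get("replies")
        if isinstance(replies, dict):  # empty replies come back as ""
            yield from walk_comments(replies, depth + 1)


# Example with a hand-built two-level thread:
thread = {"data": {"children": [
    {"kind": "t1", "data": {"author": "alice", "body": "parent", "replies": {
        "data": {"children": [
            {"kind": "t1", "data": {"author": "bob", "body": "child", "replies": ""}},
            {"kind": "more", "data": {}},
        ]}
    }}},
]}}

for depth, author, body in walk_comments(thread):
    print("  " * depth + f"{author}: {body}")
```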

Content Moderation and Deletion

Posts and comments get deleted, removed by moderators, or edited frequently. Point-in-time scraping misses content changes, and deleted content is irrecoverable through the API.

Method 1: Using SimpleCrawl API (Easiest)

SimpleCrawl handles rendering, anti-bot bypass, and returns clean structured data from any Reddit URL:

curl -X POST https://api.simplecrawl.com/v1/scrape \
  -H "Authorization: Bearer sc_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://www.reddit.com/r/webdev/top/?t=week",
    "format": "markdown"
  }'

For structured post data:

{
  "url": "https://www.reddit.com/r/webdev/top/?t=week",
  "format": "extract",
  "schema": {
    "posts": [{
      "title": "string",
      "author": "string",
      "score": "number",
      "comment_count": "number",
      "url": "string",
      "flair": "string"
    }]
  }
}

For full comment thread extraction:

{
  "url": "https://www.reddit.com/r/programming/comments/abc123/post_title/",
  "format": "extract",
  "schema": {
    "post_title": "string",
    "post_body": "string",
    "comments": [{
      "author": "string",
      "text": "string",
      "score": "number",
      "replies": ["string"]
    }]
  }
}

Method 2: DIY with Python (Manual)

Using Reddit's JSON Endpoints

The simplest approach uses Reddit's built-in JSON endpoints:

import requests
import time

headers = {
    "User-Agent": "MyApp/1.0 (contact: dev@example.com)",
}

def get_subreddit_posts(subreddit: str, sort: str = "hot", limit: int = 25) -> list:
    url = f"https://www.reddit.com/r/{subreddit}/{sort}.json?limit={limit}"

    # Retry a bounded number of times on rate limiting instead of
    # recursing indefinitely
    for _ in range(3):
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code != 429:
            break
        time.sleep(60)

    if response.status_code != 200:
        return []

    data = response.json()
    posts = []
    for child in data["data"]["children"]:
        post = child["data"]
        posts.append({
            "title": post["title"],
            "author": post["author"],
            "score": post["score"],
            "url": post["url"],
            "num_comments": post["num_comments"],
            "created_utc": post["created_utc"],
            "selftext": post.get("selftext", ""),
        })

    return posts

posts = get_subreddit_posts("python", "top", 10)
for p in posts:
    print(f"[{p['score']}] {p['title']} by u/{p['author']}")

Using PRAW (Python Reddit API Wrapper)

import praw

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="MyApp/1.0",
)

subreddit = reddit.subreddit("webdev")
for post in subreddit.hot(limit=10):
    print(f"[{post.score}] {post.title}")
    print(f"  Comments: {post.num_comments}")
    print(f"  URL: {post.url}")
    print()

    post.comments.replace_more(limit=0)
    for comment in post.comments[:5]:
        print(f"    [{comment.score}] {comment.body[:100]}")

Web Scraping with BeautifulSoup (Fallback)

When the API's rate limits are too restrictive:

import requests
from bs4 import BeautifulSoup

def scrape_old_reddit(subreddit: str) -> list:
    url = f"https://old.reddit.com/r/{subreddit}/"
    headers = {"User-Agent": "Mozilla/5.0 (compatible; MyBot/1.0)"}
    response = requests.get(url, headers=headers, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")

    posts = []
    for thing in soup.select("div.thing"):
        title_el = thing.select_one("a.title")
        # Reddit renders a bullet instead of a number when the score is hidden
        score_el = thing.select_one("div.score.unvoted")
        if title_el:
            posts.append({
                "title": title_el.text,
                "href": title_el.get("href"),
                "score": score_el.text if score_el else "0",
            })
    return posts

For more Python techniques, see our web scraping with Python guide. If you prefer JavaScript, check the Node.js guide.

Why SimpleCrawl Is Better for Reddit

Feature            Reddit API      DIY Scraping        SimpleCrawl
Rate limits        100 req/min     IP-based blocks     High throughput
Auth required      Yes (OAuth2)    No (but fragile)    API key only
Full threads       Paginated       Partial             Complete
Historical data    Limited         Unavailable         Cached snapshots
Setup time         30+ minutes     Hours               Minutes

SimpleCrawl combines the best of both approaches — structured data without API restrictions, and full page rendering without building custom scrapers.

Legal and Ethical Considerations

  • Reddit's API Terms — Reddit's Data API Terms require attribution and prohibit commercial use without a paid agreement (as of 2024).
  • Public data — posts and comments on public subreddits are publicly accessible. Courts have generally found that scraping public data doesn't violate the CFAA.
  • User privacy — avoid scraping private messages, banned/quarantined subreddits, or correlating usernames to real identities.
  • robots.txt — check Reddit's crawling permissions with our robots.txt checker.
  • AI training data — Reddit licensed its data to Google and OpenAI. Independent scrapers should review Reddit's terms regarding AI training usage.

FAQ

Does appending .json to Reddit URLs still work?

Yes, adding .json to most Reddit URLs returns JSON data. However, Reddit rate-limits this aggressively — you'll get HTTP 429 responses after a few dozen requests. SimpleCrawl provides a more reliable alternative.

Can I scrape Reddit without an API key?

You can scrape old.reddit.com with basic HTTP requests or use the .json endpoint trick, but both are rate-limited. SimpleCrawl requires only a SimpleCrawl API key — no Reddit credentials needed.

How do I scrape all comments in a thread?

Reddit's API paginates comments with "more" links, requiring recursive fetches. SimpleCrawl's extract mode captures the full rendered thread in one request, including deeply nested replies.

Is it legal to scrape Reddit?

This is a gray area. Reddit's terms restrict commercial use of their data without a license. For research or personal projects, scraping public data is generally permissible. Always consult legal counsel for commercial applications.

What subreddit data is most valuable?

Popular data extraction targets include r/wallstreetbets (financial sentiment), r/technology (trend detection), product-specific subreddits (customer feedback), and local subreddits (market research).

Ready to try SimpleCrawl?

We're building the simplest web scraping API for AI. Join the waitlist and get 500 free credits at launch.
