How to Scrape Reddit — Complete Guide (2026)
Learn how to scrape Reddit posts, comments, and subreddit data. Compare Reddit's API, Python scrapers, and SimpleCrawl for reliable Reddit data extraction.
Reddit is one of the internet's richest sources of user-generated content, with over 100,000 active communities covering every topic imaginable. Scraping Reddit data powers sentiment analysis, market research, content discovery, trend detection, and AI training datasets. This guide covers every practical method for extracting Reddit data — from posts and comments to subreddit metadata and user profiles.
What Data Can You Extract from Reddit?
Reddit's structure provides several layers of extractable data:
- Posts — title, body text, URL, score (upvotes - downvotes), upvote ratio, awards, flair, post type (text/link/image/video)
- Comments — text, score, author, timestamp, parent comment (full thread hierarchy), depth level
- Subreddits — name, description, subscriber count, rules, moderator list, sidebar content, wiki pages
- User profiles — username, karma (post + comment), account age, post history, comment history
- Search results — posts matching keywords across all subreddits or within specific communities
- Trending/popular data — hot posts, rising posts, top posts by timeframe
This data feeds content aggregation pipelines, brand monitoring tools, AI training datasets, and competitive research platforms.
Challenges When Scraping Reddit
While Reddit is more accessible than many platforms, scaling presents real challenges:
API Rate Limits (Post-2023)
Reddit drastically restricted free API access in 2023, killing many third-party apps. The official API now requires authentication and enforces strict rate limits: 100 requests per minute per OAuth client, far fewer for unauthenticated requests, with bulk access gated behind paid enterprise agreements.
Authentication Complexity
Reddit's API requires OAuth2 authentication with client IDs and refresh tokens. The old .json URL trick (appending .json to any Reddit URL) still works but is aggressively rate-limited.
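The OAuth2 handshake itself is short. Below is a minimal sketch of the application-only (`client_credentials`) flow against Reddit's standard token endpoint; the function names are illustrative, and authenticated requests then go to `oauth.reddit.com` instead of `www.reddit.com`:

```python
import requests

TOKEN_URL = "https://www.reddit.com/api/v1/access_token"

def get_app_token(client_id: str, client_secret: str, user_agent: str) -> str:
    """Fetch an application-only OAuth2 token via the client_credentials grant."""
    response = requests.post(
        TOKEN_URL,
        auth=(client_id, client_secret),      # HTTP Basic auth with app credentials
        data={"grant_type": "client_credentials"},
        headers={"User-Agent": user_agent},   # Reddit rejects default UA strings
        timeout=10,
    )
    response.raise_for_status()
    return response.json()["access_token"]

def oauth_headers(token: str, user_agent: str) -> dict:
    """Headers for subsequent authenticated calls to oauth.reddit.com."""
    return {"Authorization": f"Bearer {token}", "User-Agent": user_agent}
```

Tokens from this grant expire (typically after about an hour), so long-running scrapers should refresh them rather than cache one indefinitely.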
Anti-Bot Measures
Reddit uses Cloudflare protection on its web frontend. Automated requests without proper headers and browser fingerprints get challenged or blocked.
Nested Comment Threads
Reddit's comment system is deeply nested. The API paginates "more comments" links, requiring multiple requests to capture full threads. Long threads can require dozens of follow-up requests.
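To see why, it helps to look at the listing shape: each comment nests its replies under a `replies` key, and unexpanded branches appear as `more` stubs that need a follow-up request. A depth-first walk over that structure might look like this (a sketch over the API's JSON shape; the `more_ids` field is our own naming):

```python
def walk_comments(listing: dict, depth: int = 0):
    """Depth-first traversal of a Reddit comment listing.

    Yields (depth, comment_data) pairs; 'more' stubs mark branches that
    require another request (e.g. to /api/morechildren) to expand.
    """
    for child in listing.get("data", {}).get("children", []):
        if child["kind"] == "more":
            # Unfetched comment IDs hide behind this stub.
            yield depth, {"more_ids": child["data"].get("children", [])}
            continue
        data = child["data"]
        yield depth, data
        replies = data.get("replies")
        if isinstance(replies, dict):  # leaf comments return "" here, not a dict
            yield from walk_comments(replies, depth + 1)
```

Counting the `more_ids` entries a walk produces gives a quick estimate of how many follow-up requests a full capture of the thread will cost.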
Content Moderation and Deletion
Posts and comments get deleted, removed by moderators, or edited frequently. Point-in-time scraping misses content changes, and deleted content is irrecoverable through the API.
Method 1: Using SimpleCrawl API (Easiest)
SimpleCrawl handles rendering, anti-bot bypass, and returns clean structured data from any Reddit URL:
```bash
curl -X POST https://api.simplecrawl.com/v1/scrape \
  -H "Authorization: Bearer sc_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://www.reddit.com/r/webdev/top/?t=week",
    "format": "markdown"
  }'
```
For structured post data:
```json
{
  "url": "https://www.reddit.com/r/webdev/top/?t=week",
  "format": "extract",
  "schema": {
    "posts": [{
      "title": "string",
      "author": "string",
      "score": "number",
      "comment_count": "number",
      "url": "string",
      "flair": "string"
    }]
  }
}
```
For full comment thread extraction:
```json
{
  "url": "https://www.reddit.com/r/programming/comments/abc123/post_title/",
  "format": "extract",
  "schema": {
    "post_title": "string",
    "post_body": "string",
    "comments": [{
      "author": "string",
      "text": "string",
      "score": "number",
      "replies": ["string"]
    }]
  }
}
```
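The same calls can be issued from Python with `requests`. This is a sketch against the endpoint shown in the curl example; the exact response shape is assumed, so treat the JSON handling as illustrative:

```python
from typing import Optional

import requests

API_URL = "https://api.simplecrawl.com/v1/scrape"  # endpoint from the curl example

def build_payload(url: str, fmt: str = "markdown", schema: Optional[dict] = None) -> dict:
    """Assemble the request body; a schema is only sent for extract mode."""
    payload = {"url": url, "format": fmt}
    if schema is not None:
        payload["schema"] = schema
    return payload

def scrape(api_key: str, url: str, fmt: str = "markdown", schema: Optional[dict] = None) -> dict:
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {api_key}"},
        json=build_payload(url, fmt, schema),  # requests sets the JSON content type
        timeout=60,
    )
    response.raise_for_status()
    return response.json()
```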
Method 2: DIY with Python (Manual)
Using Reddit's JSON Endpoints
The simplest approach uses Reddit's built-in JSON endpoints:
```python
import requests
import time

headers = {
    "User-Agent": "MyApp/1.0 (contact: dev@example.com)",
}

def get_subreddit_posts(subreddit: str, sort: str = "hot", limit: int = 25) -> list:
    url = f"https://www.reddit.com/r/{subreddit}/{sort}.json?limit={limit}"
    response = requests.get(url, headers=headers)
    if response.status_code == 429:
        # Rate limited: wait out the window, then retry.
        time.sleep(60)
        return get_subreddit_posts(subreddit, sort, limit)
    if response.status_code != 200:
        return []
    data = response.json()
    posts = []
    for child in data["data"]["children"]:
        post = child["data"]
        posts.append({
            "title": post["title"],
            "author": post["author"],
            "score": post["score"],
            "url": post["url"],
            "num_comments": post["num_comments"],
            "created_utc": post["created_utc"],
            "selftext": post.get("selftext", ""),  # empty for link posts
        })
    return posts

posts = get_subreddit_posts("python", "top", 10)
for p in posts:
    print(f"[{p['score']}] {p['title']} by u/{p['author']}")
```
Using PRAW (Python Reddit API Wrapper)
```python
import praw

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="MyApp/1.0",
)

subreddit = reddit.subreddit("webdev")
for post in subreddit.hot(limit=10):
    print(f"[{post.score}] {post.title}")
    print(f"  Comments: {post.num_comments}")
    print(f"  URL: {post.url}")
    print()

    # limit=0 discards "more comments" stubs, keeping only already-loaded comments
    post.comments.replace_more(limit=0)
    for comment in post.comments[:5]:
        print(f"  [{comment.score}] {comment.body[:100]}")
```
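To capture more than the top five comments, PRAW can expand every "more comments" stub before flattening. A sketch assuming a `praw.models.Submission` (e.g. `reddit.submission(id="abc123")` from the client above); `replace_more(limit=None)` keeps fetching stubs until none remain, which can mean many extra requests on long threads:

```python
def all_comments(submission) -> list:
    """Expand every 'more comments' stub, then flatten the full tree.

    `submission` is assumed to be a praw.models.Submission instance.
    """
    submission.comments.replace_more(limit=None)  # None = fetch all stubs
    return [
        {"author": str(comment.author), "score": comment.score, "body": comment.body}
        for comment in submission.comments.list()
    ]
```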
Web Scraping with BeautifulSoup (Fallback)
When the API's rate limits are too restrictive:
```python
import requests
from bs4 import BeautifulSoup

def scrape_old_reddit(subreddit: str) -> list:
    url = f"https://old.reddit.com/r/{subreddit}/"
    headers = {"User-Agent": "Mozilla/5.0 (compatible; MyBot/1.0)"}
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, "html.parser")
    posts = []
    for thing in soup.select("div.thing"):  # each listing entry is a div.thing
        title_el = thing.select_one("a.title")
        score_el = thing.select_one("div.score.unvoted")
        if title_el:
            posts.append({
                "title": title_el.text,
                "href": title_el.get("href"),
                "score": score_el.text if score_el else "0",
            })
    return posts
```
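old.reddit paginates with a "next" button rather than an API cursor. A small helper for following it; the `span.next-button` selector matches old.reddit's current markup but is an assumption that may break if the page changes:

```python
from typing import Optional

from bs4 import BeautifulSoup

def find_next_page(html: str) -> Optional[str]:
    """Return the href of old.reddit's 'next' button, or None on the last page."""
    soup = BeautifulSoup(html, "html.parser")
    link = soup.select_one("span.next-button a")
    return link.get("href") if link else None
```

A crawl loop can call `scrape_old_reddit`-style parsing on each page, then feed `find_next_page` the raw HTML to decide whether to continue.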
For more Python techniques, see our web scraping with Python guide. If you prefer JavaScript, check the Node.js guide.
Why SimpleCrawl Is Better for Reddit
| Feature | Reddit API | DIY Scraping | SimpleCrawl |
|---|---|---|---|
| Rate limits | 100 req/min | IP-based blocks | High throughput |
| Auth required | Yes (OAuth2) | No (but fragile) | API key only |
| Full threads | Paginated | Partial | Complete |
| Historical data | Limited | Unavailable | Cached snapshots |
| Setup time | 30+ minutes | Hours | Minutes |
SimpleCrawl combines the best of both approaches — structured data without API restrictions, and full page rendering without building custom scrapers.
Legal Considerations
- Reddit's API Terms — Reddit's Data API Terms require attribution and prohibit commercial use without a paid agreement (as of 2024).
- Public data — posts and comments on public subreddits are publicly accessible. Courts have generally found that scraping public data doesn't violate the CFAA.
- User privacy — avoid scraping private messages, banned/quarantined subreddits, or correlating usernames to real identities.
- robots.txt — check Reddit's crawling permissions with our robots.txt checker.
- AI training data — Reddit licensed its data to Google and OpenAI. Independent scrapers should review Reddit's terms regarding AI training usage.
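Python's standard library can evaluate robots.txt rules before you crawl. The rules below are illustrative only, not Reddit's actual file; fetch https://www.reddit.com/robots.txt for the real thing:

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules -- replace with the contents of the site's real robots.txt.
rules = """\
User-agent: *
Disallow: /login
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("MyBot/1.0", "https://www.reddit.com/r/python/"))  # True
print(parser.can_fetch("MyBot/1.0", "https://www.reddit.com/login"))      # False
```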
FAQ
Does appending .json to Reddit URLs still work?
Yes, adding .json to most Reddit URLs returns JSON data. However, Reddit rate-limits this aggressively — you'll get HTTP 429 responses after a few dozen requests. SimpleCrawl provides a more reliable alternative.
Can I scrape Reddit without an API key?
You can scrape old.reddit.com with basic HTTP requests or use the .json endpoint trick, but both are rate-limited. SimpleCrawl needs only its own API key — no Reddit credentials required.
How do I scrape all comments in a thread?
Reddit's API paginates comments with "more" links, requiring recursive fetches. SimpleCrawl's extract mode captures the full rendered thread in one request, including deeply nested replies.
Is it legal to use scraped Reddit data for AI training?
This is a gray area. Reddit's terms restrict commercial use of their data without a license. For research or personal projects, scraping public data is generally permissible. Always consult legal counsel for commercial applications.
What subreddit data is most valuable?
Popular data extraction targets include r/wallstreetbets (financial sentiment), r/technology (trend detection), product-specific subreddits (customer feedback), and local subreddits (market research).
Ready to try SimpleCrawl?
We're building the simplest web scraping API for AI. Join the waitlist and get 500 free credits at launch.
More scraping guides
How to Scrape Amazon — Complete Guide (2026)
Learn how to scrape Amazon product data, prices, reviews, and rankings. Compare DIY Python scrapers with the SimpleCrawl API for reliable Amazon data extraction.
How to Scrape Google — Complete Guide (2026)
Learn how to scrape Google search results, SERP data, featured snippets, and People Also Ask boxes. Compare Python scrapers with the SimpleCrawl SERP API.
How to Scrape Indeed — Complete Guide (2026)
Learn how to scrape Indeed job listings, salaries, and company reviews. Compare Python scrapers with the SimpleCrawl API for reliable Indeed data extraction.
How to Scrape LinkedIn — Complete Guide (2026)
Learn how to scrape LinkedIn profiles, job listings, and company data. Covers DIY Python methods and the SimpleCrawl API for reliable LinkedIn data extraction.