Content Aggregation API — Aggregate News and Content Sources
Use SimpleCrawl to build content aggregation systems that pull articles, news, and blog posts from multiple sources. Clean markdown output ready for feeds, newsletters, and AI summarization.
A content aggregation API pulls articles, blog posts, and news from multiple websites into a single, structured feed. Whether you are building a news reader, curating an industry newsletter, or feeding content into an AI summarization pipeline, SimpleCrawl extracts clean article content, metadata, and structured data from any source.
The Content Aggregation Problem
RSS feeds are dying. Many sites no longer offer them, and those that do often provide truncated content. Meanwhile, scraping raw HTML from diverse sites means:
- Different HTML structures per site
- JavaScript-rendered content that basic scrapers miss
- Boilerplate (navigation, ads, related articles) mixed with content
- No standardized metadata extraction
SimpleCrawl solves this by returning clean markdown with consistent metadata from any URL, regardless of the source site's structure.
How It Works
Extract a Single Article
import simplecrawl

client = simplecrawl.Client(api_key="YOUR_KEY")

result = client.scrape("https://techcrunch.com/2026/03/01/example-article", output="json", schema={
    "title": "string",
    "author": "string",
    "published_date": "string",
    "category": "string",
    "content_markdown": "string",
    "summary": "string",
    "tags": ["string"],
    "reading_time_minutes": "number",
    "featured_image_url": "string",
})

article = result.data
print(f"{article['title']} by {article['author']}")
Aggregate Multiple Sources
sources = [
    {
        "name": "TechCrunch",
        "feed_url": "https://techcrunch.com/",
        "article_pattern": "https://techcrunch.com/2026/",
    },
    {
        "name": "The Verge",
        "feed_url": "https://www.theverge.com/tech",
        "article_pattern": "https://www.theverge.com/2026/",
    },
    {
        "name": "Hacker News",
        "feed_url": "https://news.ycombinator.com/",
        "article_pattern": None,
    },
]

article_schema = {
    "title": "string",
    "author": "string",
    "published_date": "string",
    "summary": "string",
    "tags": ["string"],
}
def discover_articles(source):
    """Get article URLs from a source's main page."""
    result = client.scrape(source["feed_url"], output="json", schema={
        "article_links": [{
            "title": "string",
            "url": "string",
        }]
    })
    return [
        link for link in result.data.get("article_links", [])
        if link.get("url")
    ]
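Each source above carries an `article_pattern`, which can be used to keep only real article URLs from the discovered links (Hacker News sets it to `None` because its links point to external sites). A minimal helper for that filtering step:

```python
def filter_article_links(links, pattern):
    """Keep only links whose URL matches the source's article pattern.

    A pattern of None keeps every link that has a URL at all.
    """
    return [
        link for link in links
        if link.get("url") and (pattern is None or link["url"].startswith(pattern))
    ]
```

Applied between discovery and extraction, this avoids spending credits scraping category pages, author pages, and other non-article links.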
from datetime import datetime

def extract_article(url):
    """Extract full article content and metadata."""
    result = client.scrape(url, output="json", schema=article_schema)
    markdown = client.scrape(url, output="markdown")
    return {
        **result.data,
        "url": url,
        "content": markdown.markdown,
        "extracted_at": datetime.utcnow().isoformat(),
    }
Build a Content Feed
from datetime import datetime

def build_daily_feed(sources):
    all_articles = []
    for source in sources:
        links = discover_articles(source)
        for link in links[:10]:
            try:
                article = extract_article(link["url"])
                article["source"] = source["name"]
                all_articles.append(article)
            except Exception as e:
                print(f"Failed: {link['url']} — {e}")
    all_articles.sort(
        key=lambda a: a.get("published_date", ""),
        reverse=True
    )
    return all_articles
feed = build_daily_feed(sources)
print(f"Aggregated {len(feed)} articles")
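In practice you will want the feed on disk so downstream jobs (the newsletter and RAG steps below do not need to re-scrape). A small persistence sketch, with `save_feed` and the `feed.json` path as illustrative names:

```python
import json

def save_feed(articles, path="feed.json"):
    """Write the aggregated feed to a JSON file and return the article count.

    default=str handles datetimes and other non-JSON types.
    """
    with open(path, "w") as f:
        json.dump(articles, f, indent=2, default=str)
    return len(articles)
```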
Use Cases for Content Aggregation
Industry Newsletter
Aggregate top articles from 10–20 industry sources daily. Use AI to summarize each article and generate a newsletter:
from openai import OpenAI

openai_client = OpenAI()

def generate_newsletter(articles):
    summaries = []
    for article in articles[:15]:
        response = openai_client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": f"Summarize in 2-3 sentences:\n\n{article['content'][:3000]}"
            }]
        )
        summaries.append({
            "title": article["title"],
            "url": article["url"],
            "source": article["source"],
            "summary": response.choices[0].message.content,
        })
    newsletter = "# Daily Tech Digest\n\n"
    for s in summaries:
        newsletter += f"## [{s['title']}]({s['url']})\n"
        newsletter += f"*{s['source']}*\n\n"
        newsletter += f"{s['summary']}\n\n---\n\n"
    return newsletter
Competitive Content Monitoring
Track what competitors publish and when:
competitor_blogs = [
    "https://competitor-a.com/blog",
    "https://competitor-b.com/blog",
    "https://competitor-c.com/resources",
]

def monitor_competitor_content(blog_urls):
    new_posts = []
    for url in blog_urls:
        result = client.scrape(url, output="json", schema={
            "recent_posts": [{
                "title": "string",
                "url": "string",
                "date": "string",
                "topic": "string",
            }]
        })
        new_posts.extend(result.data.get("recent_posts", []))
    return new_posts
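To track the "when" part, diff each run against the URLs seen on previous runs. A sketch, with `diff_new_posts` as an illustrative helper (persisting `seen_urls` between runs is up to you):

```python
def diff_new_posts(posts, seen_urls):
    """Return only posts not seen before, and record them as seen."""
    fresh = [p for p in posts if p.get("url") and p["url"] not in seen_urls]
    seen_urls.update(p["url"] for p in fresh)
    return fresh
```

Run this on a schedule and the timestamp of each run tells you roughly when a competitor post appeared.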
Combine with SEO crawling to analyze competitor content quality and targeting.
Research Feeds
Aggregate papers, articles, and reports for research teams:
research_sources = [
    "https://arxiv.org/list/cs.AI/recent",
    "https://paperswithcode.com/latest",
    "https://huggingface.co/papers",
]

paper_schema = {
    "papers": [{
        "title": "string",
        "authors": ["string"],
        "abstract": "string",
        "url": "string",
        "date": "string",
    }]
}
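Assuming each source in `research_sources` is scraped with `paper_schema`, the per-source results can be merged into one feed sorted newest-first (ISO date strings sort correctly as plain strings). A sketch, with `merge_papers` as an illustrative name:

```python
def merge_papers(results):
    """Flatten per-source scrape results into one list, newest papers first.

    Each result is expected to be a dict matching paper_schema.
    """
    papers = []
    for result in results:
        papers.extend(result.get("papers", []))
    return sorted(papers, key=lambda p: p.get("date", ""), reverse=True)
```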
See our research data extraction use case for more details on academic scraping.
Content-Powered AI Applications
Feed aggregated content into RAG pipelines for AI assistants that stay current:
daily_articles = build_daily_feed(sources)

# chunk_text, embed, and vector_db are placeholders for your own
# chunking, embedding, and vector-store tooling.
for article in daily_articles:
    chunks = chunk_text(article["content"], max_tokens=512)
    embeddings = embed(chunks)
    vector_db.upsert(
        embeddings,
        metadata={
            "title": article["title"],
            "source": article["source"],
            "date": article.get("published_date"),
            "url": article["url"],
        }
    )
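The `chunk_text` helper is left to your tooling; for reference, a minimal character-budget splitter, assuming roughly four characters per token:

```python
def chunk_text(text, max_tokens=512, chars_per_token=4):
    """Split text into word-aligned chunks of about max_tokens tokens each."""
    max_chars = max_tokens * chars_per_token
    chunks, current = [], ""
    for word in text.split():
        # Start a new chunk when adding this word would exceed the budget.
        if current and len(current) + 1 + len(word) > max_chars:
            chunks.append(current)
            current = word
        else:
            current = f"{current} {word}" if current else word
    if current:
        chunks.append(current)
    return chunks
```

A proper tokenizer-based splitter (or a library like LangChain's text splitters) will give tighter token budgets; this is just enough to make the loop above concrete.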
Handling Aggregation Challenges
Deduplication
The same story gets covered by multiple outlets. Deduplicate by title similarity:
from difflib import SequenceMatcher

def is_duplicate(new_title, existing_titles, threshold=0.8):
    for existing in existing_titles:
        ratio = SequenceMatcher(None, new_title.lower(), existing.lower()).ratio()
        if ratio > threshold:
            return True
    return False

unique_articles = []
seen_titles = []
for article in all_articles:
    if not is_duplicate(article["title"], seen_titles):
        unique_articles.append(article)
        seen_titles.append(article["title"])
Rate Limiting
When scraping many sources, respect rate limits and space out requests:
import time

def batch_with_delay(urls, delay=1.0):
    results = []
    for url in urls:
        result = client.scrape(url, output="markdown")
        results.append(result)
        time.sleep(delay)
    return results
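Transient failures (rate-limit responses, timeouts) are also worth retrying with exponential backoff rather than failing the whole batch. A small wrapper sketch, with the sleep function injectable so the retry logic is testable:

```python
import time

def with_backoff(fn, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fn, retrying on any exception with exponentially growing delays."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: surface the error
            sleep(base_delay * 2 ** attempt)
```

Usage would look like `with_backoff(lambda: client.scrape(url, output="markdown"))`.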
For large batches, use SimpleCrawl's batch API with webhook delivery instead.
Content Freshness
Track when content was last seen and avoid re-processing:
import hashlib

def content_hash(text):
    return hashlib.md5(text.encode()).hexdigest()

def is_new_content(url, content, seen_hashes):
    h = content_hash(content)
    if h in seen_hashes.get(url, set()):
        return False
    seen_hashes.setdefault(url, set()).add(h)
    return True
Cost for Content Aggregation
| Use case | Sources | Articles/day | Credits/month | Plan |
|---|---|---|---|---|
| Personal feed | 5 sites | 25 | ~750 | Starter ($29) |
| Team newsletter | 15 sites | 75 | ~2,250 | Starter ($29) |
| Competitive monitoring | 10 competitors | 50 | ~1,500 | Starter ($29) |
| Research aggregator | 20 sources | 200 | ~6,000 | Growth ($79) |
| Commercial news product | 50+ sources | 1,000+ | ~30,000+ | Scale ($199) |
Content aggregation is one of the most cost-efficient use cases because you typically scrape tens to hundreds of articles per day, not thousands.
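The monthly figures above follow from simple arithmetic, assuming one credit per scraped article (the pricing model the table implies):

```python
def monthly_credits(articles_per_day, days=30):
    """Estimate monthly credit usage at one credit per scraped article."""
    return articles_per_day * days
```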
FAQ
What is a content aggregation API?
A content aggregation API extracts articles, news, and content from multiple web sources and returns them in a standardized format. Instead of building custom parsers for each site, you get clean content and metadata from any URL through a single API.
Is content aggregation the same as RSS?
No. RSS is a standardized feed format that websites opt into providing. Content aggregation scrapes the actual web pages, working even when sites do not offer RSS feeds. SimpleCrawl also extracts richer metadata and cleaner content than most RSS feeds provide.
Can I use aggregated content on my own site?
Displaying full article text from other sites without permission raises copyright concerns. Common approaches: show titles and summaries (generally fair use), link to original sources, use AI to generate original commentary or analysis based on the source material. Consult legal counsel for your specific use case.
How do I handle paywalled content?
SimpleCrawl scrapes publicly accessible content. Paywalled articles that require login or subscription are not accessible through the API. Focus on sources that offer free content, or use only the publicly visible portions (title, summary, first paragraph).
Can I aggregate content for AI training?
Content aggregation for AI training raises legal questions (ongoing litigation around AI training data). SimpleCrawl provides the extraction capability — the legal responsibility for how you use the data rests with you. For research purposes, see our research data extraction guide.
How does this compare to dedicated news APIs?
News APIs (NewsAPI, GDELT, Event Registry) provide pre-aggregated news with metadata and categorization. They cover major news sources but miss niche blogs, industry publications, and company blogs. SimpleCrawl works on any website, giving you access to sources that news APIs do not cover. Many teams use both: a news API for broad coverage and SimpleCrawl for niche sources.
Get Started
Build your content aggregation pipeline with SimpleCrawl. Join the waitlist for 500 free credits — enough to aggregate content from dozens of sources for weeks.
For deeper analysis of aggregated content, combine with our RAG pipeline guide to build searchable knowledge bases.