Content Aggregation API — Aggregate News and Content Sources
Use SimpleCrawl to build content aggregation systems that pull articles, news, and blog posts from multiple sources. Clean markdown output ready for feeds, newsletters, and AI summarization.
A content aggregation API pulls articles, blog posts, and news from multiple websites into a single, structured feed. Whether you are building a news reader, curating an industry newsletter, or feeding content into an AI summarization pipeline, SimpleCrawl extracts clean article content, metadata, and structured data from any source.
The Content Aggregation Problem
RSS feeds are dying. Many sites no longer offer them, and those that do often provide truncated content. Meanwhile, scraping raw HTML from diverse sites means:
- Different HTML structures per site
- JavaScript-rendered content that basic scrapers miss
- Boilerplate (navigation, ads, related articles) mixed with content
- No standardized metadata extraction
SimpleCrawl solves this by returning clean markdown with consistent metadata from any URL, regardless of the source site's structure.
How It Works
Extract a Single Article
import simplecrawl

client = simplecrawl.Client(api_key="YOUR_KEY")

result = client.scrape("https://techcrunch.com/2026/03/01/example-article", output="json", schema={
    "title": "string",
    "author": "string",
    "published_date": "string",
    "category": "string",
    "content_markdown": "string",
    "summary": "string",
    "tags": ["string"],
    "reading_time_minutes": "number",
    "featured_image_url": "string",
})

article = result.data
print(f"{article['title']} by {article['author']}")
Aggregate Multiple Sources
sources = [
    {
        "name": "TechCrunch",
        "feed_url": "https://techcrunch.com/",
        "article_pattern": "https://techcrunch.com/2026/",
    },
    {
        "name": "The Verge",
        "feed_url": "https://www.theverge.com/tech",
        "article_pattern": "https://www.theverge.com/2026/",
    },
    {
        "name": "Hacker News",
        "feed_url": "https://news.ycombinator.com/",
        "article_pattern": None,
    },
]

article_schema = {
    "title": "string",
    "author": "string",
    "published_date": "string",
    "summary": "string",
    "tags": ["string"],
}
def discover_articles(source):
    """Get article URLs from a source's main page."""
    result = client.scrape(source["feed_url"], output="json", schema={
        "article_links": [{
            "title": "string",
            "url": "string",
        }]
    })
    return [
        link for link in result.data.get("article_links", [])
        if link.get("url")
    ]
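Each source above carries an `article_pattern`, which can be used to keep only real article URLs from the discovered links (Hacker News sets it to `None` because its links point to external sites). A minimal helper for that filtering step:

```python
def filter_article_links(links, pattern):
    """Keep only links whose URL matches the source's article pattern.

    A pattern of None keeps every link that has a URL at all.
    """
    return [
        link for link in links
        if link.get("url") and (pattern is None or link["url"].startswith(pattern))
    ]
```

Applied between discovery and extraction, this avoids spending credits scraping category pages, author pages, and other non-article links.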
from datetime import datetime

def extract_article(url):
    """Extract full article content and metadata."""
    result = client.scrape(url, output="json", schema=article_schema)
    markdown = client.scrape(url, output="markdown")
    return {
        **result.data,
        "url": url,
        "content": markdown.markdown,
        "extracted_at": datetime.utcnow().isoformat(),
    }
Build a Content Feed
from datetime import datetime

def build_daily_feed(sources):
    all_articles = []
    for source in sources:
        links = discover_articles(source)
        for link in links[:10]:
            try:
                article = extract_article(link["url"])
                article["source"] = source["name"]
                all_articles.append(article)
            except Exception as e:
                print(f"Failed: {link['url']} — {e}")
    all_articles.sort(
        key=lambda a: a.get("published_date", ""),
        reverse=True
    )
    return all_articles
feed = build_daily_feed(sources)
print(f"Aggregated {len(feed)} articles")
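In practice you will want the feed on disk so downstream jobs (the newsletter and RAG steps below do not need to re-scrape). A small persistence sketch, with `save_feed` and the `feed.json` path as illustrative names:

```python
import json

def save_feed(articles, path="feed.json"):
    """Write the aggregated feed to a JSON file and return the article count.

    default=str handles datetimes and other non-JSON types.
    """
    with open(path, "w") as f:
        json.dump(articles, f, indent=2, default=str)
    return len(articles)
```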
Use Cases for Content Aggregation
Industry Newsletter
Aggregate top articles from 10–20 industry sources daily. Use AI to summarize each article and generate a newsletter:
from openai import OpenAI

openai_client = OpenAI()

def generate_newsletter(articles):
    summaries = []
    for article in articles[:15]:
        response = openai_client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": f"Summarize in 2-3 sentences:\n\n{article['content'][:3000]}"
            }]
        )
        summaries.append({
            "title": article["title"],
            "url": article["url"],
            "source": article["source"],
            "summary": response.choices[0].message.content,
        })
    newsletter = "# Daily Tech Digest\n\n"
    for s in summaries:
        newsletter += f"## [{s['title']}]({s['url']})\n"
        newsletter += f"*{s['source']}*\n\n"
        newsletter += f"{s['summary']}\n\n---\n\n"
    return newsletter
Competitive Content Monitoring
Track what competitors publish and when:
competitor_blogs = [
    "https://competitor-a.com/blog",
    "https://competitor-b.com/blog",
    "https://competitor-c.com/resources",
]

def monitor_competitor_content(blog_urls):
    new_posts = []
    for url in blog_urls:
        result = client.scrape(url, output="json", schema={
            "recent_posts": [{
                "title": "string",
                "url": "string",
                "date": "string",
                "topic": "string",
            }]
        })
        new_posts.extend(result.data.get("recent_posts", []))
    return new_posts
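To track the "when" part, diff each run against the URLs seen on previous runs. A sketch, with `diff_new_posts` as an illustrative helper (persisting `seen_urls` between runs is up to you):

```python
def diff_new_posts(posts, seen_urls):
    """Return only posts not seen before, and record them as seen."""
    fresh = [p for p in posts if p.get("url") and p["url"] not in seen_urls]
    seen_urls.update(p["url"] for p in fresh)
    return fresh
```

Run this on a schedule and the timestamp of each run tells you roughly when a competitor post appeared.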
Combine with SEO crawling to analyze competitor content quality and targeting.
Research Feeds
Aggregate papers, articles, and reports for research teams:
research_sources = [
    "https://arxiv.org/list/cs.AI/recent",
    "https://paperswithcode.com/latest",
    "https://huggingface.co/papers",
]

paper_schema = {
    "papers": [{
        "title": "string",
        "authors": ["string"],
        "abstract": "string",
        "url": "string",
        "date": "string",
    }]
}
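Assuming each source in `research_sources` is scraped with `paper_schema`, the per-source results can be merged into one feed sorted newest-first (ISO date strings sort correctly as plain strings). A sketch, with `merge_papers` as an illustrative name:

```python
def merge_papers(results):
    """Flatten per-source scrape results into one list, newest papers first.

    Each result is expected to be a dict matching paper_schema.
    """
    papers = []
    for result in results:
        papers.extend(result.get("papers", []))
    return sorted(papers, key=lambda p: p.get("date", ""), reverse=True)
```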
See our research data extraction use case for more details on academic scraping.
Content-Powered AI Applications
Feed aggregated content into RAG pipelines for AI assistants that stay current:
daily_articles = build_daily_feed(sources)

# chunk_text, embed, and vector_db are placeholders for your own
# chunking, embedding, and vector-store tooling.
for article in daily_articles:
    chunks = chunk_text(article["content"], max_tokens=512)
    embeddings = embed(chunks)
    vector_db.upsert(
        embeddings,
        metadata={
            "title": article["title"],
            "source": article["source"],
            "date": article.get("published_date"),
            "url": article["url"],
        }
    )
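The `chunk_text` helper is left to your tooling; for reference, a minimal character-budget splitter, assuming roughly four characters per token:

```python
def chunk_text(text, max_tokens=512, chars_per_token=4):
    """Split text into word-aligned chunks of about max_tokens tokens each."""
    max_chars = max_tokens * chars_per_token
    chunks, current = [], ""
    for word in text.split():
        # Start a new chunk when adding this word would exceed the budget.
        if current and len(current) + 1 + len(word) > max_chars:
            chunks.append(current)
            current = word
        else:
            current = f"{current} {word}" if current else word
    if current:
        chunks.append(current)
    return chunks
```

A proper tokenizer-based splitter (or a library like LangChain's text splitters) will give tighter token budgets; this is just enough to make the loop above concrete.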
Handling Aggregation Challenges
Deduplication
The same story gets covered by multiple outlets. Deduplicate by title similarity:
from difflib import SequenceMatcher

def is_duplicate(new_title, existing_titles, threshold=0.8):
    for existing in existing_titles:
        ratio = SequenceMatcher(None, new_title.lower(), existing.lower()).ratio()
        if ratio > threshold:
            return True
    return False

unique_articles = []
seen_titles = []
for article in all_articles:
    if not is_duplicate(article["title"], seen_titles):
        unique_articles.append(article)
        seen_titles.append(article["title"])
Rate Limiting
When scraping many sources, respect rate limits and space out requests:
import time

def batch_with_delay(urls, delay=1.0):
    results = []
    for url in urls:
        result = client.scrape(url, output="markdown")
        results.append(result)
        time.sleep(delay)
    return results
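Transient failures (rate-limit responses, timeouts) are also worth retrying with exponential backoff rather than failing the whole batch. A small wrapper sketch, with the sleep function injectable so the retry logic is testable:

```python
import time

def with_backoff(fn, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fn, retrying on any exception with exponentially growing delays."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: surface the error
            sleep(base_delay * 2 ** attempt)
```

Usage would look like `with_backoff(lambda: client.scrape(url, output="markdown"))`.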
For large batches, use SimpleCrawl's batch API with webhook delivery instead.
Content Freshness
Track when content was last seen and avoid re-processing:
import hashlib

def content_hash(text):
    return hashlib.md5(text.encode()).hexdigest()

def is_new_content(url, content, seen_hashes):
    h = content_hash(content)
    if h in seen_hashes.get(url, set()):
        return False
    seen_hashes.setdefault(url, set()).add(h)
    return True
Cost for Content Aggregation
| Use case | Sources | Articles/day | Credits/month | Plan |
|---|---|---|---|---|
| Personal feed | 5 sites | 25 | ~750 | Starter ($29) |
| Team newsletter | 15 sites | 75 | ~2,250 | Starter ($29) |
| Competitive monitoring | 10 competitors | 50 | ~1,500 | Starter ($29) |
| Research aggregator | 20 sources | 200 | ~6,000 | Growth ($79) |
| Commercial news product | 50+ sources | 1,000+ | ~30,000+ | Scale ($199) |
Content aggregation is one of the most cost-efficient use cases because you typically scrape tens to hundreds of articles per day, not thousands.
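The monthly figures above follow from simple arithmetic, assuming one credit per scraped article (the pricing model the table implies):

```python
def monthly_credits(articles_per_day, days=30):
    """Estimate monthly credit usage at one credit per scraped article."""
    return articles_per_day * days
```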
FAQ
What is a content aggregation API?
A content aggregation API extracts articles, news, and content from multiple web sources and returns them in a standardized format. Instead of building custom parsers for each site, you get clean content and metadata from any URL through a single API.
Is content aggregation the same as RSS?
No. RSS is a standardized feed format that websites opt into providing. Content aggregation scrapes the actual web pages, working even when sites do not offer RSS feeds. SimpleCrawl also extracts richer metadata and cleaner content than most RSS feeds provide.
Can I use aggregated content on my own site?
Displaying full article text from other sites without permission raises copyright concerns. Common approaches: show titles and summaries (generally fair use), link to original sources, use AI to generate original commentary or analysis based on the source material. Consult legal counsel for your specific use case.
How do I handle paywalled content?
SimpleCrawl scrapes publicly accessible content. Paywalled articles that require login or subscription are not accessible through the API. Focus on sources that offer free content, or use only the publicly visible portions (title, summary, first paragraph).
Can I aggregate content for AI training?
Content aggregation for AI training raises legal questions (ongoing litigation around AI training data). SimpleCrawl provides the extraction capability — the legal responsibility for how you use the data rests with you. For research purposes, see our research data extraction guide.
How does this compare to dedicated news APIs?
News APIs (NewsAPI, GDELT, Event Registry) provide pre-aggregated news with metadata and categorization. They cover major news sources but miss niche blogs, industry publications, and company blogs. SimpleCrawl works on any website, giving you access to sources that news APIs do not cover. Many teams use both: a news API for broad coverage and SimpleCrawl for niche sources.
Get Started
Build your content aggregation pipeline with SimpleCrawl. Join the waitlist for 500 free credits — enough to aggregate content from dozens of sources for weeks.
For deeper analysis of aggregated content, combine with our RAG pipeline guide to build searchable knowledge bases.