SEO Crawler API — Audit Any Website at Scale
Use SimpleCrawl's API to build SEO crawlers that audit websites at scale. Extract title tags, meta descriptions, headings, links, and technical SEO data from any page.
An SEO crawler API lets you programmatically audit any website's on-page SEO — title tags, meta descriptions, heading hierarchy, internal links, canonical tags, and more. SimpleCrawl extracts this data in a single API call, making it straightforward to build custom SEO audit tools without managing headless browsers or parsing raw HTML.
What You Can Extract for SEO Audits
SimpleCrawl's structured extraction pulls the exact data SEO professionals need:
curl -X POST https://api.simplecrawl.com/scrape \
  -H "Authorization: Bearer YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/blog/seo-guide",
    "output": "json",
    "schema": {
      "title_tag": "string",
      "meta_description": "string",
      "canonical_url": "string",
      "h1": "string",
      "h2s": ["string"],
      "word_count": "number",
      "internal_links": ["string"],
      "external_links": ["string"],
      "images_without_alt": "number",
      "schema_markup_types": ["string"]
    }
  }'
Response:
{
  "data": {
    "title_tag": "Complete SEO Guide for 2026 | Example Blog",
    "meta_description": "Learn everything about SEO in 2026...",
    "canonical_url": "https://example.com/blog/seo-guide",
    "h1": "Complete SEO Guide for 2026",
    "h2s": ["On-Page SEO", "Technical SEO", "Link Building", "Content Strategy"],
    "word_count": 3847,
    "internal_links": ["/blog/keyword-research", "/blog/technical-seo", "/tools/site-audit"],
    "external_links": ["https://developers.google.com/search", "https://moz.com/learn"],
    "images_without_alt": 2,
    "schema_markup_types": ["Article", "BreadcrumbList"]
  }
}
Building a Site-Wide SEO Audit
Step 1: Discover All Pages
Start with the sitemap to get all crawlable URLs:
import simplecrawl
import xml.etree.ElementTree as ET
import requests
client = simplecrawl.Client(api_key="YOUR_KEY")
sitemap_resp = requests.get("https://example.com/sitemap.xml")
root = ET.fromstring(sitemap_resp.content)
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
urls = [loc.text for loc in root.findall(".//sm:loc", ns)]
print(f"Found {len(urls)} URLs to audit")
You can also use our free Sitemap Analyzer tool to inspect sitemaps interactively.
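The snippet above assumes a single flat sitemap. Large sites often publish a sitemap index that points at child sitemaps; a sketch that handles both cases (the `fetch` callback is split out only to keep the parsing logic testable — pass `lambda u: requests.get(u).content` in practice):

```python
import xml.etree.ElementTree as ET

SM_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def parse_sitemap(xml_bytes):
    """Classify a sitemap document and return its <loc> entries."""
    root = ET.fromstring(xml_bytes)
    locs = [loc.text for loc in root.findall(".//sm:loc", SM_NS)]
    kind = "index" if root.tag.endswith("sitemapindex") else "urlset"
    return kind, locs

def collect_urls(sitemap_url, fetch):
    """Recurse through sitemap indexes; `fetch(url)` returns raw XML bytes."""
    kind, locs = parse_sitemap(fetch(sitemap_url))
    if kind != "index":
        return locs
    urls = []
    for child in locs:
        urls.extend(collect_urls(child, fetch))
    return urls
```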
Step 2: Extract SEO Data from Every Page
seo_schema = {
    "title_tag": "string",
    "meta_description": "string",
    "canonical_url": "string",
    "h1": "string",
    "h2s": ["string"],
    "word_count": "number",
    "internal_links": ["string"],
    "external_links": ["string"],
    "images_without_alt": "number",
}

results = client.batch(urls=urls, output="json", schema=seo_schema)

audit_data = []
for result in results:
    audit_data.append({
        "url": result.url,
        **result.data,
        "status": "success"
    })
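The loop above assumes every page succeeds. In practice some URLs time out or return errors; a hedged sketch that separates failures, assuming each batch result exposes `url`, `data`, and an `error` attribute (the attribute names are guesses for illustration, not confirmed SimpleCrawl API fields):

```python
def partition_results(results):
    """Split batch results into successful audit rows and failures.

    Assumes result objects carry `url`, `data`, and `error` attributes —
    a hypothetical shape, not documented SimpleCrawl behavior.
    """
    audit_data, failed = [], []
    for result in results:
        if getattr(result, "error", None):
            failed.append({"url": result.url, "error": result.error})
        else:
            audit_data.append({"url": result.url, **result.data, "status": "success"})
    return audit_data, failed
```

Retry the `failed` list once before reporting — transient timeouts are common on large crawls.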
Step 3: Identify Issues
issues = {
    "missing_title": [],
    "long_title": [],
    "missing_meta": [],
    "long_meta": [],
    "missing_h1": [],
    "thin_content": [],
    "no_internal_links": [],
    "images_need_alt": [],
    "canonical_mismatch": [],
}
for page in audit_data:
    if not page.get("title_tag"):
        issues["missing_title"].append(page["url"])
    elif len(page["title_tag"]) > 60:
        issues["long_title"].append(page["url"])
    if not page.get("meta_description"):
        issues["missing_meta"].append(page["url"])
    elif len(page["meta_description"]) > 160:
        issues["long_meta"].append(page["url"])
    if not page.get("h1"):
        issues["missing_h1"].append(page["url"])
    if page.get("word_count", 0) < 300:
        issues["thin_content"].append(page["url"])
    if not page.get("internal_links"):
        issues["no_internal_links"].append(page["url"])
    if page.get("images_without_alt", 0) > 0:
        issues["images_need_alt"].append(page["url"])
    if page.get("canonical_url") and page.get("canonical_url") != page["url"]:
        issues["canonical_mismatch"].append(page["url"])
Step 4: Generate the Report
def generate_report(audit_data, issues):
    report = "# SEO Audit Report\n\n"
    report += f"**Pages audited:** {len(audit_data)}\n\n"
    report += "## Issues Found\n\n"
    report += "| Issue | Count | Severity |\n|---|---|---|\n"
    severity_map = {
        "missing_title": "Critical",
        "missing_h1": "Critical",
        "missing_meta": "High",
        "thin_content": "High",
        "canonical_mismatch": "High",
        "long_title": "Medium",
        "long_meta": "Medium",
        "no_internal_links": "Medium",
        "images_need_alt": "Low",
    }
    for issue_key, urls in issues.items():
        if urls:
            label = issue_key.replace("_", " ").title()
            severity = severity_map.get(issue_key, "Low")
            report += f"| {label} | {len(urls)} | {severity} |\n"
    return report

print(generate_report(audit_data, issues))
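Beyond the markdown summary, you may want the raw issue list in a spreadsheet; a minimal CSV export of the same issues dict, one row per (issue, URL) pair:

```python
import csv

def write_issues_csv(issues, path="seo_issues.csv"):
    """Flatten the issues dict into one CSV row per (issue, url) pair."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["issue", "url"])
        for issue_key, urls in issues.items():
            for url in urls:
                writer.writerow([issue_key, url])
```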
SEO Checks You Can Automate
On-Page Checks
| Check | What SimpleCrawl extracts | Why it matters |
|---|---|---|
| Title tag length | title_tag → character count | Titles much beyond ~60 characters are typically truncated in SERPs |
| Meta description | meta_description | If missing, Google generates its own snippet |
| H1 tag | h1 | Missing or duplicate H1s hurt page focus |
| Heading hierarchy | h2s, h3s | Proper hierarchy helps crawlers understand content structure |
| Word count | word_count | Thin content (under 300 words) rarely ranks |
| Internal links | internal_links | Orphan pages with no internal links are hard to discover |
| Image alt text | images_without_alt | Missing alt text hurts accessibility and image search |
| Canonical tags | canonical_url | Mismatched canonicals cause indexing confusion |
Content Quality Checks
Use SimpleCrawl's markdown output combined with NLP for deeper analysis:
result = client.scrape(url, output="markdown")
markdown = result.markdown
# Assumes you define `primary_keyword` for the page and supply your own
# readability helper; SimpleCrawl only returns the markdown.
checks = {
    "has_primary_keyword": primary_keyword.lower() in markdown[:500].lower(),
    "has_internal_links": "](/" in markdown,
    "has_external_links": "](http" in markdown,
    "has_images": "![" in markdown,
    "has_code_examples": "```" in markdown,
    "readability_score": calculate_flesch_kincaid(markdown),
}
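The readability helper is left for you to supply. A rough, dependency-free sketch of the Flesch reading-ease score — vowel-group syllable counting is a crude approximation, so treat scores as relative comparisons between your own pages, not absolute grades:

```python
import re

def calculate_flesch_kincaid(text):
    """Approximate Flesch reading ease:
    206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words).
    Higher scores mean easier reading."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0

    def syllables(word):
        # Count contiguous vowel groups as syllables (approximation).
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    total_syllables = sum(syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / sentences) - 84.6 * (total_syllables / len(words))
```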
Technical SEO Checks
Combine SimpleCrawl with other tools for a complete technical audit:
# Check robots.txt compliance
robots_result = client.scrape(f"{domain}/robots.txt", output="markdown")
# Check meta robots tags
page_result = client.scrape(url, output="json", schema={
    "meta_robots": "string",
    "x_robots_tag": "string",
    "hreflang_tags": ["string"],
    "structured_data_types": ["string"],
})
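To act on the robots.txt you fetched, Python's standard library can check each audited URL against it directly — no extra API call needed:

```python
from urllib import robotparser

def disallowed_urls(robots_txt, urls, user_agent="*"):
    """Return the subset of `urls` that robots.txt blocks for `user_agent`."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if not parser.can_fetch(user_agent, u)]
```

Pages that appear in the sitemap but are disallowed here are worth flagging: they are sending crawlers mixed signals.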
Use our Robots.txt Checker and Meta Tag Extractor for individual page analysis.
Monitoring SEO Over Time
Run audits on a schedule and track changes:
def track_seo_changes(current_audit, previous_audit):
    changes = []
    for curr in current_audit:
        prev = next(
            (p for p in previous_audit if p["url"] == curr["url"]),
            None
        )
        if not prev:
            changes.append({"url": curr["url"], "type": "new_page"})
            continue
        if curr.get("title_tag") != prev.get("title_tag"):
            changes.append({
                "url": curr["url"],
                "type": "title_changed",
                "old": prev.get("title_tag"),
                "new": curr.get("title_tag"),
            })
        if curr.get("word_count", 0) < prev.get("word_count", 0) * 0.5:
            changes.append({
                "url": curr["url"],
                "type": "content_removed",
                "old_count": prev.get("word_count"),
                "new_count": curr.get("word_count"),
            })
    removed = [
        p["url"] for p in previous_audit
        if not any(c["url"] == p["url"] for c in current_audit)
    ]
    for url in removed:
        changes.append({"url": url, "type": "page_removed"})
    return changes
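A diff needs a previous snapshot to compare against. A minimal persistence sketch using dated JSON files — the `audits/` directory name is arbitrary; swap in a database or object store for larger sites:

```python
import json
import os
from datetime import date

def save_snapshot(audit_data, store_dir="audits"):
    """Persist today's audit so the next run can diff against it."""
    os.makedirs(store_dir, exist_ok=True)
    path = os.path.join(store_dir, f"{date.today().isoformat()}.json")
    with open(path, "w") as f:
        json.dump(audit_data, f)
    return path

def load_latest_snapshot(store_dir="audits"):
    """Return the most recent saved audit, or None on the first run."""
    files = sorted(os.listdir(store_dir)) if os.path.isdir(store_dir) else []
    if not files:
        return None
    with open(os.path.join(store_dir, files[-1])) as f:
        return json.load(f)
```

ISO-dated filenames sort lexicographically, so `sorted(...)[-1]` is always the newest snapshot.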
Cost for SEO Auditing
| Site size | Audit frequency | Credits/month | SimpleCrawl plan |
|---|---|---|---|
| 100 pages | Weekly | 400 | Starter ($29) |
| 1,000 pages | Weekly | 4,000 | Starter ($29) |
| 5,000 pages | Weekly | 20,000 | Growth ($79) |
| 10,000 pages | Daily | 300,000 | Enterprise |
| 50,000 pages | Weekly | 200,000 | Enterprise |
Most sites under 5,000 pages fit comfortably in the Starter or Growth plan with weekly audits.
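The table's figures imply roughly one credit per page per crawl, rounded to plan-friendly numbers — an inference from the table, so verify against your plan. Estimating your own usage is simple arithmetic:

```python
def monthly_credits(pages, audits_per_month, credits_per_page=1):
    """Estimate monthly credit usage, assuming a flat per-page credit cost."""
    return pages * audits_per_month * credits_per_page

# Weekly audits average about 4.33 runs per month; daily is about 30.
```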
FAQ
What is an SEO crawler API?
An SEO crawler API is a web scraping service optimized for extracting SEO-relevant data from web pages — title tags, meta descriptions, headings, links, schema markup, and content metrics. Unlike general-purpose scraping, SEO crawling focuses on the metadata and structural elements that affect search rankings.
How is this different from Screaming Frog or Ahrefs?
Screaming Frog and Ahrefs are complete SEO tools with crawling built in. SimpleCrawl is the extraction layer — you get raw SEO data and build custom analysis on top. Use SimpleCrawl when you need programmatic access to SEO data for custom dashboards, alerts, or integration with your existing tools.
Can I audit competitor websites?
Yes. SimpleCrawl works on any public website. Audit competitor title tags, content structure, internal linking, and keyword targeting to inform your own SEO strategy. Combine with content aggregation to track competitor content publishing.
How many pages can I audit at once?
SimpleCrawl's batch API handles thousands of URLs in a single request. For very large sites (50,000+ pages), use the webhook delivery option to receive results asynchronously.
Does SimpleCrawl handle JavaScript-rendered SEO elements?
Yes. SimpleCrawl renders JavaScript before extraction, catching dynamically inserted title tags, schema markup, and content that server-side-only crawlers miss. This is critical for auditing SPAs and sites using client-side rendering.
Can I check for Core Web Vitals with SimpleCrawl?
SimpleCrawl extracts on-page SEO elements, not performance metrics. For Core Web Vitals (LCP, INP, CLS), use Google's PageSpeed Insights API or Chrome UX Report alongside SimpleCrawl's on-page data.
Get Started
Build your custom SEO audit tool with SimpleCrawl. Join the waitlist for 500 free credits — enough to audit a 500-page site. For deeper SEO analysis of individual pages, try our free Meta Tag Extractor.