Robots.txt Checker
Enter any domain to fetch and analyze its robots.txt file. See which paths are allowed or blocked for each crawler, discover sitemaps, and check crawl-delay settings.
Automate robots.txt compliance
SimpleCrawl automatically checks and respects robots.txt for every page it crawls. No manual configuration needed — ethical scraping by default.
How Robots.txt Works
Robots.txt is a plain text file that website owners place at the root of their domain to communicate with web crawlers. It follows the Robots Exclusion Protocol and tells bots which parts of the site they can and cannot access.
The file uses `User-agent` directives to target specific crawlers (or all crawlers with `*`) and `Allow`/`Disallow` rules to control path access. It can also include `Sitemap` references and `Crawl-delay` settings.
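To make the syntax concrete, here is a small hypothetical robots.txt (not from any real site) that combines all four directive types:

```
# Rules for all crawlers
User-agent: *
Allow: /admin/public/
Disallow: /admin/
Crawl-delay: 10

# Block one specific crawler entirely
User-agent: GPTBot
Disallow: /

Sitemap: https://example.com/sitemap.xml
```

Each `User-agent` line starts a group; the `Allow`/`Disallow` lines beneath it apply only to that group, while `Sitemap` lines apply to the whole file.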
What This Tool Shows You
- User-Agent Groups: See which rules apply to which crawlers. Common agents include Googlebot, Bingbot, GPTBot, and the wildcard `*` (all crawlers).
- Allow / Disallow Rules: Visual breakdown of which paths are accessible and which are blocked for each user-agent group.
- Sitemaps: Discover XML sitemaps referenced in the robots.txt. Sitemaps help crawlers find every page on a site.
- Crawl-Delay: Some sites specify how long crawlers should wait between requests. This tool highlights those directives.
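You can inspect the same information programmatically with Python's standard library. This is a minimal sketch using a hypothetical rule set (note that CPython's parser applies the first matching rule, so the more specific `Allow` line is listed first here):

```python
# Parse an example robots.txt with Python's built-in robotparser.
# The rules below are hypothetical, not from a real site.
import urllib.robotparser

rules = """
User-agent: *
Allow: /private/press/
Disallow: /private/
Crawl-delay: 5

Sitemap: https://example.com/sitemap.xml
""".splitlines()

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("*", "https://example.com/private/press/"))  # True: Allow matches first
print(rp.can_fetch("*", "https://example.com/private/notes"))   # False: Disallow matches
print(rp.crawl_delay("*"))                                      # 5
print(rp.site_maps())                                           # ['https://example.com/sitemap.xml']
```

`site_maps()` requires Python 3.8+. In a real pre-crawl check you would call `rp.set_url(...)` and `rp.read()` instead of `parse()` to fetch the live file.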
Use Cases
- Pre-Scrape Audit: Check a target site's robots.txt before building a scraper to understand access restrictions and plan your approach.
- SEO Troubleshooting: Verify that your own robots.txt isn't accidentally blocking important pages from search engine crawlers.
- Competitive Analysis: See what competitors are hiding from crawlers. Blocked paths can reveal internal tools, staging sites, and content structure.
- AI Crawler Policy: Check if sites block AI crawlers like GPTBot, CCBot, or Anthropic's ClaudeBot. This is increasingly common and important for data collection compliance.
Frequently Asked Questions
Is robots.txt legally binding?
Robots.txt is advisory, not legally enforceable on its own. However, courts have referenced robots.txt violations in scraping lawsuits, and respecting it is considered a best practice and a sign of good faith.
What if a site doesn't have a robots.txt?
If no robots.txt file exists (the server returns a 404), crawlers typically assume all paths are allowed. The absence of robots.txt is not the same as blocking.
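A crawler's fetch logic typically branches on the HTTP status of the robots.txt request. This hypothetical helper sketches the common interpretation (roughly in line with RFC 9309: 4xx means unrestricted, 5xx means assume blocked until the file is reachable):

```python
# Sketch of mapping a robots.txt fetch status to a crawl policy.
# The policy names are illustrative, not a standard API.
def robots_policy(status: int) -> str:
    """Decide how to treat a site based on its robots.txt response."""
    if 200 <= status < 300:
        return "parse"            # file exists: parse and apply its rules
    if 300 <= status < 400:
        return "follow-redirect"  # follow the redirect to the real file
    if 400 <= status < 500:
        return "allow-all"        # 404 and friends: treat the site as unrestricted
    return "disallow-all"         # 5xx: be conservative, assume everything is blocked

print(robots_policy(404))  # allow-all
```

This is why a missing file is safe to crawl, while a temporarily erroring server is not.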
Can robots.txt block my scraper?
Robots.txt is a request, not a technical barrier. Your scraper can choose to ignore it, but doing so may violate the site's terms of service and could have legal implications.
Why do some sites block AI crawlers specifically?
Many sites now add Disallow rules for GPTBot, CCBot, and other AI training crawlers to prevent their content from being used in LLM training datasets without permission.
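Checking whether a site blocks specific AI crawlers is a per-user-agent lookup. A minimal sketch against a hypothetical rule set:

```python
# Check several AI user agents against example rules (not a real site's file).
import urllib.robotparser

rules = """
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
""".splitlines()

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

for agent in ("GPTBot", "CCBot", "Googlebot"):
    print(agent, "allowed:", rp.can_fetch(agent, "https://example.com/article"))
# GPTBot allowed: False
# CCBot allowed: False
# Googlebot allowed: True
```

Agents with a dedicated `Disallow: /` group are blocked site-wide, while crawlers without their own group fall back to the `*` rules.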
Ready to try SimpleCrawl?
We're building the simplest web scraping API for AI. Join the waitlist and get 500 free credits at launch.