What is Robots.txt? — SimpleCrawl Glossary
Robots.txt is a text file that tells web crawlers which pages they are allowed or not allowed to access on a website. Learn how it works and how to respect it.
Definition
Robots.txt is a plain text file placed at the root of a website (e.g., https://example.com/robots.txt) that provides instructions to web crawlers about which pages or sections of the site they are allowed to access. It follows the Robots Exclusion Protocol, a standard that has been in use since 1994.
The file uses a simple syntax of User-agent directives (specifying which crawler the rules apply to) and Disallow/Allow rules (specifying which paths are restricted or permitted). While robots.txt is advisory — crawlers are not technically forced to obey it — respecting it is considered a fundamental ethical practice in web crawling and web scraping.
How Robots.txt Works
The robots.txt file sits at the root of a domain and is the first thing a well-behaved crawler checks before accessing a site:
- Fetching — Before crawling any page on example.com, a crawler requests https://example.com/robots.txt.
- Parsing — The crawler reads the file and identifies rules that match its User-agent string. A wildcard User-agent: * applies to all crawlers.
- Rule matching — For each URL the crawler wants to visit, it checks whether the path matches any Disallow or Allow rules. More specific rules take precedence.
- Compliance — If a URL is disallowed, the crawler skips it. If allowed (or not mentioned), the crawler proceeds.
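The fetching step always targets the same well-known path at the domain root, regardless of which page the crawler intends to visit. A minimal sketch in Python (robots_url is a hypothetical helper name):

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    # robots.txt lives at a fixed, well-known path at the domain root,
    # so we keep only the scheme and host and discard path and query.
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://example.com/blog/post?id=1"))
# https://example.com/robots.txt
```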
A typical robots.txt file looks like this:
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /public/
User-agent: Googlebot
Allow: /
Sitemap: https://example.com/sitemap.xml
Key directives include:
- User-agent — Specifies which crawler the following rules apply to.
- Disallow — Blocks access to the specified path.
- Allow — Permits access to a path that would otherwise be disallowed by a broader rule.
- Sitemap — Points to the site's XML sitemap for discovery.
- Crawl-delay — Specifies a wait time (in seconds) between requests. Not all crawlers support this.
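A file like the example above can be exercised with Python's built-in urllib.robotparser, which implements this matching logic ("MyCrawler" is a made-up user agent for illustration):

```python
from urllib.robotparser import RobotFileParser

# The wildcard rules from the example file above
rules = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /public/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Disallowed path: a compliant crawler must skip it
print(parser.can_fetch("MyCrawler", "https://example.com/admin/login"))
# Explicitly allowed path: the crawler may proceed
print(parser.can_fetch("MyCrawler", "https://example.com/public/page"))
```

Against a live site you would instead call parser.set_url("https://example.com/robots.txt") followed by parser.read() to fetch and parse the real file in one step.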
Robots.txt in Web Scraping
Robots.txt plays an important role in ethical web scraping. Understanding and respecting it helps you maintain a good relationship with target websites and avoid legal issues:
- Legal significance — Courts have considered robots.txt violations in web scraping lawsuits. Respecting it demonstrates good faith.
- Content boundaries — Robots.txt often blocks paths to admin panels, user data, internal tools, and duplicate content — areas you typically don't want to scrape anyway.
- Sitemap discovery — The Sitemap directive is valuable for scrapers, pointing directly to a complete list of the site's pages.
- Rate guidance — The Crawl-delay directive tells you how much time to wait between requests, complementing your own rate limiting strategy.
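Both hints can be read programmatically. A minimal sketch using Python's urllib.robotparser (site_maps() requires Python 3.8+; the file contents and the "MyCrawler" user agent are made up for illustration):

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Crawl-delay: 5
Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# crawl_delay() returns None when the directive is absent,
# so fall back to your own default interval.
delay = parser.crawl_delay("MyCrawler") or 1
print(delay)               # 5
print(parser.site_maps())  # ['https://example.com/sitemap.xml']

# A polite crawl loop would then call time.sleep(delay) between requests.
```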
It's worth noting that robots.txt can block useful content too. Some sites broadly disallow crawling even of public content. In these cases, scrapers must weigh the ethical and legal implications of proceeding.
How SimpleCrawl Handles Robots.txt
SimpleCrawl provides built-in robots.txt awareness with sensible defaults:
- Automatic checking — Before accessing any page, SimpleCrawl fetches and parses the target site's robots.txt file.
- Default compliance — By default, SimpleCrawl respects all Disallow rules, ensuring your scraping operations are ethical and defensible.
- Transparent reporting — When a page is blocked by robots.txt, SimpleCrawl reports it clearly in the API response so you know exactly why a URL was skipped.
- Crawl-delay support — If a robots.txt specifies a Crawl-delay, SimpleCrawl respects it automatically, adjusting its request rate for that domain.
- Sitemap extraction — SimpleCrawl reads Sitemap directives and can use them to build comprehensive crawl plans.
Our free Robots.txt Checker tool lets you inspect any site's robots.txt rules before you start scraping.
Related Terms
- Web Crawling — Systematically navigating the web by following links
- Rate Limiting — Controlling request frequency to stay within server limits
- Web Scraping — Automated data extraction from websites
- Structured Data — Machine-readable data embedded in web pages
Ready to try SimpleCrawl?
We're building the simplest web scraping API for AI. Join the waitlist and get 500 free credits at launch.