What is Robots.txt? — SimpleCrawl Glossary
Robots.txt is a text file that tells web crawlers which pages they are allowed or not allowed to access on a website. Learn how it works and how to respect it.
Definition
Robots.txt is a plain text file placed at the root of a website (e.g., https://example.com/robots.txt) that provides instructions to web crawlers about which pages or sections of the site they are allowed to access. It follows the Robots Exclusion Protocol, a standard that has been in use since 1994.
The file uses a simple syntax of User-agent directives (specifying which crawler the rules apply to) and Disallow/Allow rules (specifying which paths are restricted or permitted). While robots.txt is advisory — crawlers are not technically forced to obey it — respecting it is considered a fundamental ethical practice in web crawling and web scraping.
How Robots.txt Works
The robots.txt file sits at the root of a domain and is the first thing a well-behaved crawler checks before accessing a site:
- Fetching — Before crawling any page on example.com, a crawler requests https://example.com/robots.txt.
- Parsing — The crawler reads the file and identifies rules that match its User-agent string. A wildcard User-agent: * applies to all crawlers.
- Rule matching — For each URL the crawler wants to visit, it checks whether the path matches any Disallow or Allow rules. More specific rules take precedence.
- Compliance — If a URL is disallowed, the crawler skips it. If allowed (or not mentioned), the crawler proceeds.
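The fetching step always targets the same well-known path at the domain root, regardless of which page the crawler intends to visit. A minimal sketch in Python (robots_url is a hypothetical helper name):

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    # robots.txt lives at a fixed, well-known path at the domain root,
    # so we keep only the scheme and host and discard path and query.
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://example.com/blog/post?id=1"))
# https://example.com/robots.txt
```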
A typical robots.txt file looks like this:
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /public/
User-agent: Googlebot
Allow: /
Sitemap: https://example.com/sitemap.xml
Key directives include:
- User-agent — Specifies which crawler the following rules apply to.
- Disallow — Blocks access to the specified path.
- Allow — Permits access to a path that would otherwise be disallowed by a broader rule.
- Sitemap — Points to the site's XML sitemap for discovery.
- Crawl-delay — Specifies a wait time (in seconds) between requests. Not all crawlers support this.
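A file like the example above can be exercised with Python's built-in urllib.robotparser, which implements this matching logic ("MyCrawler" is a made-up user agent for illustration):

```python
from urllib.robotparser import RobotFileParser

# The wildcard rules from the example file above
rules = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /public/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Disallowed path: a compliant crawler must skip it
print(parser.can_fetch("MyCrawler", "https://example.com/admin/login"))
# Explicitly allowed path: the crawler may proceed
print(parser.can_fetch("MyCrawler", "https://example.com/public/page"))
```

Against a live site you would instead call parser.set_url("https://example.com/robots.txt") followed by parser.read() to fetch and parse the real file in one step.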
Robots.txt in Web Scraping
Robots.txt plays an important role in ethical web scraping. Understanding and respecting it helps you maintain a good relationship with target websites and avoid legal issues:
- Legal significance — Courts have considered robots.txt violations in web scraping lawsuits. Respecting it demonstrates good faith.
- Content boundaries — Robots.txt often blocks paths to admin panels, user data, internal tools, and duplicate content — areas you typically don't want to scrape anyway.
- Sitemap discovery — The Sitemap directive is valuable for scrapers, pointing directly to a complete list of the site's pages.
- Rate guidance — The Crawl-delay directive tells you how much time to wait between requests, complementing your own rate limiting strategy.
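Both hints can be read programmatically. A minimal sketch using Python's urllib.robotparser (site_maps() requires Python 3.8+; the file contents and the "MyCrawler" user agent are made up for illustration):

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Crawl-delay: 5
Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# crawl_delay() returns None when the directive is absent,
# so fall back to your own default interval.
delay = parser.crawl_delay("MyCrawler") or 1
print(delay)               # 5
print(parser.site_maps())  # ['https://example.com/sitemap.xml']

# A polite crawl loop would then call time.sleep(delay) between requests.
```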
It's worth noting that robots.txt can block useful content too. Some sites broadly disallow crawling even of public content. In these cases, scrapers must weigh the ethical and legal implications of proceeding.
How SimpleCrawl Handles Robots.txt
SimpleCrawl provides built-in robots.txt awareness with sensible defaults:
- Automatic checking — Before accessing any page, SimpleCrawl fetches and parses the target site's robots.txt file.
- Default compliance — By default, SimpleCrawl respects all Disallow rules, ensuring your scraping operations are ethical and defensible.
- Transparent reporting — When a page is blocked by robots.txt, SimpleCrawl reports it clearly in the API response so you know exactly why a URL was skipped.
- Crawl-delay support — If a robots.txt specifies a Crawl-delay, SimpleCrawl respects it automatically, adjusting its request rate for that domain.
- Sitemap extraction — SimpleCrawl reads Sitemap directives and can use them to build comprehensive crawl plans.
Our free Robots.txt Checker tool lets you inspect any site's robots.txt rules before you start scraping.
Related Terms
- Web Crawling — Systematically navigating the web by following links
- Rate Limiting — Controlling request frequency to stay within server limits
- Web Scraping — Automated data extraction from websites
- Structured Data — Machine-readable data embedded in web pages
Ready to try SimpleCrawl?
We're building the simplest web scraping API for AI. Join the waitlist and get 500 free credits at launch.