Web Access API for AI Agents — SimpleCrawl
Give your AI agents reliable web access with SimpleCrawl. Clean markdown output, fast response times, and anti-bot bypass — built for agent tool use.
An AI agent web access API solves a fundamental limitation: language models cannot see the live web. When your agent needs current information — a product price, a documentation page, a news article — it needs a tool that fetches, renders, and cleans web content into a format the LLM can process.
SimpleCrawl is purpose-built for this. One API call returns clean markdown from any URL, with JavaScript rendering and anti-bot bypass handled automatically. No headless browser management, no proxy configuration, no HTML parsing.
Why AI Agents Need a Dedicated Web API
Standard HTTP libraries (requests, fetch) fail for agents because:
- No JavaScript rendering. Most modern websites require JS execution to show content. A raw HTTP GET returns an empty application shell.
- Bot detection. Sites block automated requests. Your agent gets CAPTCHAs and 403 errors.
- Raw HTML is unusable. LLMs cannot effectively process HTML with scripts, styles, and navigation markup. Token limits are wasted on irrelevant content.
- No content extraction. You get the entire page, not the content that matters.
An agent that calls requests.get() on a modern website gets unusable data 60–70% of the time. An agent using SimpleCrawl gets clean, usable content 95%+ of the time.
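To see why raw HTML wastes an LLM's token budget, here is a rough, self-contained illustration (not SimpleCrawl's actual pipeline): even after stripping scripts and styles with Python's built-in parser, the payload shrinks dramatically, yet navigation noise still survives — which is exactly what real content extraction has to remove.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

# A toy page: the "content" is two short lines buried in markup.
raw_html = """
<html><head><style>body { margin: 0; }</style>
<script>window.analytics = {track: function () {}};</script></head>
<body><nav>Home | Docs | Pricing</nav>
<article><h1>Widget 9000</h1><p>Price: $49/mo</p></article>
</body></html>
"""

parser = TextExtractor()
parser.feed(raw_html)
text = " ".join(parser.parts)
print(len(raw_html), len(text))  # the raw payload is several times larger
```

Note that "Home | Docs | Pricing" survives the strip: removing scripts is easy, but separating content from chrome is the part that needs real extraction.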
How SimpleCrawl Works as an Agent Tool
OpenAI Function Calling
```python
import json

import openai
import simplecrawl

sc = simplecrawl.Client(api_key="YOUR_SIMPLECRAWL_KEY")

tools = [
    {
        "type": "function",
        "function": {
            "name": "browse_webpage",
            "description": "Fetch and read the content of a webpage. Returns clean markdown text.",
            "parameters": {
                "type": "object",
                "properties": {
                    "url": {
                        "type": "string",
                        "description": "The URL to fetch"
                    }
                },
                "required": ["url"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "extract_data",
            "description": "Extract structured data from a webpage using a schema.",
            "parameters": {
                "type": "object",
                "properties": {
                    "url": {"type": "string", "description": "The URL to extract from"},
                    "schema": {"type": "object", "description": "JSON schema defining fields to extract"}
                },
                "required": ["url", "schema"]
            }
        }
    }
]

def handle_tool_call(tool_call):
    args = json.loads(tool_call.function.arguments)
    if tool_call.function.name == "browse_webpage":
        result = sc.scrape(args["url"], output="markdown")
        return result.markdown
    if tool_call.function.name == "extract_data":
        result = sc.scrape(args["url"], output="json", schema=args["schema"])
        return json.dumps(result.data)

client = openai.OpenAI()
messages = [
    {"role": "system", "content": "You are a research assistant with web access."},
    {"role": "user", "content": "What are the current pricing tiers for Vercel?"}
]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    tools=tools
)

# Loop until the model stops requesting tool calls.
while response.choices[0].message.tool_calls:
    msg = response.choices[0].message
    messages.append(msg)
    for tc in msg.tool_calls:
        result = handle_tool_call(tc)
        messages.append({
            "role": "tool",
            "tool_call_id": tc.id,
            "content": result
        })
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        tools=tools
    )

print(response.choices[0].message.content)
```
LangChain Agent Tool
```python
import simplecrawl
from langchain.tools import tool
from langchain_openai import ChatOpenAI
from langchain.agents import AgentExecutor, create_openai_tools_agent
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

sc = simplecrawl.Client(api_key="YOUR_SIMPLECRAWL_KEY")

@tool
def browse_webpage(url: str) -> str:
    """Fetch and read the content of a webpage. Returns clean markdown."""
    result = sc.scrape(url, output="markdown")
    return result.markdown[:8000]  # trim to fit context window

@tool
def extract_structured_data(url: str, fields: str) -> str:
    """Extract specific fields from a webpage. Fields should be comma-separated."""
    schema = {f.strip(): "string" for f in fields.split(",")}
    result = sc.scrape(url, output="json", schema=schema)
    return str(result.data)

tools = [browse_webpage, extract_structured_data]

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful research assistant with web browsing capabilities."),
    MessagesPlaceholder("chat_history", optional=True),
    ("human", "{input}"),
    MessagesPlaceholder("agent_scratchpad"),
])

llm = ChatOpenAI(model="gpt-4o")
agent = create_openai_tools_agent(llm, tools, prompt)
executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

result = executor.invoke({"input": "Compare the free tiers of Supabase and Firebase"})
print(result["output"])
```
Agent Performance with SimpleCrawl
We benchmarked an AI research agent on 50 real-world queries that required live web data:
| Metric | SimpleCrawl | Raw HTTP + BeautifulSoup | Selenium |
|---|---|---|---|
| Successful page loads | 96% | 34% | 82% |
| Avg response time | 2.1s | 0.8s (but fails often) | 8.5s |
| Content usable by LLM | 95% | 22% | 71% |
| Tokens per page (avg) | 1,200 | 8,500 (raw HTML) | 3,400 |
| Agent task completion | 89% | 31% | 64% |
Key findings:
- Raw HTTP is fast but fails on JS-rendered and protected pages. When it succeeds, the HTML is too verbose for LLM context windows.
- Selenium renders JS but is slow, returns cluttered content, and requires infrastructure management.
- SimpleCrawl returns clean markdown at reasonable speed. At 1,200 avg tokens per page vs Selenium's 3,400, you fit nearly 3x more pages in the same context window.
Design Patterns for Agent Web Access
Pattern 1: Browse-and-Summarize
The simplest pattern. The agent fetches a page and summarizes or answers questions about it.
```python
@tool
def browse_and_summarize(url: str) -> str:
    """Fetch a webpage and return its content for analysis."""
    result = sc.scrape(url, output="markdown")
    return result.markdown[:6000]
```
Use when: The agent needs to read one or two specific pages.
Pattern 2: Search-and-Browse
The agent first searches the web, then browses the most relevant results.
```python
@tool
def web_search(query: str) -> str:
    """Search the web and return top results with URLs."""
    # Use your preferred search API (Serper, Brave, etc.)
    results = search_api.search(query)
    return "\n".join([f"- {r.title}: {r.url}" for r in results[:5]])

@tool
def read_page(url: str) -> str:
    """Read the full content of a webpage."""
    result = sc.scrape(url, output="markdown")
    return result.markdown[:6000]
```
Use when: The agent does not know which URL to visit.
Pattern 3: Extract-and-Compare
The agent extracts structured data from multiple pages and compares them.
```python
import json

@tool
def extract_pricing(url: str) -> str:
    """Extract pricing information from a product page."""
    result = sc.scrape(url, output="json", schema={
        "product_name": "string",
        "plans": [{"name": "string", "price": "string", "features": ["string"]}]
    })
    return json.dumps(result.data, indent=2)
```
Use when: The agent needs to compare structured information across sites, like price monitoring.
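Once structured data comes back from several pages, a plain helper can merge it into something the model (or the user) reads at a glance. A sketch with hypothetical extraction results, shaped like the pricing schema above:

```python
# Hypothetical extraction results for two vendors (product names and
# plans are made up for illustration).
vendor_a = {"product_name": "Acme Cloud",
            "plans": [{"name": "Free", "price": "$0"},
                      {"name": "Pro", "price": "$20/mo"}]}
vendor_b = {"product_name": "Beta Host",
            "plans": [{"name": "Starter", "price": "$0"},
                      {"name": "Team", "price": "$25/mo"}]}

def comparison_table(*products):
    """Render extracted pricing dicts as a markdown table."""
    rows = ["| Product | Plan | Price |", "|---|---|---|"]
    for p in products:
        for plan in p["plans"]:
            rows.append(f"| {p['product_name']} | {plan['name']} | {plan['price']} |")
    return "\n".join(rows)

print(comparison_table(vendor_a, vendor_b))
```

Returning the comparison as a compact markdown table, rather than two raw JSON blobs, also keeps the agent's context usage down.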
Pattern 4: Crawl-and-Index
The agent scrapes an entire site to answer deep questions about it.
```python
@tool
def index_site(sitemap_url: str) -> str:
    """Crawl a site and index its content for detailed questions."""
    results = sc.batch(sitemap=sitemap_url, output="markdown")
    for page in results:
        # vector_store is whatever embedding store your RAG stack uses.
        vector_store.add(page.markdown, metadata={"url": page.url})
    return f"Indexed {len(results)} pages. You can now query this knowledge."
```
Use when: The agent needs comprehensive knowledge about a specific site. Pairs well with RAG pipelines.
Token Efficiency
LLM context windows are precious. Every token spent on boilerplate is a token not spent on useful content. Here is how SimpleCrawl compares for token efficiency:
| Source | Avg tokens per page | Useful content % |
|---|---|---|
| Raw HTML | 8,500 | ~15% |
| Selenium text extraction | 3,400 | ~45% |
| SimpleCrawl markdown | 1,200 | ~90% |
With a 128K context window, an agent using SimpleCrawl can read ~100 pages before hitting limits. With raw HTML, the limit is ~15 pages. This directly impacts how thorough an agent can be in its research.
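The arithmetic behind those page counts is simple to reproduce. A sketch, assuming roughly 8K tokens reserved for the system prompt, history, and the answer (my assumption; adjust for your agent):

```python
CONTEXT_WINDOW = 128_000
RESERVED = 8_000  # assumed allowance for prompt, history, and answer

tokens_per_page = {
    "raw HTML": 8_500,
    "Selenium text": 3_400,
    "SimpleCrawl markdown": 1_200,
}

# Pages that fit in the remaining budget, per content source.
pages_per_window = {
    source: (CONTEXT_WINDOW - RESERVED) // per_page
    for source, per_page in tokens_per_page.items()
}
print(pages_per_window)
```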
Handling Edge Cases
Rate Limiting
When your agent makes many requests in a loop, respect rate limits:
```python
import time

def browse_with_backoff(url: str, max_retries: int = 3) -> str:
    for attempt in range(max_retries):
        try:
            result = sc.scrape(url, output="markdown")
            return result.markdown
        except simplecrawl.RateLimitError:
            time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s
    return "Failed to fetch page after retries."
```
Large Pages
Some pages exceed useful context length. Truncate intelligently:
```python
def browse_truncated(url: str, max_tokens: int = 4000) -> str:
    result = sc.scrape(url, output="markdown")
    words = result.markdown.split()
    # Rough heuristic: one token is about 0.75 words of English text.
    if len(words) > max_tokens * 0.75:
        return " ".join(words[:int(max_tokens * 0.75)]) + "\n\n[Content truncated]"
    return result.markdown
```
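A hard word-count cut can split a sentence in half. A gentler variant (a sketch, not part of any SDK) keeps whole paragraphs until the budget runs out:

```python
def truncate_at_boundary(markdown: str, max_words: int = 3000) -> str:
    """Keep whole paragraphs until the word budget is spent, so the
    model never sees a sentence cut mid-way."""
    kept, used = [], 0
    for para in markdown.split("\n\n"):
        n = len(para.split())
        # Stop before the paragraph that would blow the budget
        # (but always keep at least one paragraph).
        if kept and used + n > max_words:
            kept.append("[Content truncated]")
            break
        kept.append(para)
        used += n
    return "\n\n".join(kept)
```

The `[Content truncated]` marker matters: it tells the model the page continues, so it can ask to fetch more rather than assume it saw everything.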
Failed Pages
Agents should handle failures gracefully:
```python
@tool
def safe_browse(url: str) -> str:
    """Fetch a webpage. Returns content or an error message."""
    try:
        result = sc.scrape(url, output="markdown")
        if not result.markdown.strip():
            return f"Page at {url} returned empty content."
        return result.markdown[:6000]
    except simplecrawl.NotFoundError:
        return f"Page not found: {url}"
    except simplecrawl.BlockedError:
        return f"Access blocked for: {url}. Try a different source."
    except Exception as e:
        return f"Error fetching {url}: {e}"
```
FAQ
What is an AI agent web access API?
An AI agent web access API is a service that lets AI agents (autonomous LLM-powered programs) read and extract information from live web pages. Unlike standard HTTP clients, these APIs handle JavaScript rendering, anti-bot protection, and content cleaning to return data that LLMs can process effectively.
Can AI agents browse the web without an API?
Technically, yes — using headless browsers like Playwright or Selenium. But managing browser infrastructure, proxy rotation, and HTML parsing adds significant complexity and latency. A dedicated API like SimpleCrawl handles all of this in a single call with 2-second response times.
How many pages can an AI agent read per task?
With SimpleCrawl's efficient markdown output (~1,200 tokens per page), an agent with a 128K context window can read approximately 100 pages. In practice, most agent tasks need 3–10 pages. SimpleCrawl's batch API handles cases where agents need to ingest large amounts of data upfront.
Does SimpleCrawl work with Claude, Gemini, and other LLMs?
Yes. SimpleCrawl returns standard markdown text that works with any LLM. The tool-calling patterns shown above for OpenAI work identically with Anthropic's Claude (tool use), Google's Gemini (function calling), and open-source models.
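For example, Claude's tool-use format wraps the same JSON Schema in a slightly different envelope (`input_schema` instead of `parameters`). A minimal converter for the browse_webpage tool defined earlier:

```python
# The OpenAI-format browse_webpage tool, restated for Anthropic tool use.
openai_tool = {
    "type": "function",
    "function": {
        "name": "browse_webpage",
        "description": "Fetch and read the content of a webpage. Returns clean markdown text.",
        "parameters": {
            "type": "object",
            "properties": {"url": {"type": "string", "description": "The URL to fetch"}},
            "required": ["url"],
        },
    },
}

def to_anthropic(tool):
    """Convert an OpenAI function tool to Anthropic's shape: the JSON
    Schema body moves from `parameters` to `input_schema`."""
    fn = tool["function"]
    return {
        "name": fn["name"],
        "description": fn["description"],
        "input_schema": fn["parameters"],
    }

anthropic_tool = to_anthropic(openai_tool)
```

The tool handler itself (`sc.scrape(...)` and the result formatting) is unchanged; only the declaration envelope differs per provider.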
How do I handle sites that block scraping?
SimpleCrawl's anti-bot bypass handles most protections automatically. For the small percentage of pages that fail, design your agent to gracefully fall back — try an alternative source, use cached data, or inform the user that the page is inaccessible. See our comparison guide for anti-bot success rates.
Is using an API for agent web access better than browser automation?
For most agent use cases, yes. APIs are faster (2s vs 8s), more reliable (96% vs 82% success), and return cleaner data. Browser automation shines when agents need to interact with pages (fill forms, click buttons), but most agent tasks only need to read content.
Start Building
Give your AI agents reliable, clean web access. Join the SimpleCrawl waitlist for 500 free credits — enough to test agent workflows across hundreds of pages.
For building knowledge bases that agents can query, see our RAG pipeline guide. For broader API comparisons, see Best Web Scraping APIs in 2026.