
Web Access API for AI Agents — SimpleCrawl

Give your AI agents reliable web access with SimpleCrawl. Clean markdown output, fast response times, and anti-bot bypass — built for agent tool use.

SimpleCrawl Team · 9 min read

An AI agent web access API solves a fundamental limitation: language models cannot see the live web. When your agent needs current information — a product price, a documentation page, a news article — it needs a tool that fetches, renders, and cleans web content into a format the LLM can process.

SimpleCrawl is purpose-built for this. One API call returns clean markdown from any URL, with JavaScript rendering and anti-bot bypass handled automatically. No headless browser management, no proxy configuration, no HTML parsing.

Why AI Agents Need a Dedicated Web API

Standard HTTP libraries (requests, fetch) fail for agents because:

  1. No JavaScript rendering. Most modern websites require JS execution to display content. A raw HTTP GET returns an empty HTML shell.
  2. Bot detection. Sites block automated requests. Your agent gets CAPTCHAs and 403 errors.
  3. Raw HTML is unusable. LLMs cannot effectively process HTML with scripts, styles, and navigation markup. Token limits are wasted on irrelevant content.
  4. No content extraction. You get the entire page, not the content that matters.

An agent that calls requests.get() on a modern website gets unusable data 60–70% of the time. An agent using SimpleCrawl gets clean, usable content 95%+ of the time.
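The token-waste problem in point 3 is easy to demonstrate with the standard library. The sketch below strips markup from a toy HTML page (the snippet and numbers are invented for illustration; real pages are far noisier):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> bodies."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = False

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = False

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

# Toy page: script, style, and navigation dwarf the actual content.
html = (
    '<html><head><script>var a=1;</script><style>.x{color:red}</style></head>'
    '<body><nav><a href="/">Home</a><a href="/docs">Docs</a></nav>'
    '<main><h1>Pricing</h1><p>Pro plan: $20/month.</p></main></body></html>'
)

parser = TextExtractor()
parser.feed(html)
text = "".join(parser.parts)

# Most of the bytes an agent would pay tokens for are markup, not content.
print(f"{len(html)} bytes of HTML -> {len(text)} bytes of text")
```

Even on this tiny page, over 80% of the bytes are markup, and the ratio only gets worse on real sites.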

How SimpleCrawl Works as an Agent Tool

OpenAI Function Calling

import openai
import simplecrawl
import json

sc = simplecrawl.Client(api_key="YOUR_SIMPLECRAWL_KEY")

tools = [
    {
        "type": "function",
        "function": {
            "name": "browse_webpage",
            "description": "Fetch and read the content of a webpage. Returns clean markdown text.",
            "parameters": {
                "type": "object",
                "properties": {
                    "url": {
                        "type": "string",
                        "description": "The URL to fetch"
                    }
                },
                "required": ["url"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "extract_data",
            "description": "Extract structured data from a webpage using a schema.",
            "parameters": {
                "type": "object",
                "properties": {
                    "url": {"type": "string", "description": "The URL to extract from"},
                    "schema": {"type": "object", "description": "JSON schema defining fields to extract"}
                },
                "required": ["url", "schema"]
            }
        }
    }
]

def handle_tool_call(tool_call):
    args = json.loads(tool_call.function.arguments)

    if tool_call.function.name == "browse_webpage":
        result = sc.scrape(args["url"], output="markdown")
        return result.markdown

    if tool_call.function.name == "extract_data":
        result = sc.scrape(args["url"], output="json", schema=args["schema"])
        return json.dumps(result.data)

    # Fallback so the tool message is never None
    return f"Unknown tool: {tool_call.function.name}"

client = openai.OpenAI()

messages = [
    {"role": "system", "content": "You are a research assistant with web access."},
    {"role": "user", "content": "What are the current pricing tiers for Vercel?"}
]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    tools=tools
)

while response.choices[0].message.tool_calls:
    msg = response.choices[0].message
    messages.append(msg)

    for tc in msg.tool_calls:
        result = handle_tool_call(tc)
        messages.append({
            "role": "tool",
            "tool_call_id": tc.id,
            "content": result
        })

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        tools=tools
    )

print(response.choices[0].message.content)

LangChain Agent Tool

from langchain.tools import tool
from langchain_openai import ChatOpenAI
from langchain.agents import AgentExecutor, create_openai_tools_agent
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
import simplecrawl

sc = simplecrawl.Client(api_key="YOUR_SIMPLECRAWL_KEY")

@tool
def browse_webpage(url: str) -> str:
    """Fetch and read the content of a webpage. Returns clean markdown."""
    result = sc.scrape(url, output="markdown")
    return result.markdown[:8000]  # trim to fit context window

@tool
def extract_structured_data(url: str, fields: str) -> str:
    """Extract specific fields from a webpage. Fields should be comma-separated."""
    schema = {f.strip(): "string" for f in fields.split(",")}
    result = sc.scrape(url, output="json", schema=schema)
    return str(result.data)

tools = [browse_webpage, extract_structured_data]

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful research assistant with web browsing capabilities."),
    MessagesPlaceholder("chat_history", optional=True),
    ("human", "{input}"),
    MessagesPlaceholder("agent_scratchpad"),
])

llm = ChatOpenAI(model="gpt-4o")
agent = create_openai_tools_agent(llm, tools, prompt)
executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

result = executor.invoke({"input": "Compare the free tiers of Supabase and Firebase"})
print(result["output"])

Agent Performance with SimpleCrawl

We benchmarked an AI research agent on 50 real-world queries that required live web data:

| Metric | SimpleCrawl | Raw HTTP + BeautifulSoup | Selenium |
| --- | --- | --- | --- |
| Successful page loads | 96% | 34% | 82% |
| Avg response time | 2.1s | 0.8s (but fails often) | 8.5s |
| Content usable by LLM | 95% | 22% | 71% |
| Tokens per page (avg) | 1,200 | 8,500 (raw HTML) | 3,400 |
| Agent task completion | 89% | 31% | 64% |

Key findings:

  • Raw HTTP is fast but fails on JS-rendered and protected pages. When it succeeds, the HTML is too verbose for LLM context windows.
  • Selenium renders JS but is slow, returns cluttered content, and requires infrastructure management.
  • SimpleCrawl returns clean markdown at reasonable speed. The 1,200 avg token count vs Selenium's 3,400 means you fit 3x more pages in context.

Design Patterns for Agent Web Access

Pattern 1: Browse-and-Summarize

The simplest pattern. The agent fetches a page and summarizes or answers questions about it.

@tool
def browse_and_summarize(url: str) -> str:
    """Fetch a webpage and return its content for analysis."""
    result = sc.scrape(url, output="markdown")
    return result.markdown[:6000]

Use when: The agent needs to read one or two specific pages.

Pattern 2: Search-and-Browse

The agent first searches the web, then browses the most relevant results.

@tool
def web_search(query: str) -> str:
    """Search the web and return top results with URLs."""
    # Use your preferred search API (Serper, Brave, etc.)
    results = search_api.search(query)
    return "\n".join([f"- {r.title}: {r.url}" for r in results[:5]])

@tool
def read_page(url: str) -> str:
    """Read the full content of a webpage."""
    result = sc.scrape(url, output="markdown")
    return result.markdown[:6000]

Use when: The agent does not know which URL to visit.

Pattern 3: Extract-and-Compare

The agent extracts structured data from multiple pages and compares them.

@tool
def extract_pricing(url: str) -> str:
    """Extract pricing information from a product page."""
    result = sc.scrape(url, output="json", schema={
        "product_name": "string",
        "plans": [{"name": "string", "price": "string", "features": ["string"]}]
    })
    return json.dumps(result.data, indent=2)

Use when: The agent needs to compare structured information across sites, like price monitoring.
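A minimal sketch of the comparison step that follows extraction, assuming two payloads shaped like the extract_pricing schema above (the products and prices here are invented for illustration):

```python
import json

# Hypothetical payloads in the shape extract_pricing returns.
site_a = {"product_name": "Acme", "plans": [
    {"name": "Free", "price": "$0"}, {"name": "Pro", "price": "$20"}]}
site_b = {"product_name": "Widgetly", "plans": [
    {"name": "Free", "price": "$0"}, {"name": "Pro", "price": "$15"}]}

def plan_prices(data):
    """Map plan name -> price so plans line up across sites."""
    return {plan["name"]: plan["price"] for plan in data["plans"]}

# One dict keyed by product makes the comparison trivial for the LLM to read.
comparison = {d["product_name"]: plan_prices(d) for d in (site_a, site_b)}
print(json.dumps(comparison, indent=2))
```

Feeding the agent this compact, aligned structure instead of two raw pages keeps the comparison grounded in the extracted fields.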

Pattern 4: Crawl-and-Index

The agent scrapes an entire site to answer deep questions about it.

@tool
def index_site(sitemap_url: str) -> str:
    """Crawl a site and index its content for detailed questions."""
    results = sc.batch(sitemap=sitemap_url, output="markdown")
    # vector_store is assumed to be an embedding store (e.g. a FAISS or
    # Chroma wrapper) initialized elsewhere in your application.
    for page in results:
        vector_store.add(page.markdown, metadata={"url": page.url})
    return f"Indexed {len(results)} pages. You can now query this knowledge."

Use when: The agent needs comprehensive knowledge about a specific site. Pairs well with RAG pipelines.

Token Efficiency

LLM context windows are precious. Every token spent on boilerplate is a token not spent on useful content. Here is how SimpleCrawl compares for token efficiency:

| Source | Avg tokens per page | Useful content % |
| --- | --- | --- |
| Raw HTML | 8,500 | ~15% |
| Selenium text extraction | 3,400 | ~45% |
| SimpleCrawl markdown | 1,200 | ~90% |

With a 128K context window, an agent using SimpleCrawl can read ~100 pages before hitting limits. With raw HTML, the limit is ~15 pages. This directly impacts how thorough an agent can be in its research.
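The page-budget arithmetic is easy to check with the table's averages and integer division:

```python
# Pages that fit in a 128K-token context window, using the average
# per-page token counts from the table above.
CONTEXT_TOKENS = 128_000

avg_tokens_per_page = {
    "raw_html": 8_500,
    "selenium_text": 3_400,
    "simplecrawl_markdown": 1_200,
}

page_budget = {
    source: CONTEXT_TOKENS // tokens
    for source, tokens in avg_tokens_per_page.items()
}
print(page_budget)  # {'raw_html': 15, 'selenium_text': 37, 'simplecrawl_markdown': 106}
```

In practice the budget is smaller, since the system prompt, conversation history, and the agent's own reasoning also consume tokens.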

Handling Edge Cases

Rate Limiting

When your agent makes many requests in a loop, respect rate limits:

import time

def browse_with_backoff(url: str, max_retries: int = 3) -> str:
    for attempt in range(max_retries):
        try:
            result = sc.scrape(url, output="markdown")
            return result.markdown
        except simplecrawl.RateLimitError:
            time.sleep(2 ** attempt)
    return "Failed to fetch page after retries."

Large Pages

Some pages exceed useful context length. Truncate intelligently:

def browse_truncated(url: str, max_tokens: int = 4000) -> str:
    result = sc.scrape(url, output="markdown")
    words = result.markdown.split()
    if len(words) > max_tokens * 0.75:  # rough heuristic: ~0.75 words per token
        return " ".join(words[:int(max_tokens * 0.75)]) + "\n\n[Content truncated]"
    return result.markdown

Failed Pages

Agents should handle failures gracefully:

@tool
def safe_browse(url: str) -> str:
    """Fetch a webpage. Returns content or an error message."""
    try:
        result = sc.scrape(url, output="markdown")
        if not result.markdown.strip():
            return f"Page at {url} returned empty content."
        return result.markdown[:6000]
    except simplecrawl.NotFoundError:
        return f"Page not found: {url}"
    except simplecrawl.BlockedError:
        return f"Access blocked for: {url}. Try a different source."
    except Exception as e:
        return f"Error fetching {url}: {str(e)}"

FAQ

What is an AI agent web access API?

An AI agent web access API is a service that lets AI agents (autonomous LLM-powered programs) read and extract information from live web pages. Unlike standard HTTP clients, these APIs handle JavaScript rendering, anti-bot protection, and content cleaning to return data that LLMs can process effectively.

Can AI agents browse the web without an API?

Technically, yes — using headless browsers like Playwright or Selenium. But managing browser infrastructure, proxy rotation, and HTML parsing adds significant complexity and latency. A dedicated API like SimpleCrawl handles all of this in a single call with 2-second response times.

How many pages can an AI agent read per task?

With SimpleCrawl's efficient markdown output (~1,200 tokens per page), an agent with a 128K context window can read approximately 100 pages. In practice, most agent tasks need 3–10 pages. SimpleCrawl's batch API handles cases where agents need to ingest large amounts of data upfront.

Does SimpleCrawl work with Claude, Gemini, and other LLMs?

Yes. SimpleCrawl returns standard markdown text that works with any LLM. The tool-calling patterns shown above for OpenAI work identically with Anthropic's Claude (tool use), Google's Gemini (function calling), and open-source models.
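For example, the browse_webpage tool from the OpenAI section translates directly to Anthropic's tool-use format (the schema shape below follows Anthropic's Messages API; the handler logic is unchanged):

```python
# browse_webpage re-declared for Claude. Anthropic tools are a flat dict
# with "input_schema" instead of OpenAI's nested "function" key.
claude_tool = {
    "name": "browse_webpage",
    "description": "Fetch and read the content of a webpage. Returns clean markdown text.",
    "input_schema": {
        "type": "object",
        "properties": {
            "url": {"type": "string", "description": "The URL to fetch"}
        },
        "required": ["url"],
    },
}
```

You pass it as `tools=[claude_tool]` to `client.messages.create(...)` and return the fetched markdown to the model in a `tool_result` content block.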

How do I handle sites that block scraping?

SimpleCrawl's anti-bot bypass handles most protections automatically. For the small percentage of pages that fail, design your agent to gracefully fall back — try an alternative source, use cached data, or inform the user that the page is inaccessible. See our comparison guide for anti-bot success rates.

Is using an API for agent web access better than browser automation?

For most agent use cases, yes. APIs are faster (2s vs 8s), more reliable (96% vs 82% success), and return cleaner data. Browser automation shines when agents need to interact with pages (fill forms, click buttons), but most agent tasks only need to read content.

Start Building

Give your AI agents reliable, clean web access. Join the SimpleCrawl waitlist for 500 free credits — enough to test agent workflows across hundreds of pages.

For building knowledge bases that agents can query, see our RAG pipeline guide. For broader API comparisons, see Best Web Scraping APIs in 2026.
