URL to Markdown: The Complete Developer Guide
Learn how to convert any website to clean markdown. Covers libraries, APIs, and custom solutions — with code examples in Python, Node.js, and shell commands.
What Does "URL to Markdown" Mean?
Converting a URL to markdown means fetching a web page and transforming its HTML content into clean, readable markdown text — preserving headings, lists, links, code blocks, and tables while stripping away navigation, ads, scripts, and other noise.
Quick answer: To convert a URL to markdown, use a tool that fetches the page, extracts the main content, and converts the HTML structure to markdown syntax. You can do this with the SimpleCrawl API, Python libraries like `markdownify` and `trafilatura`, or the free URL to Markdown tool on our site.
This is one of the most requested operations in modern development. The use cases include:
- RAG pipelines — Feed clean web content to LLMs (see our RAG pipeline guide)
- Documentation archival — Save web docs in a portable, version-controllable format
- Content migration — Move content between CMS platforms
- AI context — Give AI coding assistants project documentation as markdown context
- Read-it-later — Convert articles to distraction-free markdown for reading
The Three Challenges of URL-to-Markdown Conversion
Converting HTML to markdown sounds simple until you try it. There are three hard problems:
1. Content Extraction
A web page's HTML contains the article content mixed with headers, footers, navigation, sidebars, ads, cookie banners, and more. You need to identify and extract just the main content.
2. Structure Preservation
Good conversion preserves the document's semantic structure: heading hierarchy, ordered vs unordered lists, table alignment, code block language hints, and link destinations. A naive tag-stripping approach loses all of this.
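To see what naive stripping loses, here is a quick stdlib-only illustration (a contrived example, not one of the converters covered in this guide):

```python
import re

html = "<h2>Install</h2><p>Run <code>pip install foo</code> first.</p><ul><li>Step one</li><li>Step two</li></ul>"

# Naive tag stripping: replace every tag with a space, collapse whitespace.
text = re.sub(r"<[^>]+>", " ", html)
text = re.sub(r"\s+", " ", text).strip()

print(text)
# Install Run pip install foo first. Step one Step two
```

The heading level, the inline code marker, and the list structure are all gone; a structure-aware converter would emit `## Install`, backticks, and `-` bullets instead.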
3. JavaScript Rendering
Modern websites render content with JavaScript. If you fetch the raw HTML, you might get an empty page shell. The conversion tool needs to render JavaScript first. For more on this, see our guide on scraping JavaScript-heavy websites.
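You can see the problem without a network request. Below is the kind of HTML shell a client-side-rendered app typically serves (a contrived example), and what a plain text extractor finds in it:

```python
from html.parser import HTMLParser

# Typical shell served by a client-side-rendered app: the real content
# only appears after /bundle.js runs in a browser.
raw_html = """<html><head><title>Docs</title></head>
<body><div id="root"></div><script src="/bundle.js"></script></body></html>"""

class TextExtractor(HTMLParser):
    """Collects visible text, skipping script/style contents."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self.skip = False
    def handle_starttag(self, tag, attrs):
        self.skip = tag in ("script", "style")
    def handle_endtag(self, tag):
        self.skip = False
    def handle_data(self, data):
        if not self.skip and data.strip():
            self.chunks.append(data.strip())

parser = TextExtractor()
parser.feed(raw_html)
print(parser.chunks)  # ['Docs'] - just the <title>; the page body is empty
```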
Method 1: SimpleCrawl API (Recommended)
The fastest path from URL to clean markdown is an API call. SimpleCrawl handles fetching, JavaScript rendering, content extraction, and conversion in one request:
```python
import requests

response = requests.post(
    "https://api.simplecrawl.com/v1/scrape",
    headers={"Authorization": "Bearer sc_your_api_key"},
    json={
        "url": "https://react.dev/learn/thinking-in-react",
        "format": "markdown"
    }
)

data = response.json()
print(data["markdown"])
```
Output:
```markdown
# Thinking in React

React can change how you think about the designs you look at and the
apps you build. When you build a user interface with React, you will
first break it apart into pieces called *components*...

## Step 1: Break the UI into a component hierarchy

Start by drawing boxes around every component and subcomponent in the
mockup and naming them...
```
Why Use an API?
- JavaScript rendering — Dynamic sites work without extra config
- Content extraction — Main content is extracted automatically, no boilerplate
- Anti-bot bypass — Protected sites work out of the box (see avoiding blocks)
- No dependencies — Works from any language, any environment
- Consistent output — Same quality across different site types
You can also try it free with our URL to Markdown tool — no API key needed.
Method 2: Python with Trafilatura + Markdownify
For a local, library-based approach, the combination of trafilatura (content extraction) and markdownify (HTML-to-markdown conversion) works well:
```bash
pip install trafilatura markdownify
```
```python
import trafilatura
from markdownify import markdownify

def url_to_markdown(url: str) -> str:
    """Convert a URL to markdown using trafilatura + markdownify."""
    # Fetch and extract main content
    downloaded = trafilatura.fetch_url(url)
    html_content = trafilatura.extract(
        downloaded,
        output_format="html",
        include_links=True,
        include_images=True,
        include_tables=True,
    )
    if not html_content:
        return ""
    # Convert HTML to markdown
    markdown = markdownify(
        html_content,
        heading_style="ATX",
        bullets="-",
        strip=["script", "style"],
    )
    return markdown.strip()

md = url_to_markdown("https://example.com/article")
print(md)
```
Limitations
- No JavaScript rendering — CSR pages return empty or partial content
- Extraction accuracy varies — `trafilatura` uses heuristics that work well for articles but struggle with non-standard layouts
- No anti-bot bypass — Protected sites will block the requests
- Tables can break — Complex HTML tables don't always convert cleanly
Method 3: Playwright + Turndown (Node.js)
For JavaScript-rendered pages using Node.js, combine Playwright for rendering with Turndown for conversion:
```bash
npm install playwright turndown @mozilla/readability jsdom
```
```javascript
const { chromium } = require("playwright");
const TurndownService = require("turndown");
const { Readability } = require("@mozilla/readability");
const { JSDOM } = require("jsdom");

async function urlToMarkdown(url) {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: "networkidle" });
  const html = await page.content();
  await browser.close();

  // Extract main content with Readability
  const dom = new JSDOM(html, { url });
  const reader = new Readability(dom.window.document);
  const article = reader.parse();
  if (!article) return "";

  // Convert to markdown
  const turndown = new TurndownService({
    headingStyle: "atx",
    codeBlockStyle: "fenced",
  });
  return turndown.turndown(article.content);
}

urlToMarkdown("https://example.com/docs/getting-started").then(console.log);
```
This approach handles JavaScript rendering and produces good markdown, but it's slow (3-10 seconds per page), resource-heavy, and requires managing a Playwright installation.
Method 4: Pandoc (Command Line)
For offline HTML files or simple static pages, Pandoc is a powerful command-line converter:
```bash
# Convert a local HTML file
pandoc input.html -f html -t markdown -o output.md

# Convert from URL (static pages only)
curl -s "https://example.com/article" | pandoc -f html -t markdown
```
Pandoc produces excellent markdown from well-structured HTML. It supports tables, footnotes, math notation, and other advanced features. However, it doesn't do content extraction — you get everything in the HTML, including navigation and footers. And it can't handle JavaScript-rendered content.
Combining curl + Readability CLI + Pandoc
For a shell-based pipeline that includes content extraction:
```bash
# Install readability-cli: npm install -g @niccokunzmann/readability-cli
# Then pipe through pandoc
curl -s "https://example.com/article" \
  | readable --html \
  | pandoc -f html -t markdown --wrap=none
```
Method 5: Build Your Own Converter
For full control over the conversion logic, you can build a custom converter. Here's a Python implementation that handles the most common HTML elements:
````python
from bs4 import BeautifulSoup
import re

class HtmlToMarkdown:
    def __init__(self):
        self.result = []

    def convert(self, html: str) -> str:
        self.result = []  # reset state so the converter can be reused
        soup = BeautifulSoup(html, "html.parser")
        # Remove script, style, nav, footer, header
        for tag in soup.find_all(["script", "style", "nav", "footer", "header", "aside"]):
            tag.decompose()
        self._process_element(soup)
        text = "\n".join(self.result)
        # Clean up excessive newlines
        text = re.sub(r"\n{3,}", "\n\n", text)
        return text.strip()

    def _process_element(self, element):
        if element.string:
            self.result.append(element.string.strip())
            return
        for child in element.children:
            if child.name is None:
                text = child.string
                if text and text.strip():
                    self.result.append(text.strip())
            elif child.name in ["h1", "h2", "h3", "h4", "h5", "h6"]:
                level = int(child.name[1])
                self.result.append(f"\n{'#' * level} {child.get_text().strip()}\n")
            elif child.name == "p":
                self.result.append(f"\n{child.get_text().strip()}\n")
            elif child.name == "a":
                href = child.get("href", "")
                text = child.get_text().strip()
                self.result.append(f"[{text}]({href})")
            elif child.name == "code":
                if child.parent and child.parent.name == "pre":
                    lang = ""
                    classes = child.get("class", [])
                    for cls in classes:
                        if cls.startswith("language-"):
                            lang = cls.replace("language-", "")
                    self.result.append(f"\n```{lang}\n{child.get_text()}\n```\n")
                else:
                    self.result.append(f"`{child.get_text()}`")
            elif child.name == "pre":
                self._process_element(child)
            elif child.name in ["ul", "ol"]:
                self._process_list(child, child.name)
            elif child.name == "blockquote":
                text = child.get_text().strip()
                quoted = "\n".join(f"> {line}" for line in text.split("\n"))
                self.result.append(f"\n{quoted}\n")
            elif child.name == "img":
                alt = child.get("alt", "")
                src = child.get("src", "")
                self.result.append(f"![{alt}]({src})")
            elif child.name in ("strong", "b"):
                self.result.append(f"**{child.get_text().strip()}**")
            elif child.name in ("em", "i"):
                self.result.append(f"*{child.get_text().strip()}*")
            elif child.name == "hr":
                self.result.append("\n---\n")
            elif child.name == "br":
                self.result.append("\n")
            else:
                self._process_element(child)

    def _process_list(self, element, list_type):
        items = element.find_all("li", recursive=False)
        for i, item in enumerate(items):
            prefix = f"{i + 1}." if list_type == "ol" else "-"
            self.result.append(f"{prefix} {item.get_text().strip()}")
        self.result.append("")

# html_content: an HTML string you've already fetched
converter = HtmlToMarkdown()
markdown = converter.convert(html_content)
````
This is educational and useful for custom requirements, but for production use you'll spend significant time handling edge cases — nested tables, complex list structures, embedded media, and non-standard HTML patterns.
Comparing the Approaches
| Method | JS Rendering | Content Extraction | Code Blocks | Tables | Speed | Setup |
|---|---|---|---|---|---|---|
| SimpleCrawl API | Yes | Yes | Excellent | Excellent | Fast | None |
| Trafilatura + Markdownify | No | Good | Good | Fair | Fast | Minimal |
| Playwright + Turndown | Yes | Good | Good | Good | Slow | Heavy |
| Pandoc | No | None | Excellent | Excellent | Fast | Minimal |
| Custom converter | Depends | Custom | Custom | Custom | Varies | Heavy |
For most developers, the API approach is the right default. It handles the hard parts (JavaScript rendering, content extraction, anti-bot bypass) and returns consistent results across different site types.
Handling Edge Cases
Tables with Complex Formatting
HTML tables often have merged cells, nested tables, or style-based layouts that don't map cleanly to markdown tables. The best approach is to simplify:
```python
from bs4 import BeautifulSoup

def simplify_table(html_table: str) -> str:
    """Convert an HTML table to a simple markdown table."""
    soup = BeautifulSoup(html_table, "html.parser")
    rows = soup.find_all("tr")
    if not rows:
        return ""
    md_rows = []
    for row in rows:
        cells = row.find_all(["td", "th"])
        md_row = "| " + " | ".join(cell.get_text().strip() for cell in cells) + " |"
        md_rows.append(md_row)
    if len(md_rows) > 1:
        # Add separator after header
        num_cols = md_rows[0].count("|") - 1
        separator = "|" + "|".join(" --- " for _ in range(num_cols)) + "|"
        md_rows.insert(1, separator)
    return "\n".join(md_rows)
```
Code Blocks with Syntax Highlighting
Detecting the programming language of a code block:
```python
def detect_code_language(code_element) -> str:
    """Try to detect the programming language from HTML class attributes."""
    classes = code_element.get("class", [])
    for cls in classes:
        for prefix in ["language-", "lang-", "highlight-", "brush:"]:
            if cls.startswith(prefix):
                return cls.replace(prefix, "").strip()
    # Check parent element
    parent = code_element.parent
    if parent:
        parent_classes = parent.get("class", [])
        for cls in parent_classes:
            for prefix in ["language-", "lang-"]:
                if cls.startswith(prefix):
                    return cls.replace(prefix, "").strip()
    return ""
```
Relative URLs
When converting to markdown, relative URLs need to be resolved to absolute URLs:
```python
import re
from urllib.parse import urljoin

def resolve_urls(markdown: str, base_url: str) -> str:
    """Convert relative URLs in markdown links to absolute URLs."""
    def resolve_match(match):
        text = match.group(1)
        url = match.group(2)
        if not url.startswith(("http://", "https://", "mailto:", "#")):
            url = urljoin(base_url, url)
        return f"[{text}]({url})"
    return re.sub(r"\[([^\]]*)\]\(([^)]+)\)", resolve_match, markdown)
```
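For reference, `urljoin` from the standard library does the path arithmetic. A few examples of how it resolves relative references against a base URL (fragment-only links like `#section` are skipped above so in-page anchors stay relative):

```python
from urllib.parse import urljoin

base = "https://example.com/docs/guide/"

print(urljoin(base, "../api/auth"))       # https://example.com/docs/api/auth
print(urljoin(base, "/images/logo.png"))  # https://example.com/images/logo.png
print(urljoin(base, "setup"))             # https://example.com/docs/guide/setup
```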
Batch Conversion: Converting Entire Sites
For documentation sites or content archives, you often need to convert many pages at once:
```python
import requests
from concurrent.futures import ThreadPoolExecutor

API_BASE = "https://api.simplecrawl.com/v1"
HEADERS = {"Authorization": "Bearer sc_your_api_key"}

def convert_url(url: str) -> dict:
    """Convert a single URL to markdown."""
    response = requests.post(
        f"{API_BASE}/scrape",
        headers=HEADERS,
        json={"url": url, "format": "markdown"}
    )
    data = response.json()
    return {
        "url": url,
        "title": data.get("title", ""),
        "markdown": data.get("markdown", ""),
    }

def batch_convert(urls: list[str], workers: int = 5) -> list[dict]:
    """Convert multiple URLs to markdown in parallel."""
    with ThreadPoolExecutor(max_workers=workers) as executor:
        results = list(executor.map(convert_url, urls))
    return results

# Convert an entire documentation site
urls = [
    "https://docs.example.com/getting-started",
    "https://docs.example.com/api-reference",
    "https://docs.example.com/authentication",
    "https://docs.example.com/rate-limits",
]
docs = batch_convert(urls)

# Save as individual files
for doc in docs:
    filename = doc["url"].split("/")[-1] + ".md"
    with open(filename, "w") as f:
        f.write(f"# {doc['title']}\n\n{doc['markdown']}")
    print(f"Saved {filename}")
```
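A batch run will occasionally hit transient failures (timeouts, 5xx responses). A minimal retry wrapper with exponential backoff — a generic sketch, not part of any SimpleCrawl client — can be dropped around each conversion call:

```python
import time

def with_retries(fn, attempts=3, base_delay=1.0):
    """Call fn(); on exception, back off exponentially and retry."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error
            time.sleep(base_delay * (2 ** attempt))

# Example: a call that fails twice before succeeding
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

print(with_retries(flaky, attempts=3, base_delay=0.01))  # ok
```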
Output Quality: What Good Markdown Looks Like
Good URL-to-markdown conversion produces output that:
- Preserves heading hierarchy — H1 through H6 map to `#` through `######`
- Keeps links functional — Both inline and reference-style links work
- Formats code correctly — Inline code uses backticks, blocks use fenced syntax with language hints
- Renders tables — Column alignment and header separation are maintained
- Handles images — Alt text and URLs are preserved in `![alt](url)` format
- Strips noise — No navigation, ads, cookie banners, or related article widgets
- Reads naturally — A human can read the markdown file comfortably
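Parts of this checklist can be automated. A rough heuristic report — a sketch, not a full markdown linter — might look like:

```python
import re

def quality_report(md: str) -> dict:
    """Rough heuristics for spotting bad conversions; not a full linter."""
    return {
        "has_headings": bool(re.search(r"^#{1,6} ", md, re.M)),
        "has_links": bool(re.search(r"\[[^\]]+\]\([^)]+\)", md)),
        "code_fences_balanced": md.count("```") % 2 == 0,
        "nav_noise": bool(re.search(r"cookie|accept all|sign in", md, re.I)),
    }

sample = "# Title\n\nSee the [docs](https://example.com).\n\nAccept all cookies to continue."
print(quality_report(sample))
# {'has_headings': True, 'has_links': True, 'code_fences_balanced': True, 'nav_noise': True}
```

A `nav_noise` hit or missing headings is a good signal that content extraction failed and the page needs a different method.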
FAQ
What is the best tool to convert a URL to markdown?
For a quick conversion, use the free SimpleCrawl URL to Markdown tool — no signup needed. For programmatic use, the SimpleCrawl API handles JavaScript rendering, content extraction, and conversion in a single call. For local/offline use, the combination of trafilatura + markdownify in Python is solid.
Can I convert a JavaScript-rendered page to markdown?
Yes, but you need a tool that executes JavaScript. The SimpleCrawl API renders JavaScript automatically. For a local approach, use Playwright to render the page, then extract and convert the HTML. Libraries like trafilatura that only fetch raw HTML will miss content on JavaScript-heavy sites.
How do I preserve code blocks when converting to markdown?
Look for `<pre>` and `<code>` elements in the HTML. Check the class attribute for language hints (e.g., `class="language-python"`). Wrap the content in fenced code blocks with the appropriate language tag. SimpleCrawl handles this automatically and correctly identifies language hints from most common syntax highlighting libraries.
Is there a free URL to markdown converter?
Yes — SimpleCrawl's URL to Markdown tool is free to use with no signup. For programmatic access, the SimpleCrawl API free tier includes 500 credits per month. For fully local solutions, pandoc and trafilatura are open-source.
How do I handle images when converting to markdown?
Images are converted to `![alt text](image-url)` syntax. Relative image URLs should be resolved to absolute URLs. If you need the images locally, download them separately and update the markdown paths. SimpleCrawl returns images as absolute URLs in the markdown output.
Ready to try SimpleCrawl?
We're building the simplest web scraping API for AI. Join the waitlist and get 500 free credits at launch.