SimpleCrawl
Markdown · Web Scraping · Developer Tools

URL to Markdown: The Complete Developer Guide

Learn how to convert any website to clean markdown. Covers libraries, APIs, and custom solutions — with code examples in Python, Node.js, and shell commands.

SimpleCrawl Team · March 4, 2026 · 11 min read

What Does "URL to Markdown" Mean?

Converting a URL to markdown means fetching a web page and transforming its HTML content into clean, readable markdown text — preserving headings, lists, links, code blocks, and tables while stripping away navigation, ads, scripts, and other noise.

Quick answer: To convert a URL to markdown, use a tool that fetches the page, extracts the main content, and converts the HTML structure to markdown syntax. You can do this with the SimpleCrawl API, Python libraries like markdownify and trafilatura, or the free URL to Markdown tool on our site.

Converting pages to markdown is a common requirement in modern development. Typical use cases include:

  • RAG pipelines — Feed clean web content to LLMs (see our RAG pipeline guide)
  • Documentation archival — Save web docs in a portable, version-controllable format
  • Content migration — Move content between CMS platforms
  • AI context — Give AI coding assistants project documentation as markdown context
  • Read-it-later — Convert articles to distraction-free markdown for reading

The Three Challenges of URL-to-Markdown Conversion

Converting HTML to markdown sounds simple until you try it. There are three hard problems:

1. Content Extraction

A web page's HTML contains the article content mixed with headers, footers, navigation, sidebars, ads, cookie banners, and more. You need to identify and extract just the main content.

2. Structure Preservation

Good conversion preserves the document's semantic structure: heading hierarchy, ordered vs unordered lists, table alignment, code block language hints, and link destinations. A naive tag-stripping approach loses all of this.

3. JavaScript Rendering

Modern websites render content with JavaScript. If you fetch the raw HTML, you might get an empty page shell. The conversion tool needs to render JavaScript first. For more on this, see our guide on scraping JavaScript-heavy websites.
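One way to spot this problem is to compare the amount of visible text to the total HTML size: an empty application shell is mostly markup and script with almost no readable content. A rough stdlib-only heuristic (the 1% threshold is an arbitrary assumption; tune it for your sources):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

def looks_like_js_shell(html: str, min_ratio: float = 0.01) -> bool:
    """Heuristic: if visible text is a tiny fraction of the raw HTML,
    the page is probably an empty shell waiting for JavaScript."""
    parser = TextExtractor()
    parser.feed(html)
    text_len = len(" ".join(parser.chunks))
    return text_len / max(len(html), 1) < min_ratio
```

If this returns True for a page you fetched with plain HTTP, you need a rendering-capable tool before conversion.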

Method 1: The SimpleCrawl API

The fastest path from URL to clean markdown is a single API call. SimpleCrawl handles fetching, JavaScript rendering, content extraction, and conversion in one request:

import requests

response = requests.post(
    "https://api.simplecrawl.com/v1/scrape",
    headers={"Authorization": "Bearer sc_your_api_key"},
    json={
        "url": "https://react.dev/learn/thinking-in-react",
        "format": "markdown"
    }
)

data = response.json()
print(data["markdown"])

Output:

# Thinking in React

React can change how you think about the designs you look at and the
apps you build. When you build a user interface with React, you will
first break it apart into pieces called *components*...

## Step 1: Break the UI into a component hierarchy

Start by drawing boxes around every component and subcomponent in the
mockup and naming them...

Why Use an API?

  • JavaScript rendering — Dynamic sites work without extra config
  • Content extraction — Main content is extracted automatically, no boilerplate
  • Anti-bot bypass — Protected sites work out of the box (see avoiding blocks)
  • No dependencies — Works from any language, any environment
  • Consistent output — Same quality across different site types

You can also try it free with our URL to Markdown tool — no API key needed.

Method 2: Python with Trafilatura + Markdownify

For a local, library-based approach, the combination of trafilatura (content extraction) and markdownify (HTML-to-markdown conversion) works well:

pip install trafilatura markdownify
import trafilatura
from markdownify import markdownify

def url_to_markdown(url: str) -> str:
    """Convert a URL to markdown using trafilatura + markdownify."""
    # Fetch and extract main content
    downloaded = trafilatura.fetch_url(url)
    html_content = trafilatura.extract(
        downloaded,
        output_format="html",
        include_links=True,
        include_images=True,
        include_tables=True,
    )

    if not html_content:
        return ""

    # Convert HTML to markdown
    markdown = markdownify(
        html_content,
        heading_style="ATX",
        bullets="-",
        strip=["script", "style"],
    )

    return markdown.strip()

md = url_to_markdown("https://example.com/article")
print(md)

Limitations

  • No JavaScript rendering — CSR pages return empty or partial content
  • Extraction accuracy varies — trafilatura uses heuristics that work well for articles but struggle with non-standard layouts
  • No anti-bot bypass — Protected sites will block the requests
  • Tables can break — Complex HTML tables don't always convert cleanly

Method 3: Playwright + Turndown (Node.js)

For JavaScript-rendered pages using Node.js, combine Playwright for rendering with Turndown for conversion:

npm install playwright turndown @mozilla/readability jsdom
const { chromium } = require("playwright");
const TurndownService = require("turndown");
const { Readability } = require("@mozilla/readability");
const { JSDOM } = require("jsdom");

async function urlToMarkdown(url) {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: "networkidle" });

  const html = await page.content();
  await browser.close();

  // Extract main content with Readability
  const dom = new JSDOM(html, { url });
  const reader = new Readability(dom.window.document);
  const article = reader.parse();

  if (!article) return "";

  // Convert to markdown
  const turndown = new TurndownService({
    headingStyle: "atx",
    codeBlockStyle: "fenced",
  });

  return turndown.turndown(article.content);
}

urlToMarkdown("https://example.com/docs/getting-started").then(console.log);

This approach handles JavaScript rendering and produces good markdown, but it's slow (3-10 seconds per page), resource-heavy, and requires managing a Playwright installation.

Method 4: Pandoc (Command Line)

For offline HTML files or simple static pages, Pandoc is a powerful command-line converter:

# Convert a local HTML file
pandoc input.html -f html -t markdown -o output.md

# Convert from URL (static pages only)
curl -s "https://example.com/article" | pandoc -f html -t markdown

Pandoc produces excellent markdown from well-structured HTML. It supports tables, footnotes, math notation, and other advanced features. However, it doesn't do content extraction — you get everything in the HTML, including navigation and footers. And it can't handle JavaScript-rendered content.

Combining curl + Readability CLI + Pandoc

For a shell-based pipeline that includes content extraction:

# Install readability-cli: npm install -g @niccokunzmann/readability-cli
# Then pipe through pandoc

curl -s "https://example.com/article" \
  | readable --html \
  | pandoc -f html -t markdown --wrap=none

Method 5: Build Your Own Converter

For full control over the conversion logic, you can build a custom converter. Here's a Python implementation that handles the most common HTML elements:

from bs4 import BeautifulSoup
import re

class HtmlToMarkdown:
    def __init__(self):
        self.result = []

    def convert(self, html: str) -> str:
        self.result = []  # reset state so the converter can be reused
        soup = BeautifulSoup(html, "html.parser")

        # Remove script, style, nav, footer, header
        for tag in soup.find_all(["script", "style", "nav", "footer", "header", "aside"]):
            tag.decompose()

        self._process_element(soup)
        text = "\n".join(self.result)

        # Clean up excessive newlines
        text = re.sub(r"\n{3,}", "\n\n", text)
        return text.strip()

    def _process_element(self, element):
        if element.string:
            self.result.append(element.string.strip())
            return

        for child in element.children:
            if child.name is None:
                text = child.string
                if text and text.strip():
                    self.result.append(text.strip())
            elif child.name in ["h1", "h2", "h3", "h4", "h5", "h6"]:
                level = int(child.name[1])
                self.result.append(f"\n{'#' * level} {child.get_text().strip()}\n")
            elif child.name == "p":
                self.result.append(f"\n{child.get_text().strip()}\n")
            elif child.name == "a":
                href = child.get("href", "")
                text = child.get_text().strip()
                self.result.append(f"[{text}]({href})")
            elif child.name == "code":
                if child.parent and child.parent.name == "pre":
                    lang = ""
                    classes = child.get("class", [])
                    for cls in classes:
                        if cls.startswith("language-"):
                            lang = cls.replace("language-", "")
                    self.result.append(f"\n```{lang}\n{child.get_text()}\n```\n")
                else:
                    self.result.append(f"`{child.get_text()}`")
            elif child.name == "pre":
                self._process_element(child)
            elif child.name in ["ul", "ol"]:
                self._process_list(child, child.name)
            elif child.name == "blockquote":
                text = child.get_text().strip()
                quoted = "\n".join(f"> {line}" for line in text.split("\n"))
                self.result.append(f"\n{quoted}\n")
            elif child.name == "img":
                alt = child.get("alt", "")
                src = child.get("src", "")
                self.result.append(f"![{alt}]({src})")
            elif child.name == "strong" or child.name == "b":
                self.result.append(f"**{child.get_text().strip()}**")
            elif child.name == "em" or child.name == "i":
                self.result.append(f"*{child.get_text().strip()}*")
            elif child.name == "hr":
                self.result.append("\n---\n")
            elif child.name == "br":
                self.result.append("\n")
            else:
                self._process_element(child)

    def _process_list(self, element, list_type):
        items = element.find_all("li", recursive=False)
        for i, item in enumerate(items):
            prefix = f"{i + 1}." if list_type == "ol" else "-"
            self.result.append(f"{prefix} {item.get_text().strip()}")
        self.result.append("")

converter = HtmlToMarkdown()
markdown = converter.convert(html_content)

This is educational and useful for custom requirements, but for production use you'll spend significant time handling edge cases — nested tables, complex list structures, embedded media, and non-standard HTML patterns.

Comparing the Approaches

| Method | JS Rendering | Content Extraction | Code Blocks | Tables | Speed | Setup |
| --- | --- | --- | --- | --- | --- | --- |
| SimpleCrawl API | Yes | Yes | Excellent | Excellent | Fast | None |
| Trafilatura + Markdownify | No | Good | Good | Fair | Fast | Minimal |
| Playwright + Turndown | Yes | Good | Good | Good | Slow | Heavy |
| Pandoc | No | None | Excellent | Excellent | Fast | Minimal |
| Custom converter | Depends | Custom | Custom | Custom | Varies | Heavy |

For most developers, the API approach is the right default. It handles the hard parts (JavaScript rendering, content extraction, anti-bot bypass) and returns consistent results across different site types.

Handling Edge Cases

Tables with Complex Formatting

HTML tables often have merged cells, nested tables, or style-based layouts that don't map cleanly to markdown tables. The best approach is to simplify:

def simplify_table(html_table: str) -> str:
    """Convert an HTML table to a simple markdown table."""
    soup = BeautifulSoup(html_table, "html.parser")
    rows = soup.find_all("tr")

    if not rows:
        return ""

    md_rows = []
    for row in rows:
        cells = row.find_all(["td", "th"])
        md_row = "| " + " | ".join(cell.get_text().strip() for cell in cells) + " |"
        md_rows.append(md_row)

    if len(md_rows) > 1:
        # Add separator after header
        num_cols = md_rows[0].count("|") - 1
        separator = "|" + "|".join(" --- " for _ in range(num_cols)) + "|"
        md_rows.insert(1, separator)

    return "\n".join(md_rows)
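The same simplification works without the bs4 dependency. A stdlib-only sketch using html.parser — it assumes a plain grid with no colspan, rowspan, or nested tables:

```python
from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Collect <tr>/<td>/<th> cells from a simple HTML table."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self.rows[-1].append(" ".join(self._cell).strip())
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None and data.strip():
            self._cell.append(data.strip())

def table_to_markdown(html_table: str) -> str:
    """Convert a simple HTML table to a markdown table."""
    parser = TableParser()
    parser.feed(html_table)
    if not parser.rows:
        return ""
    lines = ["| " + " | ".join(row) + " |" for row in parser.rows]
    if len(lines) > 1:
        # Treat the first row as the header
        lines.insert(1, "|" + "|".join(" --- " for _ in parser.rows[0]) + "|")
    return "\n".join(lines)
```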

Code Blocks with Syntax Highlighting

Detecting the programming language of a code block:

def detect_code_language(code_element) -> str:
    """Try to detect the programming language from HTML class attributes."""
    classes = code_element.get("class", [])
    for cls in classes:
        for prefix in ["language-", "lang-", "highlight-", "brush:"]:
            if cls.startswith(prefix):
                return cls.replace(prefix, "").strip()

    # Check parent element
    parent = code_element.parent
    if parent:
        parent_classes = parent.get("class", [])
        for cls in parent_classes:
            for prefix in ["language-", "lang-"]:
                if cls.startswith(prefix):
                    return cls.replace(prefix, "").strip()

    return ""

Relative URLs

When converting to markdown, relative URLs need to be resolved to absolute URLs:

import re
from urllib.parse import urljoin

def resolve_urls(markdown: str, base_url: str) -> str:
    """Convert relative URLs in markdown to absolute URLs."""

    def resolve_match(match):
        text = match.group(1)
        url = match.group(2)
        if not url.startswith(("http://", "https://", "mailto:", "#")):
            url = urljoin(base_url, url)
        return f"[{text}]({url})"

    return re.sub(r'\[([^\]]*)\]\(([^)]+)\)', resolve_match, markdown)
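The heavy lifting here is `urllib.parse.urljoin`, which applies the standard relative-reference resolution rules (sibling paths, `..` segments, root-relative and protocol-relative references):

```python
from urllib.parse import urljoin

base = "https://docs.example.com/guides/intro"

# Sibling path: replaces the last path segment
assert urljoin(base, "setup") == "https://docs.example.com/guides/setup"
# Parent reference: steps out of the current directory
assert urljoin(base, "../api/auth") == "https://docs.example.com/api/auth"
# Root-relative: keeps only the scheme and host
assert urljoin(base, "/pricing") == "https://docs.example.com/pricing"
# Protocol-relative: inherits the scheme from the base URL
assert urljoin(base, "//cdn.example.com/app.js") == "https://cdn.example.com/app.js"
```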

Batch Conversion: Converting Entire Sites

For documentation sites or content archives, you often need to convert many pages at once:

import requests
from concurrent.futures import ThreadPoolExecutor

API_BASE = "https://api.simplecrawl.com/v1"
HEADERS = {"Authorization": "Bearer sc_your_api_key"}

def convert_url(url: str) -> dict:
    """Convert a single URL to markdown."""
    response = requests.post(
        f"{API_BASE}/scrape",
        headers=HEADERS,
        json={"url": url, "format": "markdown"}
    )
    data = response.json()
    return {
        "url": url,
        "title": data.get("title", ""),
        "markdown": data.get("markdown", ""),
    }

def batch_convert(urls: list[str], workers: int = 5) -> list[dict]:
    """Convert multiple URLs to markdown in parallel."""
    with ThreadPoolExecutor(max_workers=workers) as executor:
        results = list(executor.map(convert_url, urls))
    return results

# Convert an entire documentation site
urls = [
    "https://docs.example.com/getting-started",
    "https://docs.example.com/api-reference",
    "https://docs.example.com/authentication",
    "https://docs.example.com/rate-limits",
]

docs = batch_convert(urls)

# Save as individual files
for doc in docs:
    filename = doc["url"].split("/")[-1] + ".md"
    with open(filename, "w") as f:
        f.write(f"# {doc['title']}\n\n{doc['markdown']}")
    print(f"Saved {filename}")
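The batch code above assumes every request succeeds. For real crawls you'll want retries with backoff around `convert_url`; a minimal sketch (the attempt count and delays are arbitrary assumptions):

```python
import time

def with_retries(fn, attempts: int = 3, base_delay: float = 1.0):
    """Return a wrapper that retries fn with exponential backoff."""
    def wrapper(*args, **kwargs):
        for attempt in range(attempts):
            try:
                return fn(*args, **kwargs)
            except Exception:
                if attempt == attempts - 1:
                    raise  # out of attempts: surface the error
                time.sleep(base_delay * (2 ** attempt))
    return wrapper

# Usage: wrap convert_url before handing it to the thread pool
# safe_convert = with_retries(convert_url)
```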

Output Quality: What Good Markdown Looks Like

Good URL-to-markdown conversion produces output that:

  1. Preserves heading hierarchy — H1 through H6 map to # through ######
  2. Keeps links functional — Both inline and reference-style links work
  3. Formats code correctly — Inline code uses backticks, blocks use fenced syntax with language hints
  4. Renders tables — Column alignment and header separation are maintained
  5. Handles images — Alt text and URLs are preserved in ![alt](url) format
  6. Strips noise — No navigation, ads, cookie banners, or related article widgets
  7. Reads naturally — A human can read the markdown file comfortably
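A few of these properties are easy to check mechanically. A lightweight sanity check for converted output — the specific rules are illustrative, not a standard:

```python
import re

def markdown_quality_issues(md: str) -> list[str]:
    """Flag common symptoms of a bad HTML-to-markdown conversion."""
    issues = []
    # Fences come in pairs; an odd count means a block was cut off
    if md.count("```") % 2 != 0:
        issues.append("unbalanced code fences")
    # Any real article should have at least one heading
    if not re.search(r"^#{1,6} ", md, re.M):
        issues.append("no headings found")
    # Links whose destination was lost during conversion
    if re.search(r"\[[^\]]*\]\(\s*\)", md):
        issues.append("links with empty destinations")
    # Raw HTML surviving in the output suggests incomplete conversion
    if "<div" in md or "<span" in md:
        issues.append("leftover HTML tags")
    return issues
```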

FAQ

What is the best tool to convert a URL to markdown?

For a quick conversion, use the free SimpleCrawl URL to Markdown tool — no signup needed. For programmatic use, the SimpleCrawl API handles JavaScript rendering, content extraction, and conversion in a single call. For local/offline use, the combination of trafilatura + markdownify in Python is solid.

Can I convert a JavaScript-rendered page to markdown?

Yes, but you need a tool that executes JavaScript. The SimpleCrawl API renders JavaScript automatically. For a local approach, use Playwright to render the page, then extract and convert the HTML. Libraries like trafilatura that only fetch raw HTML will miss content on JavaScript-heavy sites.

How do I preserve code blocks when converting to markdown?

Look for <pre> and <code> elements in the HTML. Check the class attribute for language hints (e.g., class="language-python"). Wrap the content in fenced code blocks with the appropriate language tag. SimpleCrawl handles this automatically and correctly identifies language hints from most common syntax highlighting libraries.
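Those steps can be sketched in a few lines. This regex-based version assumes a simple, unnested `<pre><code class="...">` structure; real-world highlighters vary, so a proper parser is safer for production:

```python
import re
from html import unescape

def pre_to_fenced(html_snippet: str) -> str:
    """Turn a <pre><code class="language-x">...</code></pre> snippet
    into a fenced markdown code block with a language hint."""
    m = re.search(
        r'<pre[^>]*><code[^>]*class="[^"]*language-([\w+-]+)[^"]*"[^>]*>(.*?)</code></pre>',
        html_snippet,
        re.S,
    )
    if not m:
        return ""
    lang, code = m.group(1), unescape(m.group(2))  # decode &quot; etc.
    return f"```{lang}\n{code.strip()}\n```"
```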

Is there a free URL to markdown converter?

Yes — SimpleCrawl's URL to Markdown tool is free to use with no signup. For programmatic access, the SimpleCrawl API free tier includes 500 credits per month. For fully local solutions, pandoc and trafilatura are open-source.

How do I handle images when converting to markdown?

Images are converted to ![alt text](image-url) syntax. Relative image URLs should be resolved to absolute URLs. If you need the images locally, download them separately and update the markdown paths. SimpleCrawl returns images as absolute URLs in the markdown output.
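Resolving relative image sources follows the same pattern as for links; a small sketch handling the `![alt](src)` syntax directly:

```python
import re
from urllib.parse import urljoin

def absolutize_images(markdown: str, base_url: str) -> str:
    """Rewrite relative image sources in markdown to absolute URLs."""
    def fix(match):
        alt, src = match.group(1), match.group(2)
        return f"![{alt}]({urljoin(base_url, src)})"
    # urljoin leaves already-absolute URLs unchanged
    return re.sub(r"!\[([^\]]*)\]\(([^)\s]+)\)", fix, markdown)
```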

Ready to try SimpleCrawl?

We're building the simplest web scraping API for AI. Join the waitlist and get 500 free credits at launch.

Get early access + 500 free credits