SimpleCrawl

How to Scrape YouTube — Complete Guide (2026)

Learn how to scrape YouTube video data, comments, channel statistics, and transcripts. Compare Python scrapers with the SimpleCrawl API for YouTube data extraction.

6 min read

YouTube is the world's largest video platform with over 800 million videos and 2.7 billion monthly active users. Scraping YouTube data enables content analysis, competitive intelligence, influencer marketing research, and trend detection at scale. This guide covers practical methods for extracting video metadata, comments, channel data, and transcripts from YouTube.

What Data Can You Extract from YouTube?

YouTube pages contain multiple layers of extractable data:

  • Video metadata — title, description, view count, like count, publish date, duration, resolution, category, tags, thumbnail URL
  • Channel data — name, subscriber count, total videos, total views, join date, description, links, verification status
  • Comments — text, author, timestamp, likes, reply count, pinned status, hearted by creator
  • Search results — videos matching a query, ranked by relevance/date/views, with thumbnails and snippets
  • Transcripts/captions — auto-generated and manual captions in multiple languages
  • Playlist data — title, video list, total duration, creator, creation date
  • Trending data — trending videos by category and region
  • Engagement metrics — likes, views, comments, shares over time

This data powers influencer marketing platforms, content aggregation tools, media monitoring dashboards, and AI training pipelines.
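Whichever data layer you target, the first step is usually normalizing the many YouTube URL forms (watch, youtu.be, Shorts, embeds) down to the 11-character video ID. A small standalone helper, sketched here as an illustration (not part of any library mentioned in this guide):

```python
import re
from urllib.parse import urlparse, parse_qs

def extract_video_id(url: str):
    """Return the 11-character video ID from common YouTube URL forms."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if host.endswith("youtu.be"):
        # Short-link form: https://youtu.be/<id>
        return parsed.path.lstrip("/")[:11] or None
    if "youtube.com" in host:
        if parsed.path == "/watch":
            # Standard form: https://www.youtube.com/watch?v=<id>
            return parse_qs(parsed.query).get("v", [None])[0]
        # Shorts, embeds, and live URLs carry the ID in the path
        m = re.match(r"^/(shorts|embed|live)/([\w-]{11})", parsed.path)
        if m:
            return m.group(2)
    return None
```

The same ID works across the Data API, yt-dlp, and transcript extraction, so normalizing early keeps the rest of a pipeline uniform.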

Challenges When Scraping YouTube

YouTube (owned by Google) has sophisticated anti-scraping measures:

Heavy JavaScript Rendering

YouTube is a complex SPA built on Polymer/Web Components. Video metadata, comments, and recommendations load dynamically through internal API calls. Static HTML fetches return minimal useful data.
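One reason naive HTML parsing fails: the initial page payload ships as a JSON blob (`ytInitialData`) inside a script tag, not as rendered DOM elements. A sketch of pulling it out of fetched HTML, assuming the common `var ytInitialData = {...};` form (real pages vary and the format can change without notice):

```python
import json
import re

def extract_yt_initial_data(html: str):
    """Pull the ytInitialData JSON blob out of a YouTube page's HTML.

    The data sits inside a <script> tag as `var ytInitialData = {...};`,
    so tag-based scraping of the body finds almost nothing useful.
    """
    m = re.search(r"var ytInitialData\s*=\s*(\{.*?\});</script>", html, re.DOTALL)
    if not m:
        return None
    try:
        return json.loads(m.group(1))
    except json.JSONDecodeError:
        return None
```

Even when this works, dynamically loaded data (comments, recommendations) never appears in the static HTML at all, which is why browser rendering is usually required.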

Comment Pagination

YouTube loads comments lazily — the initial page shows only a few comments, with more loading on scroll. Popular videos have millions of comments, requiring complex scroll-and-load automation.

Anti-Bot Detection

YouTube uses Google's enterprise anti-bot stack: reCAPTCHA, fingerprinting, request pattern analysis, and session tracking. Automated browsing is detected through subtle signals like font rendering and plugin enumeration.

Rate Limiting

YouTube throttles requests aggressively. Too many requests from one IP trigger CAPTCHA challenges or temporary blocks. This applies to both web scraping and YouTube Data API requests.
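A standard way to stay under throttling thresholds is exponential backoff with jitter. A minimal sketch, where `fetch` stands in for whatever request function you use (it is assumed to raise an exception on a 429 or temporary block):

```python
import random
import time

def fetch_with_backoff(fetch, url: str, max_retries: int = 5, base: float = 1.0):
    """Call fetch(url), retrying failed attempts with exponential backoff.

    Waits base*1, base*2, base*4, ... seconds plus random jitter between
    attempts; the jitter spreads retries out so parallel workers don't
    hammer the endpoint in lockstep.
    """
    for attempt in range(max_retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries, surface the last error
            time.sleep(base * (2 ** attempt) + random.uniform(0, base))
```

Pair this with per-IP rate limits (a few requests per minute) rather than relying on retries alone.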

Cookie Consent Screens

YouTube serves cookie consent screens (especially in the EU) that block content until dismissed. Scrapers must handle these interstitial pages before accessing actual content.


Method 1: Using SimpleCrawl API (Easiest)

SimpleCrawl renders YouTube pages fully, handles consent screens, and returns structured video data:

curl -X POST https://api.simplecrawl.com/v1/scrape \
  -H "Authorization: Bearer sc_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
    "format": "extract",
    "schema": {
      "title": "string",
      "channel": "string",
      "views": "string",
      "likes": "string",
      "publish_date": "string",
      "description": "string",
      "duration": "string",
      "comments": [{
        "author": "string",
        "text": "string",
        "likes": "number"
      }]
    }
  }'
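The same request can be issued from Python using only the standard library. This sketch mirrors the curl example above; the endpoint, auth header, and schema fields are taken from that example, not from separately verified API documentation:

```python
import json
import urllib.request

API_URL = "https://api.simplecrawl.com/v1/scrape"  # endpoint from the curl example

def build_scrape_payload(video_url: str) -> dict:
    """Assemble the extract-format request body shown in the curl example."""
    return {
        "url": video_url,
        "format": "extract",
        "schema": {
            "title": "string",
            "channel": "string",
            "views": "string",
            "likes": "string",
        },
    }

def scrape_video(video_url: str, api_key: str) -> dict:
    """POST the payload and return the parsed JSON response."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_scrape_payload(video_url)).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Extending the schema dict is all it takes to pull additional fields such as comments or publish dates.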

For channel data:

{
  "url": "https://www.youtube.com/@veritasium",
  "format": "extract",
  "schema": {
    "channel_name": "string",
    "subscribers": "string",
    "total_videos": "number",
    "description": "string",
    "recent_videos": [{
      "title": "string",
      "views": "string",
      "published": "string",
      "duration": "string"
    }]
  }
}

For markdown output (useful for feeding video content into RAG pipelines):

curl -X POST https://api.simplecrawl.com/v1/scrape \
  -H "Authorization: Bearer sc_your_api_key" \
  -d '{"url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ", "format": "markdown"}'

Method 2: DIY with Python (Manual)

Using the YouTube Data API

The official API is the most reliable approach for basic metadata:

from googleapiclient.discovery import build

API_KEY = "YOUR_YOUTUBE_API_KEY"
youtube = build("youtube", "v3", developerKey=API_KEY)

def get_video_data(video_id: str) -> dict:
    request = youtube.videos().list(
        part="snippet,statistics,contentDetails",
        id=video_id,
    )
    response = request.execute()

    if not response["items"]:
        return {"error": "Video not found"}

    item = response["items"][0]
    return {
        "title": item["snippet"]["title"],
        "channel": item["snippet"]["channelTitle"],
        "description": item["snippet"]["description"],
        "views": int(item["statistics"]["viewCount"]),
        "likes": int(item["statistics"].get("likeCount", 0)),
        "comments": int(item["statistics"].get("commentCount", 0)),
        "duration": item["contentDetails"]["duration"],
        "published": item["snippet"]["publishedAt"],
        "tags": item["snippet"].get("tags", []),
    }

video = get_video_data("dQw4w9WgXcQ")
print(f"{video['title']} — {video['views']:,} views")
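Note that the `duration` field comes back in ISO 8601 form (e.g. `PT3M33S`). A small helper converts it to seconds, assuming videos stay under 24 hours (day components are not handled):

```python
import re

def parse_iso8601_duration(duration: str) -> int:
    """Convert an ISO 8601 duration like 'PT1H2M3S' to total seconds."""
    m = re.fullmatch(r"PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?", duration)
    if not m:
        raise ValueError(f"Unrecognized duration: {duration}")
    hours, minutes, seconds = (int(g) if g else 0 for g in m.groups())
    return hours * 3600 + minutes * 60 + seconds
```

Converting up front makes durations sortable and filterable (e.g. excluding Shorts under 60 seconds).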

Scraping Comments with yt-dlp

import subprocess
import json

def get_youtube_comments(video_url: str, max_comments: int = 100) -> list:
    cmd = [
        "yt-dlp",
        "--write-comments",
        "--skip-download",
        "--no-write-thumbnail",
        "-o", "%(id)s",
        "--dump-json",
        video_url,
    ]

    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"yt-dlp failed: {result.stderr.strip()}")
    data = json.loads(result.stdout)

    comments = []
    for c in data.get("comments", [])[:max_comments]:
        comments.append({
            "author": c.get("author"),
            "text": c.get("text"),
            "likes": c.get("like_count", 0),
            "timestamp": c.get("timestamp"),
        })

    return comments
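Once comments are in this shape, ranking them is straightforward. A hypothetical helper matching the dict keys returned by `get_youtube_comments` above:

```python
def top_comments(comments: list, n: int = 10, min_likes: int = 0) -> list:
    """Rank scraped comment dicts by like count, dropping low-engagement ones."""
    filtered = [c for c in comments if c.get("likes", 0) >= min_likes]
    return sorted(filtered, key=lambda c: c.get("likes", 0), reverse=True)[:n]
```

Sorting by likes is a cheap proxy for audience sentiment signal when sampling a few comments from videos that have millions.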

Extracting Transcripts

from youtube_transcript_api import YouTubeTranscriptApi

def get_transcript(video_id: str, language: str = "en") -> str:
    try:
        transcript = YouTubeTranscriptApi.get_transcript(video_id, languages=[language])
        return " ".join([entry["text"] for entry in transcript])
    except Exception as e:
        return f"Transcript unavailable: {e}"

text = get_transcript("dQw4w9WgXcQ")
print(text[:500])
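For the RAG pipelines mentioned earlier, a full transcript is usually too long to embed whole. A minimal chunking sketch with overlapping windows (character-based for simplicity; production pipelines often split on sentence or token boundaries instead):

```python
def chunk_transcript(text: str, chunk_size: int = 1000, overlap: int = 100) -> list:
    """Split a transcript into overlapping character chunks for embedding.

    The overlap preserves context across chunk boundaries, so a sentence
    cut at the end of one chunk also appears at the start of the next.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```

Each chunk can then be embedded and stored alongside the video ID and a rough timestamp for retrieval.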

For complete Python scraping patterns, see our web scraping with Python guide. Node.js users should check the Node.js guide.

Why SimpleCrawl Is Better for YouTube

Feature           | YouTube Data API  | DIY Scraping       | SimpleCrawl
------------------|-------------------|--------------------|----------------
Quota limits      | 10,000 units/day  | IP-based           | High throughput
Full comments     | Paginated (quota) | Complex scroll     | AI extraction
Transcripts       | Not available     | Library required   | Included
Channel analytics | Limited           | Fragile            | Structured
Setup             | API key + quota   | Playwright + libs  | One API call
Cost              | Free (limited)    | Time + proxies     | From $0

The YouTube Data API has strict quotas (a single search costs 100 units out of 10,000 daily). SimpleCrawl provides unrestricted access to all visible page data. See pricing for details.

Legal Considerations

  • YouTube's ToS prohibit scraping — Google's Terms of Service explicitly ban automated access to YouTube content.
  • YouTube Data API ToS — even the official API has usage restrictions on data storage, display, and commercial use.
  • Copyright — video content, thumbnails, and descriptions are copyrighted. Extracting transcripts for analysis is generally fair use; redistributing them may not be.
  • COPPA — scraping data related to children's content has additional legal restrictions under COPPA.
  • EU Digital Services Act — YouTube data scraping may be subject to the DSA's requirements for researchers and auditors.

Check YouTube's crawling rules with our robots.txt checker.

FAQ

Can I scrape YouTube without the Data API?

Yes. Web scraping captures everything visible on the page, including data the API doesn't expose (exact like counts, comment replies, related videos). SimpleCrawl handles this without API quotas.

How do I get YouTube video transcripts?

Use the youtube-transcript-api Python library for individual videos, or SimpleCrawl for batch extraction. Transcripts are available for most videos with auto-generated or manual captions.

Is yt-dlp good for YouTube data scraping?

yt-dlp excels at downloading video/audio and extracting metadata. For structured data extraction at scale (search results, channel analytics, comments), SimpleCrawl is more efficient.

How many YouTube videos can I scrape per day?

The YouTube Data API allows ~100 search queries per day (at 100 units each from a 10,000-unit quota). SimpleCrawl has no such quota limitation — see pricing for credit-based limits.

Can I scrape YouTube Shorts data?

Yes. YouTube Shorts pages contain the same metadata structure as regular videos. SimpleCrawl extracts title, views, likes, and comments from Shorts URLs.

Ready to try SimpleCrawl?

We're building the simplest web scraping API for AI. Join the waitlist and get 500 free credits at launch.
