SimpleCrawl

Web Scraping with Go: Colly, Goquery, and Beyond (2026)

Build fast, concurrent web scrapers with Go using Colly and Goquery. Learn high-performance data extraction patterns for production systems.

9 min read
go · golang · tutorial · web scraping · colly · goquery

Web scraping with Go combines raw performance with Go's exceptional concurrency model. While Python and JavaScript dominate hobbyist scraping, Go is the choice for production scrapers that need to process millions of pages with minimal resource usage. This guide covers Go's scraping ecosystem — from Colly's framework approach to raw HTTP clients with Goquery — with patterns for building scrapers that run reliably at scale.

Why Go for Web Scraping?

Go brings specific strengths to data extraction:

  • Concurrency primitives — goroutines and channels make parallel scraping trivial without callback hell
  • Compiled performance — often 10-100x faster than Python for CPU-bound parsing tasks
  • Low memory footprint — a Go scraper uses a fraction of the memory of a Python/Node.js equivalent
  • Single binary deployment — compile and deploy anywhere without runtime dependencies
  • Standard library — net/http, encoding/json, and encoding/csv cover most needs without third-party packages
  • Static typing — the compiler catches structural bugs before they hit production

For scripting-oriented approaches, see our Python guide or JavaScript guide. For type-safe Node.js scraping, check our TypeScript guide.

Setting Up Your Environment

mkdir go-scraper && cd go-scraper
go mod init go-scraper
go get github.com/gocolly/colly/v2
go get github.com/PuerkitoBio/goquery

Method 1: Colly (Framework)

Colly is Go's most popular scraping framework, providing request queueing, rate limiting, caching, and a callback-based API.

Basic Colly Scraper

package main

import (
	"encoding/json"
	"fmt"
	"os"

	"github.com/gocolly/colly/v2"
)

type Article struct {
	Title string `json:"title"`
	URL   string `json:"url"`
	Rank  int    `json:"rank"`
}

func main() {
	c := colly.NewCollector(
		colly.AllowedDomains("news.ycombinator.com"),
		colly.UserAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "+
			"AppleWebKit/537.36 Chrome/122.0.0.0 Safari/537.36"),
	)

	var articles []Article
	rank := 0

	c.OnHTML(".athing", func(e *colly.HTMLElement) {
		rank++
		articles = append(articles, Article{
			Title: e.ChildText(".titleline a"),
			URL:   e.ChildAttr(".titleline a", "href"),
			Rank:  rank,
		})
	})

	c.OnError(func(r *colly.Response, err error) {
		fmt.Printf("Error on %s: %v\n", r.Request.URL, err)
	})

	if err := c.Visit("https://news.ycombinator.com/"); err != nil {
		fmt.Printf("Visit failed: %v\n", err)
	}

	data, _ := json.MarshalIndent(articles, "", "  ")
	if err := os.WriteFile("articles.json", data, 0644); err != nil {
		fmt.Printf("Write failed: %v\n", err)
	}
	fmt.Printf("Scraped %d articles\n", len(articles))
}

Colly with Rate Limiting and Parallelism

package main

import (
	"fmt"
	"sync"
	"time"

	"github.com/gocolly/colly/v2"
)

type Product struct {
	Name  string `json:"name"`
	Price string `json:"price"`
	URL   string `json:"url"`
}

func main() {
	c := colly.NewCollector(
		colly.AllowedDomains("example.com"),
		colly.Async(true),
	)

	c.Limit(&colly.LimitRule{
		DomainGlob:  "*example.com*",
		Parallelism: 4,
		Delay:       time.Second,
		RandomDelay: 500 * time.Millisecond,
	})

	var (
		mu       sync.Mutex
		products []Product
	)

	c.OnHTML(".product-card", func(e *colly.HTMLElement) {
		mu.Lock() // async collectors invoke callbacks from multiple goroutines
		defer mu.Unlock()
		products = append(products, Product{
			Name:  e.ChildText("h3"),
			Price: e.ChildText(".price"),
			URL:   e.Request.AbsoluteURL(e.ChildAttr("a", "href")),
		})
	})

	// Follow pagination links
	c.OnHTML("a.next-page", func(e *colly.HTMLElement) {
		e.Request.Visit(e.Attr("href"))
	})

	c.OnRequest(func(r *colly.Request) {
		fmt.Printf("Visiting %s\n", r.URL)
	})

	c.Visit("https://example.com/products")
	c.Wait()

	fmt.Printf("Scraped %d products\n", len(products))
}

Colly with Proxy Rotation

package main

import (
	"fmt"

	"github.com/gocolly/colly/v2"
	"github.com/gocolly/colly/v2/proxy"
)

func main() {
	c := colly.NewCollector()

	rp, err := proxy.RoundRobinProxySwitcher(
		"http://user:pass@proxy1.example.com:8080",
		"http://user:pass@proxy2.example.com:8080",
		"http://user:pass@proxy3.example.com:8080",
	)
	if err != nil {
		panic(err)
	}

	c.SetProxyFunc(rp)

	c.OnHTML("h1", func(e *colly.HTMLElement) {
		fmt.Println(e.Text)
	})

	c.Visit("https://example.com")
}

Colly Site-Wide Crawler

package main

import (
	"encoding/json"
	"fmt"
	"os"
	"strings"
	"sync"

	"github.com/gocolly/colly/v2"
)

type Page struct {
	URL         string `json:"url"`
	Title       string `json:"title"`
	Description string `json:"description"`
	H1          string `json:"h1"`
}

func main() {
	domain := "example.com"
	c := colly.NewCollector(
		colly.AllowedDomains(domain),
		colly.MaxDepth(3),
		colly.Async(true),
	)

	c.Limit(&colly.LimitRule{
		DomainGlob:  "*",
		Parallelism: 8,
	})

	var (
		mu    sync.Mutex
		pages []Page
	)

	c.OnHTML("html", func(e *colly.HTMLElement) {
		mu.Lock() // async mode runs callbacks concurrently
		defer mu.Unlock()
		pages = append(pages, Page{
			URL:         e.Request.URL.String(),
			Title:       e.ChildText("title"),
			Description: e.ChildAttr("meta[name=description]", "content"),
			H1:          e.ChildText("h1"),
		})
	})

	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		link := e.Attr("href")
		if strings.HasPrefix(link, "/") || strings.Contains(link, domain) {
			e.Request.Visit(e.Request.AbsoluteURL(link))
		}
	})

	c.Visit(fmt.Sprintf("https://%s/", domain))
	c.Wait()

	data, _ := json.MarshalIndent(pages, "", "  ")
	os.WriteFile("site_crawl.json", data, 0644)
	fmt.Printf("Crawled %d pages\n", len(pages))
}

Method 2: HTTP Client + Goquery (Lightweight)

For maximum control without a framework, use Go's standard net/http with Goquery for jQuery-style DOM parsing.

Basic Goquery Scraping

package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	client := &http.Client{}
	req, _ := http.NewRequest("GET", "https://news.ycombinator.com/", nil)
	req.Header.Set("User-Agent",
		"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "+
			"AppleWebKit/537.36 Chrome/122.0.0.0 Safari/537.36")

	resp, err := client.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	doc, err := goquery.NewDocumentFromReader(resp.Body)
	if err != nil {
		log.Fatal(err)
	}

	doc.Find(".athing").Each(func(i int, s *goquery.Selection) {
		title := s.Find(".titleline a").First().Text()
		href, _ := s.Find(".titleline a").First().Attr("href")
		fmt.Printf("%d. %s\n   %s\n\n", i+1, title, href)
	})
}

Concurrent Scraping with Goroutines

package main

import (
	"fmt"
	"net/http"
	"sync"

	"github.com/PuerkitoBio/goquery"
)

type PageResult struct {
	URL   string
	Title string
	Error error
}

func scrapeURL(client *http.Client, url string) PageResult {
	req, _ := http.NewRequest("GET", url, nil)
	req.Header.Set("User-Agent", "Mozilla/5.0 (compatible; GoScraper/1.0)")

	resp, err := client.Do(req)
	if err != nil {
		return PageResult{URL: url, Error: err}
	}
	defer resp.Body.Close()

	doc, err := goquery.NewDocumentFromReader(resp.Body)
	if err != nil {
		return PageResult{URL: url, Error: err}
	}

	title := doc.Find("title").Text()
	return PageResult{URL: url, Title: title}
}

func main() {
	urls := []string{
		"https://example.com/page1",
		"https://example.com/page2",
		"https://example.com/page3",
		"https://example.com/page4",
		"https://example.com/page5",
	}

	client := &http.Client{}
	results := make(chan PageResult, len(urls))
	var wg sync.WaitGroup

	// Semaphore to limit concurrency
	sem := make(chan struct{}, 3)

	for _, url := range urls {
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			sem <- struct{}{}
			defer func() { <-sem }()

			results <- scrapeURL(client, u)
		}(url)
	}

	go func() {
		wg.Wait()
		close(results)
	}()

	for result := range results {
		if result.Error != nil {
			fmt.Printf("Error: %s — %v\n", result.URL, result.Error)
		} else {
			fmt.Printf("OK: %s — %s\n", result.URL, result.Title)
		}
	}
}

Worker Pool Pattern

For production scrapers processing thousands of URLs:

package main

import (
	"context"
	"fmt"
	"net/http"
	"sync"
	"time"

	"github.com/PuerkitoBio/goquery"
)

type Job struct {
	URL string
}

type Result struct {
	URL     string
	Title   string
	Links   int
	Elapsed time.Duration
	Error   error
}

func worker(ctx context.Context, client *http.Client, jobs <-chan Job, results chan<- Result, wg *sync.WaitGroup) {
	defer wg.Done()

	for job := range jobs {
		select {
		case <-ctx.Done():
			return
		default:
		}

		start := time.Now()
		req, _ := http.NewRequestWithContext(ctx, "GET", job.URL, nil)
		req.Header.Set("User-Agent", "Mozilla/5.0 (compatible; GoScraper/1.0)")

		resp, err := client.Do(req)
		if err != nil {
			results <- Result{URL: job.URL, Error: err, Elapsed: time.Since(start)}
			continue
		}

		doc, err := goquery.NewDocumentFromReader(resp.Body)
		resp.Body.Close()
		if err != nil {
			results <- Result{URL: job.URL, Error: err, Elapsed: time.Since(start)}
			continue
		}

		title := doc.Find("title").Text()
		links := doc.Find("a[href]").Length()

		results <- Result{
			URL:     job.URL,
			Title:   title,
			Links:   links,
			Elapsed: time.Since(start),
		}

		time.Sleep(500 * time.Millisecond) // Rate limit
	}
}

func main() {
	urls := make([]string, 100)
	for i := range urls {
		urls[i] = fmt.Sprintf("https://example.com/page/%d", i+1)
	}

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
	defer cancel()

	client := &http.Client{Timeout: 10 * time.Second}
	jobs := make(chan Job, len(urls))
	results := make(chan Result, len(urls))

	var wg sync.WaitGroup
	numWorkers := 5

	for i := 0; i < numWorkers; i++ {
		wg.Add(1)
		go worker(ctx, client, jobs, results, &wg)
	}

	for _, url := range urls {
		jobs <- Job{URL: url}
	}
	close(jobs)

	go func() {
		wg.Wait()
		close(results)
	}()

	var successCount, errorCount int
	for result := range results {
		if result.Error != nil {
			errorCount++
		} else {
			successCount++
		}
	}

	fmt.Printf("Done: %d success, %d errors\n", successCount, errorCount)
}

Method 3: SimpleCrawl API

For Go services that need web data without managing scraping infrastructure, SimpleCrawl provides a clean HTTP API:

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

const apiKey = "sc_your_api_key"

type ScrapeRequest struct {
	URL    string            `json:"url"`
	Format string            `json:"format"`
	Schema map[string]string `json:"schema,omitempty"`
}

type ScrapeResponse struct {
	URL         string `json:"url"`
	Title       string `json:"title"`
	Markdown    string `json:"markdown,omitempty"`
	CreditsUsed int    `json:"credits_used"`
}

func scrape(url string) (*ScrapeResponse, error) {
	reqBody, _ := json.Marshal(ScrapeRequest{
		URL:    url,
		Format: "markdown",
	})

	req, _ := http.NewRequest("POST", "https://api.simplecrawl.com/v1/scrape", bytes.NewReader(reqBody))
	req.Header.Set("Authorization", "Bearer "+apiKey)
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, fmt.Errorf("request failed: %w", err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("API error %d: %s", resp.StatusCode, body)
	}

	var result ScrapeResponse
	if err := json.Unmarshal(body, &result); err != nil {
		return nil, fmt.Errorf("decode failed: %w", err)
	}

	return &result, nil
}

func main() {
	result, err := scrape("https://example.com")
	if err != nil {
		panic(err)
	}

	fmt.Printf("Title: %s\n", result.Title)
	fmt.Printf("Markdown length: %d\n", len(result.Markdown))
	fmt.Printf("Credits used: %d\n", result.CreditsUsed)
}

See the pricing page for API credits.

Choosing the Right Approach

Approach             | Best For       | Concurrency | JS Rendering | Complexity
Colly                | Site crawling  | Built-in    | No           | Low-Medium
Goquery + goroutines | Custom control | Manual      | No           | Medium
SimpleCrawl API      | All pages      | N/A         | Yes          | Very Low

Go has no official Playwright or Puppeteer port. For JavaScript-rendered pages, drive Chrome with chromedp, use SimpleCrawl, or call a headless browser service via API.

For scraping specific sites, check our domain guides: Amazon, Google, Reddit. Compare tools on our best web scraping APIs page.

FAQ

Can Go scrape JavaScript-rendered pages?

Go doesn't have a native Playwright/Puppeteer equivalent. Options include: chromedp (Chrome DevTools Protocol), calling a headless browser via Docker, or using SimpleCrawl which handles JS rendering server-side.

Is Go faster than Python for web scraping?

For HTTP requests and HTML parsing, Go is significantly faster and uses less memory. However, scraping speed is usually network-bound, not CPU-bound. Go's advantage shows at scale (millions of pages) where memory efficiency and goroutine concurrency reduce infrastructure costs.

Should I use Colly or raw HTTP + Goquery?

Use Colly for typical crawling tasks — it handles queueing, deduplication, rate limiting, and caching out of the box. Use raw HTTP + Goquery when you need custom request logic, specific connection pooling, or non-standard crawling patterns.

How do I handle CAPTCHAs in Go?

Go doesn't have built-in CAPTCHA solving. Integrate with a service like 2Captcha via their REST API, or use SimpleCrawl which includes built-in CAPTCHA bypass.

Can I compile a Go scraper for deployment?

Yes — that's one of Go's biggest advantages. Run go build -o scraper and deploy the single binary to any Linux server, Docker container, or cloud function without installing dependencies.
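Cross-compilation is controlled entirely by environment variables; for example, to produce a static Linux binary from macOS or Windows (the binary name is illustrative):

```shell
# Target Linux/amd64 from any host OS; CGO_ENABLED=0 yields a static binary
GOOS=linux GOARCH=amd64 CGO_ENABLED=0 go build -o scraper .

# Inspect the result — it should report a Linux ELF executable
file scraper
```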

Ready to try SimpleCrawl?

We're building the simplest web scraping API for AI. Join the waitlist and get 500 free credits at launch.
