Web Scraping with Go: Colly, Goquery, and Beyond (2026)
Build fast, concurrent web scrapers with Go using Colly and Goquery. Learn high-performance data extraction patterns for production systems.
Web scraping with Go combines raw performance with Go's exceptional concurrency model. While Python and JavaScript dominate hobbyist scraping, Go is the choice for production scrapers that need to process millions of pages with minimal resource usage. This guide covers Go's scraping ecosystem — from Colly's framework approach to raw HTTP clients with Goquery — with patterns for building scrapers that run reliably at scale.
Why Go for Web Scraping?
Go brings specific strengths to data extraction:
- Concurrency primitives — goroutines and channels make parallel scraping trivial without callback hell
- Compiled performance — 10-100x faster than Python for CPU-bound parsing tasks
- Low memory footprint — a Go scraper uses a fraction of the memory of a Python/Node.js equivalent
- Single binary deployment — compile and deploy anywhere without runtime dependencies
- Standard library — net/http, encoding/json, and encoding/csv cover most needs without third-party packages
- Static typing — the compiler catches structural bugs before they hit production
For scripting-oriented approaches, see our Python guide or JavaScript guide. For type-safe Node.js scraping, check our TypeScript guide.
Setting Up Your Environment
mkdir go-scraper && cd go-scraper
go mod init go-scraper
go get github.com/gocolly/colly/v2
go get github.com/PuerkitoBio/goquery
Method 1: Colly (Full-Featured Framework)
Colly is Go's most popular scraping framework, providing request queueing, rate limiting, caching, and a callback-based API.
Basic Colly Scraper
package main

import (
    "encoding/json"
    "fmt"
    "log"
    "os"

    "github.com/gocolly/colly/v2"
)

type Article struct {
    Title string `json:"title"`
    URL   string `json:"url"`
    Rank  int    `json:"rank"`
}

func main() {
    c := colly.NewCollector(
        colly.AllowedDomains("news.ycombinator.com"),
        colly.UserAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "+
            "AppleWebKit/537.36 Chrome/122.0.0.0 Safari/537.36"),
    )

    var articles []Article
    rank := 0

    c.OnHTML(".athing", func(e *colly.HTMLElement) {
        rank++
        articles = append(articles, Article{
            Title: e.ChildText(".titleline a"),
            URL:   e.ChildAttr(".titleline a", "href"),
            Rank:  rank,
        })
    })

    c.OnError(func(r *colly.Response, err error) {
        fmt.Printf("Error on %s: %v\n", r.Request.URL, err)
    })

    if err := c.Visit("https://news.ycombinator.com/"); err != nil {
        log.Fatal(err)
    }

    data, _ := json.MarshalIndent(articles, "", "  ")
    os.WriteFile("articles.json", data, 0644)
    fmt.Printf("Scraped %d articles\n", len(articles))
}
Colly with Rate Limiting and Parallelism
package main

import (
    "fmt"
    "sync"
    "time"

    "github.com/gocolly/colly/v2"
)

type Product struct {
    Name  string `json:"name"`
    Price string `json:"price"`
    URL   string `json:"url"`
}

func main() {
    c := colly.NewCollector(
        colly.AllowedDomains("example.com"),
        colly.Async(true),
    )

    c.Limit(&colly.LimitRule{
        DomainGlob:  "*example.com*",
        Parallelism: 4,
        Delay:       time.Second,
        RandomDelay: 500 * time.Millisecond,
    })

    var (
        mu       sync.Mutex
        products []Product
    )

    c.OnHTML(".product-card", func(e *colly.HTMLElement) {
        mu.Lock() // callbacks run concurrently with Async(true), so guard the shared slice
        defer mu.Unlock()
        products = append(products, Product{
            Name:  e.ChildText("h3"),
            Price: e.ChildText(".price"),
            URL:   e.Request.AbsoluteURL(e.ChildAttr("a", "href")),
        })
    })

    // Follow pagination links
    c.OnHTML("a.next-page", func(e *colly.HTMLElement) {
        e.Request.Visit(e.Attr("href"))
    })

    c.OnRequest(func(r *colly.Request) {
        fmt.Printf("Visiting %s\n", r.URL)
    })

    c.Visit("https://example.com/products")
    c.Wait()
    fmt.Printf("Scraped %d products\n", len(products))
}
Colly with Proxy Rotation
package main

import (
    "fmt"

    "github.com/gocolly/colly/v2"
    "github.com/gocolly/colly/v2/proxy"
)

func main() {
    c := colly.NewCollector()

    rp, err := proxy.RoundRobinProxySwitcher(
        "http://user:pass@proxy1.example.com:8080",
        "http://user:pass@proxy2.example.com:8080",
        "http://user:pass@proxy3.example.com:8080",
    )
    if err != nil {
        panic(err)
    }
    c.SetProxyFunc(rp)

    c.OnHTML("h1", func(e *colly.HTMLElement) {
        fmt.Println(e.Text)
    })

    c.Visit("https://example.com")
}
Colly Site-Wide Crawler
package main

import (
    "encoding/json"
    "fmt"
    "os"
    "strings"
    "sync"

    "github.com/gocolly/colly/v2"
)

type Page struct {
    URL         string `json:"url"`
    Title       string `json:"title"`
    Description string `json:"description"`
    H1          string `json:"h1"`
}

func main() {
    domain := "example.com"

    c := colly.NewCollector(
        colly.AllowedDomains(domain),
        colly.MaxDepth(3),
        colly.Async(true),
    )

    c.Limit(&colly.LimitRule{
        DomainGlob:  "*",
        Parallelism: 8,
    })

    var (
        mu    sync.Mutex
        pages []Page
    )

    c.OnHTML("html", func(e *colly.HTMLElement) {
        mu.Lock() // callbacks run concurrently with Async(true), so guard the shared slice
        defer mu.Unlock()
        pages = append(pages, Page{
            URL:         e.Request.URL.String(),
            Title:       e.ChildText("title"),
            Description: e.ChildAttr("meta[name=description]", "content"),
            H1:          e.ChildText("h1"),
        })
    })

    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        link := e.Attr("href")
        if strings.HasPrefix(link, "/") || strings.Contains(link, domain) {
            e.Request.Visit(e.Request.AbsoluteURL(link))
        }
    })

    c.Visit(fmt.Sprintf("https://%s/", domain))
    c.Wait()

    data, _ := json.MarshalIndent(pages, "", "  ")
    os.WriteFile("site_crawl.json", data, 0644)
    fmt.Printf("Crawled %d pages\n", len(pages))
}
Method 2: HTTP Client + Goquery (Lightweight)
For maximum control without a framework, use Go's standard net/http with Goquery for jQuery-style DOM parsing.
Basic Goquery Scraping
package main

import (
    "fmt"
    "log"
    "net/http"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    client := &http.Client{}
    req, _ := http.NewRequest("GET", "https://news.ycombinator.com/", nil)
    req.Header.Set("User-Agent",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "+
            "AppleWebKit/537.36 Chrome/122.0.0.0 Safari/537.36")

    resp, err := client.Do(req)
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    if resp.StatusCode != http.StatusOK {
        log.Fatalf("unexpected status: %s", resp.Status)
    }

    doc, err := goquery.NewDocumentFromReader(resp.Body)
    if err != nil {
        log.Fatal(err)
    }

    doc.Find(".athing").Each(func(i int, s *goquery.Selection) {
        title := s.Find(".titleline a").First().Text()
        href, _ := s.Find(".titleline a").First().Attr("href")
        fmt.Printf("%d. %s\n   %s\n\n", i+1, title, href)
    })
}
Concurrent Scraping with Goroutines
package main

import (
    "fmt"
    "net/http"
    "sync"

    "github.com/PuerkitoBio/goquery"
)

type PageResult struct {
    URL   string
    Title string
    Error error
}

func scrapeURL(client *http.Client, url string) PageResult {
    req, _ := http.NewRequest("GET", url, nil)
    req.Header.Set("User-Agent", "Mozilla/5.0 (compatible; GoScraper/1.0)")

    resp, err := client.Do(req)
    if err != nil {
        return PageResult{URL: url, Error: err}
    }
    defer resp.Body.Close()

    doc, err := goquery.NewDocumentFromReader(resp.Body)
    if err != nil {
        return PageResult{URL: url, Error: err}
    }

    title := doc.Find("title").Text()
    return PageResult{URL: url, Title: title}
}

func main() {
    urls := []string{
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3",
        "https://example.com/page4",
        "https://example.com/page5",
    }

    client := &http.Client{}
    results := make(chan PageResult, len(urls))
    var wg sync.WaitGroup

    // Semaphore to limit concurrency
    sem := make(chan struct{}, 3)

    for _, url := range urls {
        wg.Add(1)
        go func(u string) {
            defer wg.Done()
            sem <- struct{}{}
            defer func() { <-sem }()
            results <- scrapeURL(client, u)
        }(url)
    }

    go func() {
        wg.Wait()
        close(results)
    }()

    for result := range results {
        if result.Error != nil {
            fmt.Printf("Error: %s — %v\n", result.URL, result.Error)
        } else {
            fmt.Printf("OK: %s — %s\n", result.URL, result.Title)
        }
    }
}
Worker Pool Pattern
For production scrapers processing thousands of URLs:
package main

import (
    "context"
    "fmt"
    "net/http"
    "sync"
    "time"

    "github.com/PuerkitoBio/goquery"
)

type Job struct {
    URL string
}

type Result struct {
    URL     string
    Title   string
    Links   int
    Elapsed time.Duration
    Error   error
}

func worker(ctx context.Context, client *http.Client, jobs <-chan Job, results chan<- Result, wg *sync.WaitGroup) {
    defer wg.Done()
    for job := range jobs {
        select {
        case <-ctx.Done():
            return
        default:
        }

        start := time.Now()
        req, _ := http.NewRequestWithContext(ctx, "GET", job.URL, nil)
        req.Header.Set("User-Agent", "Mozilla/5.0 (compatible; GoScraper/1.0)")

        resp, err := client.Do(req)
        if err != nil {
            results <- Result{URL: job.URL, Error: err, Elapsed: time.Since(start)}
            continue
        }

        doc, err := goquery.NewDocumentFromReader(resp.Body)
        resp.Body.Close()
        if err != nil {
            results <- Result{URL: job.URL, Error: err, Elapsed: time.Since(start)}
            continue
        }

        title := doc.Find("title").Text()
        links := doc.Find("a[href]").Length()

        results <- Result{
            URL:     job.URL,
            Title:   title,
            Links:   links,
            Elapsed: time.Since(start),
        }

        time.Sleep(500 * time.Millisecond) // Rate limit
    }
}

func main() {
    urls := make([]string, 100)
    for i := range urls {
        urls[i] = fmt.Sprintf("https://example.com/page/%d", i+1)
    }

    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
    defer cancel()

    client := &http.Client{Timeout: 10 * time.Second}
    jobs := make(chan Job, len(urls))
    results := make(chan Result, len(urls))

    var wg sync.WaitGroup
    numWorkers := 5
    for i := 0; i < numWorkers; i++ {
        wg.Add(1)
        go worker(ctx, client, jobs, results, &wg)
    }

    for _, url := range urls {
        jobs <- Job{URL: url}
    }
    close(jobs)

    go func() {
        wg.Wait()
        close(results)
    }()

    var successCount, errorCount int
    for result := range results {
        if result.Error != nil {
            errorCount++
        } else {
            successCount++
        }
    }
    fmt.Printf("Done: %d success, %d errors\n", successCount, errorCount)
}
Method 3: SimpleCrawl API
For Go services that need web data without managing scraping infrastructure, SimpleCrawl provides a clean HTTP API:
package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "io"
    "net/http"
)

const apiKey = "sc_your_api_key"

type ScrapeRequest struct {
    URL    string            `json:"url"`
    Format string            `json:"format"`
    Schema map[string]string `json:"schema,omitempty"`
}

type ScrapeResponse struct {
    URL         string `json:"url"`
    Title       string `json:"title"`
    Markdown    string `json:"markdown,omitempty"`
    CreditsUsed int    `json:"credits_used"`
}

func scrape(url string) (*ScrapeResponse, error) {
    reqBody, _ := json.Marshal(ScrapeRequest{
        URL:    url,
        Format: "markdown",
    })

    req, _ := http.NewRequest("POST", "https://api.simplecrawl.com/v1/scrape", bytes.NewReader(reqBody))
    req.Header.Set("Authorization", "Bearer "+apiKey)
    req.Header.Set("Content-Type", "application/json")

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return nil, fmt.Errorf("request failed: %w", err)
    }
    defer resp.Body.Close()

    body, _ := io.ReadAll(resp.Body)
    if resp.StatusCode != http.StatusOK {
        return nil, fmt.Errorf("API error %d: %s", resp.StatusCode, body)
    }

    var result ScrapeResponse
    if err := json.Unmarshal(body, &result); err != nil {
        return nil, fmt.Errorf("decode failed: %w", err)
    }
    return &result, nil
}

func main() {
    result, err := scrape("https://example.com")
    if err != nil {
        panic(err)
    }
    fmt.Printf("Title: %s\n", result.Title)
    fmt.Printf("Markdown length: %d\n", len(result.Markdown))
    fmt.Printf("Credits used: %d\n", result.CreditsUsed)
}
See the pricing page for API credits.
Choosing the Right Approach
| Approach | Best For | Concurrency | JS Rendering | Complexity |
|---|---|---|---|---|
| Colly | Site crawling | Built-in | No | Low-Medium |
| Goquery + goroutines | Custom control | Manual | No | Medium |
| SimpleCrawl API | All pages | N/A | Yes | Very Low |
Go has no first-party headless-browser library on par with Python's Playwright or Node's Puppeteer. For JavaScript-rendered pages, drive Chrome via chromedp, use SimpleCrawl, or call a headless browser service via API.
For scraping specific sites, check our domain guides: Amazon, Google, Reddit. Compare tools on our best web scraping APIs page.
FAQ
Can Go scrape JavaScript-rendered pages?
Go doesn't have a native Playwright/Puppeteer equivalent. Options include: chromedp (Chrome DevTools Protocol), calling a headless browser via Docker, or using SimpleCrawl which handles JS rendering server-side.
Is Go faster than Python for web scraping?
For HTTP requests and HTML parsing, Go is significantly faster and uses less memory. However, scraping speed is usually network-bound, not CPU-bound. Go's advantage shows at scale (millions of pages) where memory efficiency and goroutine concurrency reduce infrastructure costs.
Should I use Colly or raw HTTP + Goquery?
Colly for typical crawling tasks — it handles queueing, deduplication, rate limiting, and caching out of the box. Use raw HTTP + Goquery when you need custom request logic, specific connection pooling, or non-standard crawling patterns.
How do I handle CAPTCHAs in Go?
Go doesn't have built-in CAPTCHA solving. Integrate with a service like 2Captcha via their REST API, or use SimpleCrawl which includes built-in CAPTCHA bypass.
Can I compile a Go scraper for deployment?
Yes — that's one of Go's biggest advantages. Run go build -o scraper and deploy the single binary to any Linux server, Docker container, or cloud function without installing dependencies.
Ready to try SimpleCrawl?
We're building the simplest web scraping API for AI. Join the waitlist and get 500 free credits at launch.
More guides
Web Scraping with JavaScript: Node.js Guide (2026)
Master web scraping with JavaScript using fetch, cheerio, and Puppeteer. Learn practical data extraction techniques for Node.js, plus how SimpleCrawl makes it effortless.
Web Scraping with Node.js: Complete Tutorial (2026)
Build powerful web scrapers with Node.js using Playwright, Crawlee, and async patterns. Learn advanced techniques for data extraction at scale with practical code examples.
Web Scraping with Python: The Complete Guide (2026)
Master web scraping with Python using requests, BeautifulSoup, Playwright, and Scrapy. Learn practical techniques for extracting data from any website, plus how SimpleCrawl simplifies everything.
Web Scraping with TypeScript: Type-Safe Data Extraction (2026)
Build reliable web scrapers with TypeScript using typed schemas, Zod validation, Playwright, and Cheerio. Catch scraping bugs at compile time, not in production.