🚜 A Simple Web Scraper in Go
In my previous job at Sendwithus, we’d been having trouble writing performant concurrent systems in Python. After many attempts, we came to the conclusion that Python just wasn’t suitable for some of our high throughput tasks, so we started experimenting with Go as a potential replacement.
After making it all the way through the Golang Interactive Tour, which I highly recommend if you haven’t done it already, I wanted to build something real. The last task in the Go tour is to build a concurrent web crawler, but it fakes the fun parts like making HTTP requests and parsing HTML. That’s what motivated me to open my IDE and try it myself. This post will walk you through the steps I took to build a simple web scraper in Go.
We’ll go over three main topics:
- using the net/http package to fetch a web page
- using the golang.org/x/net/html package to parse an HTML document
- using Go concurrency with multi-channel communication
In order to keep this tutorial short, I won’t be accommodating those of you that haven’t yet finished the Go Tour. The tour will teach you everything you need to know to follow along.
Building a Web Scraper
As I mentioned in the introduction, we’ll be building a simple web scraper in Go. Note that I didn’t say web crawler because our scraper will only be going one level deep (maybe I’ll cover crawling in another post).
We’re going to be building a basic command line tool that takes an input of seed URLs, scrapes them, then prints the links it finds on those pages.
Here’s an example of it in action:
```
$ go run main.go https://schier.co https://insomnia.rest

Found 7 unique urls:
 - https://insomnia.rest
 - https://twitter.com/GregorySchier
 - https://support.insomnia.rest
 - https://chat.insomnia.rest
 - https://github.com/Kong/insomnia
 - https://twitter.com/GetInsomnia
 - https://konghq.com
```
Now that we know what we’re building, let’s get to the fun part—putting it together.
To make this tutorial easier to digest, I’ll be breaking it down into isolated components. After going over each component, I’ll put them all together to form the final product. The first component we’ll be going over is making an HTTP request to fetch some HTML.
1. Fetching a Web Page By URL
Go includes a really good HTTP library out of the box. The net/http package provides an http.Get(url) function that only requires a few lines of code.
Note that things like error handling are omitted to keep this example short.
```go
//~~~~~~~~~~~~~~~~~~~~~~//
// Make an HTTP request //
//~~~~~~~~~~~~~~~~~~~~~~//

resp, _ := http.Get(url)
bytes, _ := ioutil.ReadAll(resp.Body)
fmt.Println("HTML:\n\n", string(bytes))
resp.Body.Close()
```
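If you’d like to see the same request with the error handling left in, here is a minimal, self-contained sketch. The URL is just a placeholder, and calling log.Fatal is only one of several reasonable ways to bail out:

```go
package main

import (
	"fmt"
	"io/ioutil"
	"log"
	"net/http"
)

func main() {
	// Placeholder URL for illustration only.
	url := "https://example.com"

	resp, err := http.Get(url)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// Read the entire response body into memory.
	bytes, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println("HTML:\n\n", string(bytes))
}
```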
Making an HTTP request is the foundation of a web scraper so now that we know how to do that, we can move on to handling the HTML contents returned.
2. Finding Tags in HTML
Go doesn’t have a core package for parsing HTML, but there is one included in the Go sub-repositories that we can import from golang.org/x/net/html.
If you’ve never interacted with an XML or HTML tokenizer before, this may take some time to grasp but I believe in you.
The module’s tokenizer splits the HTML document into “tokens” that can be iterated over. So, to find anchor tags (links), we can tokenize the HTML and iterate over the tokens to find the tags. Here are the possible things that a token can represent (documentation):
Token Name | Token Description
---|---
ErrorToken | error during tokenization (or end of document)
TextToken | text node (contents of an element)
StartTagToken | example: `<a>`
EndTagToken | example: `</a>`
SelfClosingTagToken | example: `<br/>`
CommentToken | example: `<!-- comment -->`
DoctypeToken | example: `<!DOCTYPE html>`
The code below demonstrates how to find all the opening anchor tags in an HTML document.
```go
//~~~~~~~~~~~~~~~~~~~~~~~~~~~~//
// Parse HTML for Anchor Tags //
//~~~~~~~~~~~~~~~~~~~~~~~~~~~~//

z := html.NewTokenizer(response.Body)

for {
	tt := z.Next()

	switch {
	case tt == html.ErrorToken:
		// End of the document, we're done
		return
	case tt == html.StartTagToken:
		t := z.Token()

		isAnchor := t.Data == "a"
		if isAnchor {
			fmt.Println("We found a link!")
		}
	}
}
```
Now that we have found the anchor tags, how do we get the href value? Unfortunately, it’s not as easy as you might expect. A token stores its attributes in an array, so we have to perform a similar iteration technique.
```go
//~~~~~~~~~~~~~~~~~~~~//
// Find Tag Attribute //
//~~~~~~~~~~~~~~~~~~~~//

for _, a := range t.Attr {
	if a.Key == "href" {
		fmt.Println("Found href:", a.Val)
		break
	}
}
```
At this point we know how to fetch HTML using an HTTP request, as well as extract the links from that HTML document. Now let’s put it all together.
3. Introducing Goroutines and Channels
In order to make our scraper performant, and to make this tutorial a bit more advanced, we’ll make use of goroutines and channels, Go’s utilities for executing concurrent tasks.
The trickiest part of this scraper is how it uses channels. In order for the scraper to run quickly, it needs to fetch all URLs concurrently. When concurrency is applied, total execution time should equal the time taken to fetch the slowest request. Without concurrency, execution time would equal the sum of all request times since it would be executing them one after the other. So how do we do this?
The approach I took is to create a goroutine for each request and have each one publish the URLs it finds to a shared channel. There’s one problem with this though. How do we know when the last URL is sent to the channel so we can close it? For this, we can use a second channel for communicating status.
The second channel is simply a notification channel. After a goroutine has published all of its URLs to the main channel, it publishes a done message to the notification channel. The main thread can then subscribe to the notification channel and stop execution once every goroutine has signalled that it’s finished. Don’t worry, this will make much more sense when you see the finished code.
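Before looking at the full program, here is a minimal sketch of just this two-channel pattern, with a hypothetical work function standing in for the actual page fetch. The names and the number of workers are made up for illustration; the channel mechanics are the same as in the scraper below:

```go
package main

import "fmt"

// work stands in for fetching and parsing a page. It publishes its results
// to chResults and then signals on chFinished that it has nothing more to send.
func work(id int, chResults chan string, chFinished chan bool) {
	defer func() {
		// Notify the main goroutine that this worker is done.
		chFinished <- true
	}()

	chResults <- fmt.Sprintf("result from worker %d", id)
}

func main() {
	chResults := make(chan string)
	chFinished := make(chan bool)

	numWorkers := 3
	for i := 0; i < numWorkers; i++ {
		go work(i, chResults, chFinished)
	}

	// Keep reading results until every worker has reported that it is done.
	for done := 0; done < numWorkers; {
		select {
		case r := <-chResults:
			fmt.Println("got:", r)
		case <-chFinished:
			done++
		}
	}
}
```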
Putting it All Together
If you’ve made it this far, you should know everything necessary to understand the full program, so here it is. I’ve also added a few comments to help explain some of the more complicated parts.
```go
package main

import (
	"fmt"
	"golang.org/x/net/html"
	"net/http"
	"os"
	"strings"
)

// Helper function to pull the href attribute from a Token
func getHref(t html.Token) (ok bool, href string) {
	// Iterate over token attributes until we find an "href"
	for _, a := range t.Attr {
		if a.Key == "href" {
			href = a.Val
			ok = true
		}
	}

	// "bare" return will return the variables (ok, href) as
	// defined in the function definition
	return
}

// Extract all http** links from a given webpage
func crawl(url string, ch chan string, chFinished chan bool) {
	resp, err := http.Get(url)

	defer func() {
		// Notify that we're done after this function
		chFinished <- true
	}()

	if err != nil {
		fmt.Println("ERROR: Failed to crawl:", url)
		return
	}

	b := resp.Body
	defer b.Close() // close Body when the function completes

	z := html.NewTokenizer(b)

	for {
		tt := z.Next()

		switch {
		case tt == html.ErrorToken:
			// End of the document, we're done
			return
		case tt == html.StartTagToken:
			t := z.Token()

			// Check if the token is an <a> tag
			isAnchor := t.Data == "a"
			if !isAnchor {
				continue
			}

			// Extract the href value, if there is one
			ok, url := getHref(t)
			if !ok {
				continue
			}

			// Make sure the url begins with http**
			hasProto := strings.Index(url, "http") == 0
			if hasProto {
				ch <- url
			}
		}
	}
}

func main() {
	foundUrls := make(map[string]bool)
	seedUrls := os.Args[1:]

	// Channels
	chUrls := make(chan string)
	chFinished := make(chan bool)

	// Kick off the crawl process (concurrently)
	for _, url := range seedUrls {
		go crawl(url, chUrls, chFinished)
	}

	// Subscribe to both channels
	for c := 0; c < len(seedUrls); {
		select {
		case url := <-chUrls:
			foundUrls[url] = true
		case <-chFinished:
			c++
		}
	}

	// We're done! Print the results.
	fmt.Println("\nFound", len(foundUrls), "unique urls:\n")

	for url := range foundUrls {
		fmt.Println(" - " + url)
	}

	close(chUrls)
}
```
That wraps up the tutorial for a basic Go web scraper! We’ve covered making HTTP requests, parsing HTML, and even some complex concurrency patterns.
If you’d like to take it a step further, try turning this web scraper into a web crawler and feed the URLs it finds back in as inputs. Then, see how far your crawler gets. 🚀
As always, thanks for reading! 🙂
If you enjoyed this tutorial, please consider sponsoring my work on GitHub 🤗
[Golang] Download HTML From URL
Download and save HTML file from given URL via Golang. Do nothing if the HTML file already locally exists.
```go
package main

import (
	"flag"
	"fmt"
	"io"
	"net/http"
	"os"
	"path"
)

func download(url, filename string) (err error) {
	fmt.Println("Downloading", url, "to", filename)

	resp, err := http.Get(url)
	if err != nil {
		return
	}
	defer resp.Body.Close()

	f, err := os.Create(filename)
	if err != nil {
		return
	}
	defer f.Close()

	_, err = io.Copy(f, resp.Body)
	return
}

func main() {
	pUrl := flag.String("url", "", "URL to be processed")
	flag.Parse()

	url := *pUrl
	if url == "" {
		fmt.Fprintf(os.Stderr, "Error: empty URL!\n")
		return
	}

	filename := path.Base(url)
	fmt.Println("Checking if " + filename + " exists ...")
	if _, err := os.Stat(filename); os.IsNotExist(err) {
		err := download(url, filename)
		if err != nil {
			panic(err)
		}
		fmt.Println(filename + " saved!")
	} else {
		fmt.Println(filename + " already exists!")
	}
}
```
```make
export GOROOT=$(realpath ../../../../../go)
export GOPATH=$(realpath .)
export PATH := $(GOROOT)/bin:$(GOPATH)/bin:$(PATH)

URL=https://siongui.github.io/index.html

default:
	@echo "\033[92mProcessing ${URL} ...\033[0m"
	@go run download.go -url=${URL}

fmt:
	@echo "\033[92mGo fmt source code ...\033[0m"
	@go fmt *.go
```
Tested on: Ubuntu Linux 18.04, Go 1.10.1.
How to download and parse HTML page in Go
This example uses goquery to request an HTML page (https://techoverflow.net) via the Go net/http client, then uses goquery with a simple CSS-style query to select the `<title>` HTML tag and print its content.
```go
package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	// Perform request
	resp, err := http.Get("https://techoverflow.net")
	if err != nil {
		print(err)
		return
	}
	// Cleanup when this function ends
	defer resp.Body.Close()

	// Read & parse response data
	doc, err := goquery.NewDocumentFromReader(resp.Body)
	if err != nil {
		log.Fatal(err)
	}

	// Print content of <title>
	doc.Find("title").Each(func(i int, s *goquery.Selection) {
		fmt.Printf("Title of the page: %s\n", s.Text())
	})
}
```
Title of the page: TechOverflow
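The same approach can be used to pull links out of a page, much like the scraper earlier in this post. Here is a small sketch assuming the standard goquery Find/Attr calls, again using https://techoverflow.net purely as an example URL:

```go
package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	resp, err := http.Get("https://techoverflow.net")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	doc, err := goquery.NewDocumentFromReader(resp.Body)
	if err != nil {
		log.Fatal(err)
	}

	// Select every anchor tag and print its href attribute, if present.
	doc.Find("a").Each(func(i int, s *goquery.Selection) {
		if href, ok := s.Attr("href"); ok {
			fmt.Println("Found link:", href)
		}
	})
}
```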