
How to Parse HTML in Ruby with Nokogiri?

APIs are the cornerstone of the modern internet as they enable different services to communicate with each other. With APIs, you can gather information from different sources and use different services. However, not all services provide an API for you to consume. Even if an API is offered, it might be limited in comparison to a service’s web application(s). Thankfully, you can use web scraping to overcome these limitations. Web scraping refers to the practice of extracting data from the HTML source of the web page. That is, instead of communicating with a server through APIs, web scraping lets you extract information directly from the web page itself.

Web scraping is often more accessible than using an API because with an API, you have to complete several additional steps: you need to get an API key, set up your script to make a request to the API with the correctly formatted request body, and finally, parse the response body to get the data you want. Then add in the time spent to read the API documentation and figure out the necessary details about the API, and you can see how this quickly becomes a hassle. With web scraping, however, you simply need to look at the HTML structure of the web page and determine which elements you want to extract the data from.


In this article, you’ll learn how to parse HTML from a web page in Ruby using the Nokogiri gem. You can find the code for this article in this GitHub repo.

What Is Nokogiri?

Nokogiri (鋸) is a lightweight Ruby gem that makes it easy to parse, build, and search HTML and XML documents. Nokogiri provides a DOM parser, SAX parser, and push parser, as well as the ability to search documents using XPath and CSS3 selectors.

How to Extract Text from a Web Page Using Nokogiri

In this tutorial, you’ll scrape this web page, which provides a database of NHL team stats since 1990. You’ll build a web scraper that will let the user enter a search term and then display the stats for all teams matching the search.

Before you get your hands dirty with coding, spend some time on the web page. Specifically, you need to know how to find the elements you want in the HTML source code using CSS selectors.

Your first task is to get familiar with the way the web page works. Since you want to perform a search, you need to figure out what happens on the web page when you do so. Go ahead and enter a search. You’ll see that it redirects to a URL that looks like this:

https://www.scrapethissite.com/pages/forms/?q=

This tells you the URL you need to scrape. Let’s get started!

To follow along with the tutorial, create a directory:

mkdir nokogiri-demo
cd nokogiri-demo

Create a file called scrape.rb. Import the nokogiri and open-uri modules. The open-uri module is a convenience wrapper around the net/http module.

require 'nokogiri'
require 'open-uri'

Add the following lines to input the search term from the user:

puts "Enter a search term"
search_term = gets.chomp

Once you have the search term, you need to make a request to https://www.scrapethissite.com/pages/forms/?q= . The URI.open method makes the request and returns the HTML of the page, which is then passed to Nokogiri::HTML5, which returns a Document :

doc = Nokogiri::HTML5(URI.open("https://www.scrapethissite.com/pages/forms/?q=#{search_term}"))
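One caveat: a search term containing spaces or special characters needs to be URL-encoded before it's interpolated into the query string. A small sketch using URI.encode_www_form_component from Ruby's standard uri library:

```ruby
require 'uri'

# A search term with a space must be encoded before it goes in the URL
search_term = "new york"
url = "https://www.scrapethissite.com/pages/forms/?q=" \
      "#{URI.encode_www_form_component(search_term)}"

puts url  # => https://www.scrapethissite.com/pages/forms/?q=new+york
```

Single-word searches like the ones in this tutorial work either way, but encoding keeps the script from breaking on arbitrary input.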

To extract text from a particular element on the page, you first need a way to select the element. Using a CSS selector is the easiest way to go about this.

To look at the source code of the page, press CTRL+Shift+I to open up the developer console and click on the tiny mouse pointer:

Developer console

Developer console unannotated

You can now hover your mouse over any element, and it will be highlighted in the HTML source. Hover over a single team to find out its HTML structure.

The team element

From this, you can infer the CSS selector required to select the data: tr.team . The css method takes the CSS selector and returns a NodeSet of elements matching the query. Then you can simply iterate over the elements using the each method.

Once you have the element, you can extract its text using the content method:

doc.css("tr.team").each do |team|
  puts team.at(".name").content.strip
end

Run the code using ruby scrape.rb , and enter a search term when prompted. You should see the names of the teams matching the search query:

Search results

Let’s go one step further and extract more information. How about the total wins of each team?

require 'nokogiri'
require 'open-uri'

puts "Enter a search term"
search_term = gets.chomp

doc = Nokogiri::HTML5(URI.open("https://www.scrapethissite.com/pages/forms/?q=#{search_term}"))

teams = {}
doc.css("tr.team").each do |team|
  team_name = team.at(".name").content.strip
  team_wins = team.at(".wins").content.to_i
  teams[team_name] ||= 0
  teams[team_name] += team_wins
end

teams.each do |team_name, team_wins|
  puts "#{team_name} => #{team_wins}"
end

Here, a hash named teams has been defined to store the scores. For each row, the ||= operator initializes the team's score to 0 if the team name isn't in the hash yet; the row's win count is then added to the running total.
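The same tallying pattern can also be written with a hash whose missing keys default to 0, which drops the explicit initialization. A small sketch with made-up numbers:

```ruby
# Hypothetical row data: [team name, wins in one season]
rows = [["Boston Bruins", 44], ["Boston Bruins", 36], ["Buffalo Sabres", 31]]

# Pattern used in the article: initialize on first sight with ||=
teams = {}
rows.each do |name, wins|
  teams[name] ||= 0
  teams[name] += wins
end

# Equivalent: Hash.new(0) returns 0 for any missing key
totals = Hash.new(0)
rows.each { |name, wins| totals[name] += wins }

puts teams  # => {"Boston Bruins"=>80, "Buffalo Sabres"=>31}
```

Both produce the same result; Hash.new(0) is slightly more idiomatic for counters, while ||= makes the initialization explicit.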

Run the script and enter a search term. You will see the total wins of each team:

Output of the script

Handling Pagination

Your code can now successfully extract text from the web page. But this is just the first page. Let’s take the script to the next level by adding the ability to handle pagination.

The code starts similarly—by importing the required modules, prompting the user for a search term, and making a request to the first page:

require 'nokogiri'
require 'open-uri'

puts "Enter a search term"
search_term = gets.chomp

doc = Nokogiri::HTML5(URI.open("https://www.scrapethissite.com/pages/forms/?q=#{search_term}"))

Next, you need to find out how many pages there are. Look at the pagination links at the bottom of the page: they're contained in a ul tag with the class pagination .

The last link in the pagination is the "Next" button, so the second-to-last link holds the highest page number. The following code extracts that element and converts its text to an integer:

pages = doc.css(".pagination a")[-2].content.to_i 

As before, create a teams hash:

teams = {}

Then create a loop that will run from 1 to pages :

(1..pages).each do |page_num|
  puts "Scraping page #{page_num}"
end

This time, the URL will be slightly different. If you visit any page (other than the first one) in your browser, you’ll see a page_num query parameter added to the URL. This parameter is the page number you want to view. You’ll land on the first page if you set page_num to 1 . This is a huge relief, as you don’t need to handle the first page differently.

Add the following code to the each loop. Note that this code is the same as what you did previously. The only difference is that this time, it is executed for each page.

(1..pages).each do |page_num|
  puts "Scraping page #{page_num}"
  doc = Nokogiri::HTML5(URI.open("https://www.scrapethissite.com/pages/forms/?q=#{search_term}&page_num=#{page_num}"))
  doc.css("tr.team").each do |team|
    team_name = team.at(".name").content.strip
    team_wins = team.at(".wins").content.to_i
    teams[team_name] ||= 0
    teams[team_name] += team_wins
  end
end

teams.each do |team_name, team_wins|
  puts "#{team_name} => #{team_wins}"
end
Here is the complete script:

require 'nokogiri'
require 'open-uri'

puts "Enter a search term"
search_term = gets.chomp

doc = Nokogiri::HTML5(URI.open("https://www.scrapethissite.com/pages/forms/?q=#{search_term}"))
pages = doc.css(".pagination a")[-2].content.to_i

teams = {}
(1..pages).each do |page_num|
  puts "Scraping page #{page_num}"
  doc = Nokogiri::HTML5(URI.open("https://www.scrapethissite.com/pages/forms/?q=#{search_term}&page_num=#{page_num}"))
  doc.css("tr.team").each do |team|
    team_name = team.at(".name").content.strip
    team_wins = team.at(".wins").content.to_i
    teams[team_name] ||= 0
    teams[team_name] += team_wins
  end
end

teams.each do |team_name, team_wins|
  puts "#{team_name} => #{team_wins}"
end

Run the code, and you’ll see that it now scrapes each page and provides each team’s total number of wins:

Output of the paginated script

To check out the final code, you can access this GitHub repository.

Conclusion

As you’ve seen, web scraping is a powerful tool that you can use to extract text from a web page. You can use it in the absence of an API, or you can use it to overcome an API’s limitations. In this tutorial, you learned how to use the Nokogiri gem to scrape a web page and extract text.

Even though Nokogiri is lightweight and simple, there are some use cases that are outside its scope. For example, if you’re scraping a dynamic web page that runs JavaScript, Nokogiri will not be able to do the job because it can’t execute JavaScript. In that case, a sophisticated tool like Watir may be more suitable.

If you’re looking for a fast, powerful, and easy-to-use web scraping tool, be sure to check out ScrapingBee. It handles headless browsers at scale and rotates proxies, giving you an unmatched and uninterrupted web scraping experience.


Aniket is a student doing a Master's in Mathematics and has a passion for computers and software. He likes to explore various areas related to coding and works as a web developer using Ruby on Rails and Vue.js.

