Web scraping is an approach for extracting data from websites that don’t have an API. Web scraping code is inherently “brittle” (prone to breaking over time due to changes in the website content and structure), but it’s a flexible technique with a broad range of uses.
This tutorial will show you how to scrape websites with Ruby and the mechanize gem. Note that web scraping may be against the terms of use of some websites—please be a good web citizen and check first.
Getting started
Firstly, make sure you have the mechanize gem installed:
gem install mechanize
Then copy and paste the following short Ruby script:
require 'mechanize'
mechanize = Mechanize.new
page = mechanize.get('http://stackoverflow.com/')
puts page.title
Run the script with Ruby, and you should see the page title of the Stack Overflow website. This demonstrates the core features of mechanize: making HTTP requests, and extracting data from HTML. Try changing the URL to a different website.
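If you experiment with other URLs, note that mechanize raises an exception rather than returning a page when the response isn't a success. Here's a minimal sketch, with a placeholder URL, that handles a failed request and also asks mechanize to honour robots.txt (part of being a good web citizen):

require 'mechanize'

mechanize = Mechanize.new
mechanize.robots = true  # obey robots.txt; disallowed URLs raise an error

begin
  page = mechanize.get('http://example.com/might-not-exist')
  puts page.title
rescue Mechanize::ResponseCodeError => e
  puts "Request failed with status #{e.response_code}"
rescue Mechanize::RobotsDisallowedError
  puts 'Fetching that path is disallowed by robots.txt'
end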
Extracting data
Mechanize uses the nokogiri gem internally to parse HTML responses. For this next example we’ll look at how to use a couple of the methods provided by nokogiri to extract the title of the most recent blog post on the Rails blog:
require 'mechanize'
mechanize = Mechanize.new
page = mechanize.get('http://weblog.rubyonrails.org/')
puts page.at('article header h2').text.strip
The #at method takes a CSS selector string and returns the first matching node, which in this example is an h2 element. The #text method returns the textual content inside the element, in this case the title of the blog post.
Try changing the selector to see what other elements you can pick out of the page.
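Where #at returns only the first match, the related #search method returns every matching node. As a rough sketch, assuming the blog keeps the same article header h2 structure, this lists all the post titles on the front page:

require 'mechanize'

mechanize = Mechanize.new
page = mechanize.get('http://weblog.rubyonrails.org/')

# #search returns every node matching the selector, not just the first
page.search('article header h2').each do |heading|
  puts heading.text.strip
end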
Following links
A key feature of web crawling, and the web in general, is following links from one page to another. In this next example we’ll follow a link on the Wikipedia main page which redirects to a random article:
require 'mechanize'
mechanize = Mechanize.new
page = mechanize.get('http://en.wikipedia.org/wiki/Main_Page')
link = page.link_with(text: 'Random article')
page = link.click
puts page.uri
The #link_with method is provided by mechanize, and makes it easy to pull out the random article link. The #click method instructs mechanize to follow the link, and the #uri method returns the address of the page. Notice that mechanize follows redirects automatically, so this example makes three HTTP requests in total.
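To work with more than one link at a time, the #links method returns every anchor on the page, and #links_with filters them by the same criteria as #link_with. For instance, this sketch (which assumes Wikipedia's /wiki/ URL scheme for articles) prints the first ten article links:

require 'mechanize'

mechanize = Mechanize.new
page = mechanize.get('http://en.wikipedia.org/wiki/Main_Page')

# #links_with accepts the same criteria as #link_with, including regexps,
# but returns all of the matching links
page.links_with(href: %r{\A/wiki/}).first(10).each do |link|
  puts "#{link.text} -> #{link.href}"
end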
Filling in a form
Lastly we’ll look at an example of how to fill out a form, by writing a script to search the GOV.UK website for articles about passports:
require 'mechanize'
mechanize = Mechanize.new
page = mechanize.get('https://www.gov.uk/')
form = page.forms.first
form['q'] = 'passport'
page = form.submit
page.search('#js-results li a').each do |element|
  puts element.text.strip
end
Instead of searching by CSS selector, we pick the first form on the page and set the value of the search field, just as if we were using a web browser directly. Then we submit the form and list out the top results.
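When the first form isn't the one you need, a quick way to work out which form to pick, and which field to set, is to print each form's action and field names before writing the rest of the script:

require 'mechanize'

mechanize = Mechanize.new
page = mechanize.get('https://www.gov.uk/')

# Print each form's action and the names of its input fields
page.forms.each do |form|
  puts "Form action: #{form.action}"
  form.fields.each { |field| puts "  field: #{field.name}" }
end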
Exercises
To learn more, try fiddling with the examples to get an understanding of the different methods provided by mechanize and nokogiri. Some suggestions for other things to try:
- Write a script to count the number of links in a web page (a starter sketch follows this list)
- Write a script to calculate the most popular keywords on a web page
- Write a script to simulate a user logging in to your local Rails app
- Use the microformats2 gem to parse the microformats2 examples
- Adapt the search form example to search Wikipedia or Amazon
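As a starting point for the first exercise, here's a minimal sketch that counts the links on a page, using Stack Overflow again as an example:

require 'mechanize'

mechanize = Mechanize.new
page = mechanize.get('http://stackoverflow.com/')

# Mechanize::Page#links returns one Link object per anchor element
puts "#{page.links.count} links on #{page.uri}"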