HOWTO scrape websites with Ruby & Mechanize

Web scraping is an approach for extracting data from websites that don’t have an API. Web scraping code is inherently “brittle” (prone to breaking over time due to changes in the website content and structure), but it’s a flexible technique with a broad range of uses.

This tutorial will show you how to scrape websites with Ruby and the mechanize gem. Note that web scraping may be against the terms of use of some websites—please be a good web citizen and check first.

Getting started

Firstly, make sure you have the mechanize gem installed:

gem install mechanize

Then copy and paste the following short Ruby script:

require 'mechanize'

mechanize = Mechanize.new

page = mechanize.get('https://stackoverflow.com/')

puts page.title

Run the script with Ruby, and you should see the page title of the Stack Overflow website. This demonstrates the core features of mechanize: making HTTP requests, and extracting data from the HTML that comes back. Try changing the URL to a different website.

Extracting data

Mechanize uses the nokogiri gem internally to parse HTML responses. For this next example we’ll look at how to use a couple of the methods provided by nokogiri to extract the title of the most recent blog post on the Rails blog:

require 'mechanize'

mechanize = Mechanize.new

page = mechanize.get('https://weblog.rubyonrails.org/')

puts page.at('article header h2').text.strip

The #at method takes a CSS selector string and returns the first matching node, which in this example is an h2 element. The #text method returns the textual content inside the element, in this case the title of the blog post.

Try changing the selector to see what other elements you can pick out of the page.

Following links

A key feature of web crawling, and the web in general, is following links from one page to another. In this next example we’ll follow a link on the Wikipedia main page which redirects to a random article:

require 'mechanize'

mechanize = Mechanize.new

page = mechanize.get('https://en.wikipedia.org/wiki/Main_Page')

link = page.link_with(text: 'Random article')

page = link.click

puts page.uri

The #link_with method is provided by mechanize, and makes it easy to pull out the random article link. The #click method instructs mechanize to follow the link, and the #uri method returns the address of the page. Notice that mechanize follows redirects automatically, so this example makes three HTTP requests in total.

Filling in a form

Lastly we’ll look at an example of how to fill out a form, by writing a script to search the GOV.UK website for articles about passports:

require 'mechanize'

mechanize = Mechanize.new

page = mechanize.get('https://www.gov.uk/')

form = page.forms.first

form['q'] = 'passport'

page = form.submit

page.search('#js-results li a').each do |element|
  puts element.text.strip
end

Rather than locating the form with a CSS selector, we take the first form on the page and set the value of its search field (named q), just as if we were using a web browser. Then we submit the form and print the text of each result link. Note that this relies on the search form being the first form on the page.

Exercises

To learn more, try fiddling with the examples to get an understanding of the different methods provided by mechanize and nokogiri. Some suggestions for other things to try:

Write a script to count the number of links in a web page
Write a script to calculate the most popular keywords on a web page
Write a script to simulate a user logging in to your local Rails app
Use the microformats2 gem to parse the microformats2 examples
Adapt the search form example to search Wikipedia or Amazon