HOWTO scrape websites with Ruby & Ferrum

Web scraping is an approach for extracting data from websites that don’t have an API. This tutorial will show you how to scrape websites with Ruby and Ferrum, a Ruby gem for connecting to Chromium or Google Chrome browsers using the Chrome DevTools Protocol.


Getting started

Download and install either Chromium or Google Chrome, and install the ferrum gem:

gem install ferrum

Ferrum should automatically detect browser binaries on your PATH. If it can’t find one, either set the BROWSER_PATH environment variable or pass the browser_path option explicitly.
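If detection fails, one option is to look for the browser yourself and pass the result as browser_path. The following is a minimal sketch, not part of Ferrum: the first_executable helper and the candidate paths are assumptions for illustration, and the real install location varies by system.

```ruby
# Hypothetical helper (not part of Ferrum): returns the first candidate path
# that exists and is executable, or nil when none qualifies.
def first_executable(candidates)
  candidates.compact.find { |path| File.file?(path) && File.executable?(path) }
end

# Example candidate locations. These paths are assumptions; adjust them
# for your own system.
BROWSER_CANDIDATES = [
  ENV['BROWSER_PATH'],
  '/usr/bin/chromium',
  '/usr/bin/google-chrome',
  '/Applications/Google Chrome.app/Contents/MacOS/Google Chrome'
].freeze

# With Ferrum, the result could be used like this:
#   path = first_executable(BROWSER_CANDIDATES)
#   browser = path ? Ferrum::Browser.new(browser_path: path) : Ferrum::Browser.new
```

Falling back to Ferrum::Browser.new with no options keeps Ferrum's own detection in play when none of the candidates exist.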

Then copy and paste the following Ruby script:

require 'ferrum'

browser = Ferrum::Browser.new

browser.goto('http://stackoverflow.com/')

puts browser.current_title

browser.quit

Run the Ruby script, and if everything is working correctly you should see the page title of the Stack Overflow website. This demonstrates the core of web scraping: navigating to a page over HTTP and extracting data from its HTML. Try changing the URL to a different website.

Extracting data

This next example extracts the title of the most recent post on the Rails blog:

require 'ferrum'

browser = Ferrum::Browser.new

browser.goto('http://weblog.rubyonrails.org/')

title = browser.at_css('article header h2')

puts title.text.strip

browser.quit

The #at_css method selects a single element based on the given CSS selector, and the #text method returns the textual content inside the element, in this case the title of the latest blog entry. Similarly, you can use the #css method to select every element that matches a selector.

Try changing the selector to see what other elements you can pick out of the page.
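When #css returns several elements, you will usually want the text of each one. Here is a small sketch of that step, assuming only that each element responds to #text as described above. The stripped_texts helper is hypothetical, not part of Ferrum.

```ruby
# Hypothetical helper (not part of Ferrum): collects the stripped text of
# each node and drops entries that are empty after stripping.
def stripped_texts(nodes)
  nodes.map { |node| node.text.strip }.reject(&:empty?)
end

# With Ferrum, assuming `browser` is the Ferrum::Browser from the example above:
#   puts stripped_texts(browser.css('article header h2'))
```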

Following links

This next example follows the "Random article" link on the Wikipedia main page:

require 'ferrum'

browser = Ferrum::Browser.new

browser.goto('http://en.wikipedia.org/wiki/Main_Page')

browser.css('a[href]').find { |link| link.text == 'Random article' }.click

puts browser.current_url

browser.quit

The #click method fires the click event on the link element just as it would if you clicked it manually, and the #current_url method returns the address of the current page.

Ferrum doesn’t have any methods for searching by element text, so the example instead loops through all the links on the page and performs the filtering in Ruby.
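The one-line version above raises NoMethodError if no link matches, because #find returns nil. A slightly more defensive sketch of the same filtering follows. The find_by_text helper is hypothetical, and it only assumes that each element responds to #text.

```ruby
# Hypothetical helper (not part of Ferrum): returns the first node whose
# text, ignoring surrounding whitespace, equals the given string, or nil.
def find_by_text(nodes, text)
  nodes.find { |node| node.text.strip == text }
end

# With Ferrum, assuming `browser` is the Ferrum::Browser from the example above:
#   link = find_by_text(browser.css('a[href]'), 'Random article')
#   if link
#     link.click
#   else
#     warn 'Link not found'
#   end
```

Stripping whitespace before comparing makes the match tolerant of links whose markup includes leading or trailing spaces.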

Filling in a form

Next let’s look at an example of how to fill out a form, by writing a script to search the GOV.UK website for articles about passports:

require 'ferrum'

browser = Ferrum::Browser.new

browser.goto('https://www.gov.uk/')

element = browser.at_css('input[name=q]')
element.focus
element.type('passport', :enter)

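begin
  # Look up the results container, then print the text of each result link.
  results = browser.at_css('[id=js-results]')

  results.css('li a').each do |element|
    puts element.text.strip
  end
rescue Ferrum::NodeNotFoundError
  # The results may not have rendered yet; pause briefly and run the block again.
  sleep 0.01
  retry
end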

browser.quit

The #type method enters the given string into the input element, just as if we had been using a web browser directly. Passing :enter as the final argument then presses the Enter key to submit the form.

The subsequent code lists out the top results on the following page. Ferrum doesn’t have implicit wait functionality like Capybara, so if your code depends on elements rendered by JavaScript you may need to implement some wait logic like this yourself.
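The retry loop above never gives up, so it will spin forever if the element never appears. One way to bound the wait is a small polling helper with a deadline. This is a sketch, not part of Ferrum: wait_for, WaitTimeoutError, and the default timeout and interval are all assumptions chosen for illustration.

```ruby
# Hypothetical error raised when the condition is not met in time.
class WaitTimeoutError < StandardError; end

# Hypothetical helper (not part of Ferrum): calls the block repeatedly until
# it returns a truthy value, then returns that value. Exceptions whose class
# is listed in `ignore` are treated as "not ready yet". Raises
# WaitTimeoutError once `timeout` seconds have passed.
def wait_for(timeout: 5, interval: 0.1, ignore: [])
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout

  loop do
    begin
      result = yield
      return result if result
    rescue StandardError => e
      # Re-raise anything the caller did not ask us to ignore.
      raise unless ignore.any? { |klass| e.is_a?(klass) }
    end

    if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      raise WaitTimeoutError, "condition not met within #{timeout} seconds"
    end

    sleep interval
  end
end

# With Ferrum, assuming `browser` is the Ferrum::Browser from the example above:
#   results = wait_for(ignore: [Ferrum::NodeNotFoundError]) do
#     browser.at_css('[id=js-results]')
#   end
```

A monotonic clock is used for the deadline so that changes to the system time do not shorten or extend the wait.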

Executing JavaScript

Ferrum can be used to execute JavaScript, which means you can interact with popup alert dialogs, search interfaces, and text editors, simulate drag & drop, and much more.

For example, here’s how to detect which version of jQuery a website is using:

require 'ferrum'

browser = Ferrum::Browser.new

browser.goto('http://jquery.com/')

if browser.evaluate('typeof(jQuery)') == 'undefined'
  puts 'Not using jQuery'
else
  version = browser.evaluate('jQuery.fn.jquery')

  puts "Using jQuery #{version}"
end

browser.quit

Simply call the #evaluate method with the JavaScript expression you want to evaluate. JavaScript values like null, undefined, true, false, strings, arrays, and objects are all converted to their Ruby equivalents.
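Because null is converted to its Ruby equivalent, the two #evaluate calls in the example can be folded into a single expression that returns either the version string or null. Here is a sketch of that idea. The version_probe and describe_library helpers are hypothetical, and the sketch assumes a JavaScript null arrives in Ruby as nil.

```ruby
# Hypothetical helper (not part of Ferrum): builds a JavaScript expression
# that yields the given property of a global object, or null when that
# global is not defined on the page.
def version_probe(global, property)
  "typeof #{global} === 'undefined' ? null : #{global}.#{property}"
end

# Hypothetical helper: turns the evaluated result into a printable message.
def describe_library(name, version)
  version.nil? ? "Not using #{name}" : "Using #{name} #{version}"
end

# With Ferrum, assuming `browser` is the Ferrum::Browser from the example above:
#   version = browser.evaluate(version_probe('jQuery', 'fn.jquery'))
#   puts describe_library('jQuery', version)
```

Only pass trusted names to version_probe, since the arguments are interpolated directly into JavaScript source.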