Web scraping is an approach for extracting data from websites that don’t have an API. This tutorial will show you how to scrape websites with Ruby and Ferrum, a Ruby gem for connecting to Chromium or Google Chrome browsers using the Chrome DevTools Protocol.
Getting started
Download and install either Chromium or Google Chrome, and install the ferrum gem:
gem install ferrum
Ferrum should automatically detect browser binaries in your PATH. If you have problems, you can set either the BROWSER_PATH environment variable or the browser_path option explicitly.
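If detection fails and you aren't sure where the browser lives, a small helper can search PATH for you. This is only a sketch: find_browser_binary and the candidate names are hypothetical, not part of Ferrum, and the binary may be named differently on your system.

```ruby
# Candidate executable names are assumptions; adjust them for your platform.
CANDIDATE_BINARIES = %w[chromium chromium-browser google-chrome chrome].freeze

# Return the first matching executable found on PATH, or nil if none is found.
def find_browser_binary(names = CANDIDATE_BINARIES, path = ENV.fetch('PATH', ''))
  path.split(File::PATH_SEPARATOR).each do |dir|
    next if dir.empty?

    names.each do |name|
      candidate = File.join(dir, name)
      return candidate if File.file?(candidate) && File.executable?(candidate)
    end
  end
  nil
end

# Usage with Ferrum (requires the gem and an installed browser):
#   require 'ferrum'
#   browser = Ferrum::Browser.new(browser_path: find_browser_binary)
```

The result can be passed as the browser_path option or exported as BROWSER_PATH before running your script.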
Then copy and paste the following Ruby script:
require 'ferrum'
browser = Ferrum::Browser.new
browser.goto('http://stackoverflow.com/')
puts browser.current_title
browser.quit
Run the Ruby script, and if everything is working correctly you should see the page title of the Stack Overflow website. This demonstrates the core of web scraping: loading a page over HTTP and extracting data from its HTML. Try changing the URL to a different website.
Extracting data
This next example extracts the title of the most recent post on the Rails blog:
require 'ferrum'
browser = Ferrum::Browser.new
browser.goto('http://weblog.rubyonrails.org/')
title = browser.at_css('article header h2')
puts title.text.strip
browser.quit
The #at_css method selects a single element based on the given CSS selector, and the #text method returns the textual content inside the element, in this case the title of the latest blog entry. Similarly you can use the #css method to select multiple elements.
Try changing the selector to see what other elements you can pick out of the page.
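To illustrate #css, here is a minimal sketch that collects the text of every matching element rather than just the first. The texts_for helper is hypothetical (not part of Ferrum); it only assumes the object you pass responds to #css and that each node responds to #text.

```ruby
# Collect the stripped text of every element matching a CSS selector.
# `page` is any object that responds to #css, such as a Ferrum::Browser.
def texts_for(page, selector)
  page.css(selector).map { |node| node.text.strip }
end

# Usage with Ferrum (requires the gem and an installed browser):
#   require 'ferrum'
#   browser = Ferrum::Browser.new
#   browser.goto('http://weblog.rubyonrails.org/')
#   puts texts_for(browser, 'article header h2')
#   browser.quit
```

Keeping the helper independent of the browser object also makes it easy to exercise with stand-in objects.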
Following links
This next example follows the "Random article" link on the Wikipedia main page:
require 'ferrum'
browser = Ferrum::Browser.new
browser.goto('http://en.wikipedia.org/wiki/Main_Page')
browser.css('a[href]').find { |link| link.text == 'Random article' }.click
puts browser.current_url
browser.quit
The #click method fires the click event on the link element just as it would if you clicked it manually, and the #current_url method returns the address of the current page.
Ferrum doesn’t have any methods for searching by element text, so the example instead loops through all the links on the page and performs the filtering in Ruby.
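One possible alternative is to let the browser do the text matching with XPath, since Ferrum also provides #at_xpath and #xpath alongside the CSS methods. The sketch below builds such an expression; xpath_literal and link_text_xpath are hypothetical helper names, and the expression hasn't been checked against Wikipedia's current markup.

```ruby
# Quote a string as an XPath 1.0 literal. XPath has no escape character,
# so text containing both quote styles has to be assembled with concat().
def xpath_literal(text)
  return "'#{text}'" unless text.include?("'")
  return "\"#{text}\"" unless text.include?('"')

  parts = text.split("'", -1).map { |part| "'#{part}'" }
  "concat(#{parts.join(%q{, "'", })})"
end

# Build an XPath expression matching links whose trimmed text equals `text`.
def link_text_xpath(text)
  "//a[normalize-space(.) = #{xpath_literal(text)}]"
end

# Usage with Ferrum (requires the gem and an installed browser):
#   link = browser.at_xpath(link_text_xpath('Random article'))
#   link.click if link
```

Filtering in Ruby, as in the example above, is arguably easier to read; XPath mainly helps when a page has a very large number of links.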
Filling in a form
Next let’s look at an example of how to fill out a form, by writing a script to search the GOV.UK website for articles about passports:
require 'ferrum'
browser = Ferrum::Browser.new
browser.goto('https://www.gov.uk/')
element = browser.at_css('input[name=q]')
element.focus
element.type('passport', :enter)
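begin
  results = browser.at_css('[id=js-results]')
  # at_css returns nil until the element exists, so wait briefly and look again
  while results.nil?
    sleep 0.01
    results = browser.at_css('[id=js-results]')
  end
  results.css('li a').each do |link|
    puts link.text.strip
  end
rescue Ferrum::NodeNotFoundError
  # the node is no longer in the page (for example, it was re-rendered); start over
  sleep 0.01
  retry
end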
browser.quit
The #type method enters the given string into the input element, just as if we had been using a web browser directly. The :enter argument then presses the Enter key to submit the form.
The subsequent code lists out the top results on the following page. Ferrum doesn’t have implicit wait functionality like Capybara, so if your code depends on elements rendered by JavaScript you may need to implement some wait logic like this yourself.
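If you need this in more than one place, the wait logic can be pulled out into a reusable helper with a timeout, so a missing element doesn't leave the script looping forever. This is a minimal sketch: wait_for and WaitTimeout are hypothetical names, not part of Ferrum, and the default timeout and interval are arbitrary.

```ruby
# Raised when the block never returns a truthy value within the timeout.
class WaitTimeout < StandardError; end

# Call the block repeatedly until it returns a truthy value, then return it.
# Exception classes listed in `ignore` are treated as "not ready yet".
def wait_for(timeout: 5, interval: 0.05, ignore: [])
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    begin
      result = yield
      return result if result
    rescue *ignore
      # swallow the listed errors and try again
    end
    if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      raise WaitTimeout, "condition not met within #{timeout} seconds"
    end

    sleep interval
  end
end

# Usage with Ferrum (requires the gem and an installed browser):
#   results = wait_for(ignore: [Ferrum::NodeNotFoundError]) do
#     browser.at_css('[id=js-results]')
#   end
```

Because the block's value is returned, the helper works for any condition, not just element lookups.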
Executing JavaScript
Ferrum can be used to execute JavaScript, which means you can interact with popup alert dialogs, search interfaces, and text editors, simulate drag and drop, and much more.
For example, here’s how to detect which version of jQuery a website is using:
require 'ferrum'
browser = Ferrum::Browser.new
browser.goto('http://jquery.com/')
if browser.evaluate('typeof(jQuery)') == 'undefined'
  puts 'Not using jQuery'
else
  version = browser.evaluate('jQuery.fn.jquery')
  puts "Using jQuery #{version}"
end
browser.quit
Simply call the #evaluate method with the JavaScript expression you want to evaluate. JavaScript values like null, undefined, true, false, strings, arrays, and objects are all converted to their Ruby equivalents.
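Since null comes back as a Ruby equivalent, the two calls in the jQuery example could in principle be folded into a single expression that yields either the version or null. The sketch below only builds the JavaScript string; version_probe_js is a hypothetical helper, and it assumes null arrives in Ruby as nil.

```ruby
# Build a JavaScript expression that evaluates to a library's version string,
# or to null when the library's global object is not defined on the page.
def version_probe_js(global, version_path)
  "typeof #{global} === 'undefined' ? null : #{version_path}"
end

# Usage with Ferrum (requires the gem and an installed browser):
#   version = browser.evaluate(version_probe_js('jQuery', 'jQuery.fn.jquery'))
#   puts version ? "Using jQuery #{version}" : 'Not using jQuery'
```

Only pass trusted, hard-coded names to a helper like this, because the arguments are interpolated directly into JavaScript.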