HOWTO scrape websites with Ruby & Headless Firefox

Web scraping is an approach for extracting data from websites that don’t have an API. This tutorial will show you how to scrape websites with Ruby and Headless Firefox, using Selenium WebDriver.


Getting started

First, make sure you have the prerequisites installed. Download and install the Firefox browser, then download and install the geckodriver binary (if you use Homebrew, run brew install geckodriver). Finally, install the selenium-webdriver gem:

gem install selenium-webdriver

(If you already have the gem installed, make sure it is version 3.4.1 or higher.)

Then copy and paste the following Ruby script:

require 'selenium-webdriver'

options = Selenium::WebDriver::Firefox::Options.new(args: ['-headless'])

driver = Selenium::WebDriver.for(:firefox, options: options)

driver.get('http://stackoverflow.com/')

puts driver.title

driver.quit

Run the Ruby script, and if everything is working correctly you should see the page title of the Stack Overflow website. This demonstrates the core of web scraping: requesting a page over HTTP and extracting data from its HTML. Try changing the URL to a different website.
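
The scripts in this tutorial call driver.quit on their last line, so an exception raised earlier in the script would leave a headless Firefox process running. One way to guard against that is sketched below. The helper name with_driver is hypothetical (it is not part of selenium-webdriver), and it takes any callable that returns a driver:

```ruby
# Hypothetical helper (not part of selenium-webdriver): builds a driver by
# calling `factory`, yields it to the block, and always quits it afterwards.
def with_driver(factory)
  driver = factory.call
  yield driver
ensure
  # Runs even if the block raises; `&.` skips quit when the factory itself failed.
  driver&.quit
end

# Usage with the real browser (requires Firefox, geckodriver and the gem):
#
#   require 'selenium-webdriver'
#   options = Selenium::WebDriver::Firefox::Options.new(args: ['-headless'])
#   with_driver(-> { Selenium::WebDriver.for(:firefox, options: options) }) do |driver|
#     driver.get('http://stackoverflow.com/')
#     puts driver.title
#   end
```

The helper returns whatever the block returns, so it can wrap any of the scripts in this tutorial without changing their logic.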

Extracting data

This next example extracts the title of the most recent post on the Rails blog:

require 'selenium-webdriver'

options = Selenium::WebDriver::Firefox::Options.new(args: ['-headless'])

driver = Selenium::WebDriver.for(:firefox, options: options)

driver.get('http://weblog.rubyonrails.org/')

element = driver.find_element(css: 'article header h2')

puts element.text.strip

driver.quit

The #find_element method returns the first element matching the given CSS selector, and the #text method returns the textual content inside the element, in this case the title of the latest blog entry. Similarly, you can use #find_elements to select every matching element.

Try changing the selector to see what other elements you can pick out of the page.
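
To pick out several elements at once, #find_elements returns an array of elements rather than a single one. The sketch below wraps it in a hypothetical helper named texts_for (an illustration, not a Selenium method) that collects the text of every match:

```ruby
# Hypothetical helper: returns the stripped text of every element matching
# `selector`, skipping elements whose text is empty.
# `driver` is anything that responds to #find_elements like a Selenium driver.
def texts_for(driver, selector)
  driver.find_elements(css: selector)
        .map { |element| element.text.strip }
        .reject(&:empty?)
end

# Usage, with `driver` set up as in the script above:
#
#   texts_for(driver, 'article header h2').each { |title| puts title }
```

Unlike #find_element, #find_elements returns an empty array when nothing matches, so the helper returns an empty array in that case too.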

Following links

This next example follows the "Random article" link on the Wikipedia main page:

require 'selenium-webdriver'

options = Selenium::WebDriver::Firefox::Options.new(args: ['-headless'])

driver = Selenium::WebDriver.for(:firefox, options: options)

driver.get('http://en.wikipedia.org/wiki/Main_Page')

driver.find_element(link_text: 'Random article').click

puts driver.current_url

driver.quit

The #find_element method is called with the link_text option to select the link. The #click method fires the click event on the link element just as it would if you clicked it manually, and the #current_url method returns the address of the current page.
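
When a script follows several links, it can help to record where each click led. Below is a minimal sketch that uses only the #find_element, #click and #current_url calls described above. The helper name follow_link and the shape of its return value are assumptions made for illustration:

```ruby
# Hypothetical helper: clicks the link with the given visible text and
# returns the address before and after the click.
def follow_link(driver, text)
  before = driver.current_url
  driver.find_element(link_text: text).click
  { from: before, to: driver.current_url }
end

# Usage, with `driver` already on the Wikipedia main page:
#
#   hop = follow_link(driver, 'Random article')
#   puts "#{hop[:from]} -> #{hop[:to]}"
```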

Filling in a form

Next, let’s look at how to fill out a form by writing a script that searches the GOV.UK website for articles about passports:

require 'selenium-webdriver'

options = Selenium::WebDriver::Firefox::Options.new(args: ['-headless'])

driver = Selenium::WebDriver.for(:firefox, options: options)

driver.get('https://www.gov.uk/')

element = driver.find_element(name: 'q')

element.send_keys('passport')

element.submit

results = driver.find_element(id: 'results')

results.find_elements(tag_name: 'h3').each do |h3|
  puts h3.text.strip
end

driver.quit

This time #find_element is used to select the text input based on its name attribute. The #send_keys method enters the given value into the input just as if we had typed it in a web browser directly. Then the #submit method submits the form, and the subsequent code lists the top results on the following page.
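
If the results page is slow to load, or renders its results with JavaScript, the results element may not exist yet when the script looks for it. Selenium provides Selenium::WebDriver::Wait for this situation. The plain-Ruby sketch below shows the same polling idea; the name wait_until and its default timeout and interval are assumptions, not part of Selenium:

```ruby
# Hypothetical helper: calls the block repeatedly until it returns a truthy
# value, then returns that value. Errors raised by the block (for example
# Selenium's "no such element" error) are treated as "not ready yet".
def wait_until(timeout: 10, interval: 0.2)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    begin
      result = yield
      return result if result
    rescue StandardError
      # Condition not met yet; fall through and retry.
    end
    if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      raise "condition not met within #{timeout} seconds"
    end
    sleep interval
  end
end

# Usage in the form script, replacing the direct lookup of the results:
#
#   element.submit
#   results = wait_until { driver.find_element(id: 'results') }
```

Polling against a deadline keeps the script from hanging forever when the element never appears, while still returning as soon as it does.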

Executing JavaScript

Selenium can be used to execute JavaScript, which means you can interact with popup alert dialogs, search interfaces, and text editors, simulate drag & drop, and much more.

For example, here’s how to detect which version of jQuery a website is using:

require 'selenium-webdriver'

options = Selenium::WebDriver::Firefox::Options.new(args: ['-headless'])

driver = Selenium::WebDriver.for(:firefox, options: options)

driver.get('http://jquery.com/')

version = driver.execute_script('return jQuery.fn.jquery')

puts "Using jQuery #{version}"

driver.quit

Simply call the #execute_script method with the JavaScript you want to run, using a return statement to pass a value back. JavaScript values like null, undefined, true, false, strings, arrays, and objects are all converted to their Ruby equivalents.
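
The jQuery example assumes the page defines a global jQuery object; on a page without it, the script would raise a JavaScript error. Below is a minimal sketch that returns null from JavaScript in that case, which arrives in Ruby as nil. The helper name jquery_version is hypothetical:

```ruby
# Hypothetical helper: returns the jQuery version string, or nil when the
# page does not define a global jQuery object.
def jquery_version(driver)
  driver.execute_script('return window.jQuery ? window.jQuery.fn.jquery : null')
end

# Usage, with `driver` already on the page to inspect:
#
#   version = jquery_version(driver)
#   puts(version ? "Using jQuery #{version}" : 'jQuery not found')
```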