HOWTO scrape websites with Ruby & Watir

Web scraping is an approach for extracting data from websites that don’t have an API. Web scraping code is inherently “brittle” (prone to breaking over time due to changes in the website content and structure), but it’s a flexible technique with a broad range of uses.

This tutorial will show you how to scrape websites with Ruby and the Watir gem. Watir is powered by Selenium, which means that, unlike Mechanize, you can use it to interact with pages that rely on JavaScript.


Getting started

Firstly, make sure you have the watir gem installed:

gem install watir

Then copy and paste the following short Ruby script:

require 'watir'

browser = Watir::Browser.new

browser.goto('http://stackoverflow.com/')

puts browser.title

browser.close

Run the script with Ruby, and you should see the page title of the Stack Overflow website printed out. This demonstrates the core of web scraping: fetching pages over HTTP and extracting data from the HTML. Try changing the URL to a different website.
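Because scraping code is brittle, a script can fail partway through and leave a browser window open. One way to guard against that is to wrap the session in a small helper that always closes the browser. This is a minimal sketch; the with_browser name is made up for this tutorial and is not part of Watir.

```ruby
# Hypothetical helper (not part of Watir): yields the given browser to the
# block and closes it afterwards, even if the block raises an error.
def with_browser(browser)
  yield browser
ensure
  browser.close
end

# Usage with a real browser (requires the watir gem and a browser driver):
#
#   require 'watir'
#
#   with_browser(Watir::Browser.new) do |browser|
#     browser.goto('http://stackoverflow.com/')
#     puts browser.title
#   end
```

The helper returns whatever the block returns, so it can also be used to hand scraped values back to the caller.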

Extracting data

This next example extracts the title of the most recent post on the Rails blog:

require 'watir'

browser = Watir::Browser.new

browser.goto('http://weblog.rubyonrails.org/')

puts browser.element(css: 'article header h2').text.strip

browser.close

The #element method selects a single element based on the given selector or attributes. You can use the #elements method if you need to select multiple elements.

Watir also provides convenience methods for selecting elements based on tag names. For example, browser.h2 would select the first h2 element on the page, and browser.forms would select all the form elements on the page.
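To illustrate #elements, here is a small sketch of a helper that collects the text of every element matching a CSS selector. The element_texts name is an assumption made for this example; it only relies on the #elements and #text methods already used in this tutorial.

```ruby
# Hypothetical helper: returns the stripped text of every element on the
# current page that matches the given CSS selector.
def element_texts(browser, selector)
  browser.elements(css: selector).map { |element| element.text.strip }
end

# Usage with a real browser (requires the watir gem and a browser driver):
#
#   require 'watir'
#
#   browser = Watir::Browser.new
#   browser.goto('http://weblog.rubyonrails.org/')
#   puts element_texts(browser, 'article header h2')
#   browser.close
```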

Following links

This next example follows the "Random article" link on the Wikipedia main page:

require 'watir'

browser = Watir::Browser.new

browser.goto('http://en.wikipedia.org/wiki/Main_Page')

browser.link(text: 'Random article').click

browser.wait_until { browser.h1.text != 'Main Page' }

puts browser.url

browser.close

The #link convenience method is used to find the link based on its text.

The #click method fires the click event on the link element just as it would if you clicked it manually, and the browser #url method returns the address of the current page.

The Watir API is asynchronous and non-blocking in places, which means that you may need to explicitly wait for elements or page changes before you interact with them. In this example the #wait_until method is used to wait until the page heading (the h1 element) has changed. Without that line, the address of the starting page may be printed out before the click event and the load of the new page have had time to complete. Try it!
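The idea behind waiting can be sketched in plain Ruby: call a block repeatedly until it returns a truthy value, and give up after a timeout. This is only a simplified illustration of the concept, not Watir's actual implementation; the wait_for method, the WaitTimeout error, and the default timeout and interval values are all made up for this example.

```ruby
# Hypothetical error raised when the condition is never met.
class WaitTimeout < StandardError; end

# Simplified polling loop (not Watir's implementation): calls the block
# every `interval` seconds until it returns a truthy value, and raises
# WaitTimeout once `timeout` seconds have passed.
def wait_for(timeout: 5, interval: 0.1)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout

  loop do
    result = yield
    return result if result

    if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      raise WaitTimeout, "condition not met within #{timeout} seconds"
    end

    sleep interval
  end
end

# Usage with the example above would look like:
#
#   wait_for { browser.h1.text != 'Main Page' }
```

In real scripts the built-in #wait_until method is the better choice; the sketch is only meant to show why the script pauses at that line.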

Filling in a form

Next let’s look at an example of how to fill out a form, by writing a script to search the GOV.UK website for articles about passports:

require 'watir'

browser = Watir::Browser.new

browser.goto('https://www.gov.uk/')

browser.input(name: 'q').send_keys('passport')

browser.input(value: 'Search').click

browser.div(id: 'results').h3s.each do |h3|
  puts h3.text.strip
end

browser.quit

The #input method is used to find the text input for the query based on its name, and then the #send_keys method enters the given value into the input just as if we had typed it into a web browser directly. Then we submit the form by clicking the Search button and list out the top results.

Executing JavaScript

Watir can be used to execute JavaScript, which means you can interact with popup alert dialogs, search interfaces, and text editors, simulate drag & drop, and much more.

For example, here’s how to detect which version of jQuery a website is using:

require 'watir'

browser = Watir::Browser.new

browser.goto('http://jquery.com/')

version = browser.execute_script('return jQuery.fn.jquery')

puts "Using jQuery #{version}"

browser.quit

Simply call the #execute_script method with the JavaScript expression you want to evaluate. JavaScript values like null, undefined, true, false, strings, arrays, and objects are all converted to their Ruby equivalents.
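The value conversion is handy for pages that may not load jQuery at all. The sketch below wraps the check in a helper that returns nil in that case, since a JavaScript null comes back as a Ruby nil. The jquery_version name is an assumption made for this example.

```ruby
# Hypothetical helper: returns the jQuery version string used by the
# current page, or nil if the page does not define window.jQuery.
def jquery_version(browser)
  browser.execute_script(
    'return window.jQuery ? window.jQuery.fn.jquery : null'
  )
end

# Usage with a real browser (requires the watir gem and a browser driver):
#
#   require 'watir'
#
#   browser = Watir::Browser.new
#   browser.goto('http://jquery.com/')
#   version = jquery_version(browser)
#   puts version ? "Using jQuery #{version}" : 'jQuery not found'
#   browser.quit
```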


Next steps to learn more…