Web scraping is an approach for extracting data from websites that don’t have an API. Web scraping code is inherently “brittle” (prone to breaking over time due to changes in the website content and structure), but it’s a flexible technique with a broad range of uses.
This tutorial will show you how to scrape websites with Ruby and the Watir gem. Watir is powered by Selenium, which means that unlike Mechanize it drives a real browser, so you can use it to interact with pages that depend on JavaScript.
Getting started
Firstly, make sure you have the watir gem installed:
gem install watir
Then copy and paste the following short Ruby script:
require 'watir'
browser = Watir::Browser.new
browser.goto('http://stackoverflow.com/')
puts browser.title
browser.close
Run the script with Ruby, and you should see the page title of the Stack Overflow website. This demonstrates the core of web scraping: requesting a page over HTTP, and extracting data from the HTML that comes back. Try changing the URL to a different website.
Extracting data
This next example extracts the title of the most recent post on the Rails blog:
require 'watir'
browser = Watir::Browser.new
browser.goto('http://weblog.rubyonrails.org/')
puts browser.element(css: 'article header h2').text.strip
browser.close
The #element method selects a single element based on the given selector or attributes. You can use the #elements method if you need to select multiple elements.
Watir also provides convenience methods for selecting elements based on tag names. For example, browser.h2 would select the first h2 element on the page, and browser.forms would select all the form elements on the page.
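As a rough sketch of how #elements might be used, the helper below collects the text of every matching heading rather than only the first one. The method name heading_texts and its default selector are illustrative assumptions, not part of Watir; the helper only relies on the browser responding to #elements and on each element responding to #text.

```ruby
# Hypothetical helper (not part of Watir): returns the stripped text of every
# element matching the given CSS selector. `browser` is expected to be a
# Watir::Browser, but any object that responds to #elements will work.
def heading_texts(browser, selector = 'article header h2')
  browser.elements(css: selector).map { |element| element.text.strip }
end
```

With a real browser this could be called as puts heading_texts(browser) right after the browser.goto line in the Rails blog example.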
Following links
This next example follows the "Random article" link on the Wikipedia main page:
require 'watir'
browser = Watir::Browser.new
browser.goto('http://en.wikipedia.org/wiki/Main_Page')
browser.link(text: 'Random article').click
browser.wait_until { browser.h1.text != 'Main Page' }
puts browser.url
browser.close
The #link convenience method is used to find the link based on its text.
The #click method fires the click event on the link element just as it would if you clicked it manually, and the browser #url method returns the address of the current page.
The Watir API is asynchronous and non-blocking in places, which means that you may need to explicitly test for the presence of elements before you interact with them. In this example the #wait_until method is used to wait until the page heading (the h1 element) has changed. Without that line the address of the starting page will be printed out before the click event and the load of the new page have had time to complete. Try it!
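To get a feel for what waiting on a condition involves, here is a minimal pure-Ruby polling loop. It is only an illustration of the general idea, not Watir's actual implementation; the name poll_until and its default timeout and interval are assumptions made up for this sketch.

```ruby
# Illustrative polling loop, similar in spirit to Watir's #wait_until.
# `poll_until` is a hypothetical name; this is NOT how Watir implements it.
def poll_until(timeout: 5, interval: 0.1)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    result = yield
    return result if result # stop as soon as the block returns a truthy value

    if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      raise "condition not met within #{timeout} seconds"
    end

    sleep interval # give the page (or anything else) time to change
  end
end
```

The block is re-evaluated until it returns something truthy or the deadline passes, which is why a condition such as browser.h1.text != 'Main Page' only succeeds once the new page has loaded. In recent Watir versions, #wait_until should also accept a timeout: keyword if the default is too short or too long; check the documentation for the version you have installed.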
Filling in a form
Next let’s look at an example of how to fill out a form, by writing a script to search the GOV.UK website for articles about passports:
require 'watir'
browser = Watir::Browser.new
browser.goto('https://www.gov.uk/')
browser.input(name: 'q').send_keys('passport', :return)
results = browser.element(id: 'js-results').wait_until(&:present?)
results.elements(css: 'li a').each do |element|
puts element.text.strip
end
browser.close
The #input method is used to find the text input for the query based on its name, and then the #send_keys method types the given value into the input just as if we had been using a web browser directly. The :return key at the end submits the form. We then wait for the results container to be present and list out the top results.
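If you want to run several searches, the same steps can be wrapped in a small method. The sketch below is one possible way to do that; search_titles is a hypothetical name, and the input name and results id are simply the ones used in the example above, so they may stop working whenever the site is redesigned.

```ruby
# Hypothetical helper: types a query into the named input, submits it with
# the Return key, waits for the results container, and returns the link texts.
# The defaults for input_name and results_id come from the example above and
# are assumptions about the current markup of the site.
def search_titles(browser, query, input_name: 'q', results_id: 'js-results')
  browser.input(name: input_name).send_keys(query, :return)
  results = browser.element(id: results_id).wait_until(&:present?)
  results.elements(css: 'li a').map { |link| link.text.strip }
end
```

After browser.goto('https://www.gov.uk/'), calling puts search_titles(browser, 'passport') should print the same list as the script above.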
Executing JavaScript
Watir can be used to execute JavaScript, which means you can interact with popup alert dialogs, search interfaces, and text editors, simulate drag & drop, and much more.
For example, here’s how to detect which version of jQuery a website is using:
require 'watir'
browser = Watir::Browser.new
browser.goto('http://jquery.com/')
version = browser.execute_script('return jQuery.fn.jquery')
puts "Using jQuery #{version}"
browser.close
Simply call the #execute_script method with the JavaScript you want to run, using return to pass a value back to Ruby. JavaScript values such as null, undefined, true, false, strings, arrays, and objects are all converted to their Ruby equivalents.
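To illustrate the conversion, the sketch below returns a JavaScript object from the page and reads it in Ruby as a Hash with string keys. The method name page_summary and the particular fields it collects are assumptions chosen for this example.

```ruby
# Hypothetical helper: fetches a couple of facts about the current page in a
# single script call. The JavaScript object is returned to Ruby as a Hash
# with string keys, and the number arrives as an Integer.
def page_summary(browser)
  summary = browser.execute_script(<<~JS)
    return {
      title: document.title,
      linkCount: document.links.length
    };
  JS
  "#{summary['title']} (#{summary['linkCount']} links)"
end
```

Calling puts page_summary(browser) after any browser.goto line prints the page title followed by the number of links found on the page.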
Next steps…
Recommended next steps for learning more about Watir:
- Download the Watir cheat sheet below and try some different Watir methods
- Write the same script using Mechanize and Watir to get a feel for the differences
- Write a script using Watir which scrapes a JavaScript-driven search interface