Web scraping is an approach for extracting data from websites that don’t have an API. Web scraping code is inherently “brittle” (prone to breaking over time due to changes in the website content and structure), but it’s a flexible technique with a broad range of uses.
This tutorial will show you how to scrape websites with Ruby and the Poltergeist gem, a PhantomJS driver for Capybara.
Getting started
First, make sure you have the poltergeist gem installed (Poltergeist also needs PhantomJS itself to be installed on your machine):
gem install poltergeist
Then copy and paste the following short Ruby script:
require 'capybara/poltergeist'
session = Capybara::Session.new(:poltergeist)
session.visit('http://stackoverflow.com/')
puts session.document.title
Run the script with Ruby, and you should see the page title of the Stack Overflow website. This demonstrates the core of web scraping: navigating over HTTP and extracting data from HTML. Try changing the URL to a different website.
Extracting data
This next example extracts the title of the most recent post on the Rails blog:
require 'capybara/poltergeist'
session = Capybara::Session.new(:poltergeist)
session.visit('http://weblog.rubyonrails.org/')
element = session.first('article header h2')
puts element.text.strip
The #first method takes a CSS selector string and returns the first matching element. The #text method returns the textual content inside the element, in this case the title of the blog entry. You can instead use the #all method to select every matching element, or the #find method when exactly one element should match (it raises an error if no element, or more than one, matches).
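As a minimal sketch of #all in use, the helper below collects the text of every matching heading instead of only the first. The method name post_titles and its default selector are illustrative assumptions, not part of Capybara; session is expected to be a Capybara session like the one created above.

```ruby
# Hypothetical helper (not part of Capybara): returns the stripped text of
# every element matching the selector, using the session's #all method.
def post_titles(session, selector = 'article header h2')
  session.all(selector).map { |element| element.text.strip }
end

# Usage, assuming the poltergeist gem and PhantomJS are installed:
#   require 'capybara/poltergeist'
#   session = Capybara::Session.new(:poltergeist)
#   session.visit('http://weblog.rubyonrails.org/')
#   puts post_titles(session)
```

Because #all returns a collection, the usual Enumerable methods such as #map and #each can be applied to the results.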
Following links
This next example follows the "Random article" link on the Wikipedia main page:
require 'capybara/poltergeist'
session = Capybara::Session.new(:poltergeist)
session.visit('http://en.wikipedia.org/wiki/Main_Page')
session.click_link('Random article')
puts session.current_url
The #click_link convenience method is used to find the link based on its text and then fire a click event just as if you had clicked on the link manually.
The session #current_url method returns the address of the current page.
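To make the effect of a click visible, the two methods can be combined into a small helper that records the address before and after following a link. This is only a sketch: follow_link is a hypothetical name, and it assumes the driver has finished loading the new page by the time #current_url is read again.

```ruby
# Hypothetical helper: clicks the link with the given text and reports
# which address the session started from and where it ended up.
def follow_link(session, link_text)
  before = session.current_url
  session.click_link(link_text)
  { from: before, to: session.current_url }
end

# Usage, assuming the poltergeist gem and PhantomJS are installed:
#   require 'capybara/poltergeist'
#   session = Capybara::Session.new(:poltergeist)
#   session.visit('http://en.wikipedia.org/wiki/Main_Page')
#   p follow_link(session, 'Random article')
```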
Filling in a form
Next let’s look at an example of how to fill out a form by writing a script to search the GOV.UK website for articles about passports:
require 'capybara/poltergeist'
session = Capybara::Session.new(:poltergeist)
session.visit('https://www.gov.uk/')
input = session.fill_in('q', with: 'passport')
input.send_keys :enter
session.all('#js-results li a').each do |element|
puts element.text.strip
end
The #fill_in method is used to enter the search query in the input named "q". Then the #send_keys method sends the Enter key to that input, which submits the form.
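The same steps can be wrapped in a reusable method so that the query, field name, and result selector become parameters. This is a sketch under assumptions: search_titles is a hypothetical name, the default selector is simply the one used in the script above, and the input is looked up again with #find_field so the code does not depend on what #fill_in returns in a given Capybara version.

```ruby
# Hypothetical helper: submits a search by typing into a text input and
# pressing Enter, then returns the stripped text of each result link.
def search_titles(session, query, field: 'q', results: '#js-results li a')
  session.fill_in(field, with: query)
  # Look the input up again rather than relying on #fill_in's return value.
  session.find_field(field).send_keys(:enter)
  session.all(results).map { |element| element.text.strip }
end

# Usage, assuming the poltergeist gem and PhantomJS are installed:
#   require 'capybara/poltergeist'
#   session = Capybara::Session.new(:poltergeist)
#   session.visit('https://www.gov.uk/')
#   puts search_titles(session, 'passport')
```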
Executing JavaScript
Poltergeist can execute JavaScript, which means you can interact with popup alert dialogs, search interfaces, and text editors, simulate drag and drop, and much more.
For example, here’s how to detect which version of jQuery a website is using:
require 'capybara/poltergeist'
session = Capybara::Session.new(:poltergeist)
session.visit('http://jquery.com/')
version = session.evaluate_script('jQuery.fn.jquery')
puts "Using jQuery #{version}"
Simply call the #evaluate_script method with the JavaScript expression you want to evaluate. JavaScript values such as null, undefined, true, false, strings, arrays, and objects are all converted to their Ruby equivalents.
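The value conversion is handy for pages that may not load jQuery at all. The sketch below guards the lookup in JavaScript and relies on null arriving in Ruby as nil; jquery_version is a hypothetical helper name, not a Poltergeist method.

```ruby
# Hypothetical helper: returns the jQuery version string, or nil when the
# page does not define a global jQuery object.
def jquery_version(session)
  session.evaluate_script(
    "typeof jQuery === 'undefined' ? null : jQuery.fn.jquery"
  )
end

# Usage, assuming the poltergeist gem and PhantomJS are installed:
#   require 'capybara/poltergeist'
#   session = Capybara::Session.new(:poltergeist)
#   session.visit('http://jquery.com/')
#   version = jquery_version(session)
#   puts version ? "Using jQuery #{version}" : 'jQuery not found'
```

Checking typeof inside the expression avoids referencing an undefined variable on pages without jQuery.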