HOWTO extract data from HTML with Ruby

This guide will show you how to parse and extract data from HTML documents with Ruby—a useful technique for scraping websites, analysis and conversion of offline documents, etc.


How to parse HTML with Ruby, 3 ways

The nokogiri gem is a popular Ruby HTML/XML parser which uses libxml2, and is notable for being the default HTML parser for mechanize. Parse HTML with nokogiri using the Nokogiri::HTML method:

require 'nokogiri'

document = Nokogiri::HTML(input)

The oga gem is a Ruby XML/HTML parser with a small native extension. Parse HTML with oga using the Oga.parse_html method:

require 'oga'

document = Oga.parse_html(input)

You might want to use oga if you have difficulties installing nokogiri or problems using nokogiri in alternative Ruby environments like Rubinius.

The nokogumbo gem is a wrapper for gumbo, Google’s pure-C HTML5 parser. Parse HTML with nokogumbo using the Nokogiri::HTML5 method:

require 'nokogumbo'

document = Nokogiri::HTML5(input)

Nokogumbo returns nokogiri data structures, which makes switching to it from nokogiri relatively straightforward. Most of your existing code will work with nokogumbo, but it won’t always be a 100% drop-in replacement because the two parsers can interpret the same markup differently.

Parsing HTML fragments

You can also parse fragments of HTML instead of complete documents. With nokogiri and nokogumbo, use the fragment class method; with oga, use the same Oga.parse_html method as before:

Nokogiri:

require 'nokogiri'

fragment = Nokogiri::HTML.fragment('<span>Chunky bacon</span>')

Nokogumbo:

require 'nokogumbo'

fragment = Nokogiri::HTML5.fragment('<span>Chunky bacon</span>')

Oga:

require 'oga'

fragment = Oga.parse_html('<span>Chunky bacon</span>')

Searching by CSS selector

The easiest way to identify specific elements in a document is to search for them by CSS selector. CSS selectors are familiar to web developers (unlike XPath), and they insulate your code from simple structural changes to the data (e.g. nesting an element within another element).

Nokogiri provides the #search method; oga provides the #css method. For example, here’s how you would search for all anchor elements within a document:

Nokogiri:

document.search('a')

Oga:

document.css('a')

To search for a single element, nokogiri provides the #at method and oga provides the #at_css method. For example, searching for the title element:

Nokogiri:

document.at('title')

Oga:

document.at_css('title')

Traversing every element

Sometimes it’s necessary to traverse every element in a document to extract the information you need, e.g. counting the frequencies of various HTML tags. With nokogiri we can use the #traverse method; with oga we can use the #each_node method:

Nokogiri:

tags = Hash.new(0)

document.traverse do |node|
  next unless node.is_a?(Nokogiri::XML::Element)

  tags[node.name] += 1
end

Oga:

tags = Hash.new(0)

document.each_node do |node|
  next unless node.is_a?(Oga::XML::Element)

  tags[node.name] += 1
end

Both methods yield every node, not just elements, so for this example we need to filter out anything that isn’t a Nokogiri::XML::Element object or an Oga::XML::Element object.

Extracting element text

To extract text from an element, use either the #text method or the #inner_text method. For example, to extract the document title:

Nokogiri:

document.at('title').text

Oga:

document.at_css('title').text

Both methods return the same result with nokogiri (#text is an alias for #inner_text), but with oga they behave differently: #inner_text returns only the text directly inside the subject element and leaves out the text of any child elements.

Extracting attribute values

With nokogiri, attributes can be accessed using square bracket notation; with oga, they can be accessed with the #get method. For example, here’s how you might extract the description from a meta tag:

Nokogiri:

document.at('meta[name="description"]')['content']

Oga:

document.at_css('meta[name="description"]').get('content')

All three libraries differ in how they parse boolean attributes (attributes written without a value, such as disabled). With nokogiri you’ll get the attribute name, with nokogumbo an empty string, and with oga nil:

Nokogiri:

input['disabled']  # => 'disabled'

Nokogumbo:

input['disabled']  # => ''

Oga:

input.get('disabled')  # => nil

Extracting attribute hashes

Nokogiri implements Enumerable for attributes, which means you can get a hash of attribute keys and values just by calling the #to_h method (avoid the #attributes method, which returns Nokogiri::XML::Attr objects as values instead of strings). Oga returns an array of attribute objects and therefore requires some extra work to convert to a hash:

Nokogiri:

element.to_h

Oga:

element.attributes.each_with_object({}) { |a, h| h[a.name] = a.value }

Extracting tabular data

Combine searching by CSS selector with text extraction and you can easily extract data tables from your HTML documents. For example, to extract the first table in an HTML document and output the data as comma-separated values:

Nokogiri:

require 'csv'

document.at('table').search('tr').each do |row|
  cells = row.search('th, td').map { |cell| cell.text.strip }

  puts CSV.generate_line(cells)
end

Oga:

require 'csv'

document.at_css('table').css('tr').each do |row|
  cells = row.css('th, td').map { |cell| cell.text.strip }

  puts CSV.generate_line(cells)
end

This is a simplistic example: it assumes a single, non-nested table and ignores features such as cells that span several rows or columns. Still, it gives you an idea of what you can achieve with relatively little code.