HOWTO parse HTML with Ruby & Nokogiri

This guide shows you how to parse and extract data from HTML documents with Ruby and Nokogiri. Nokogiri is a popular Ruby HTML/XML parser built on libxml2, and it is the HTML parser that Mechanize uses by default.


Parsing an HTML document

Use the Nokogiri::HTML method to parse your HTML input:

require 'nokogiri'

document = Nokogiri::HTML(input)

You can also parse fragments of HTML using the Nokogiri::HTML.fragment method:

fragment = Nokogiri::HTML.fragment('<span>Chunky bacon</span>')

Searching by CSS selector

The easiest way to identify specific elements in a document is to search for them by CSS selector. CSS selectors are familiar to web developers (unlike XPath), and they isolate your code from simple structural changes to the data (e.g. nesting an element within another element).

For example, here’s how you would search for all anchor elements within a document:

links = document.search('a')

To search for a single element, use the #at method instead. It returns the first matching element, or nil if nothing matches. For example:

title = document.at('title')

Extracting element text

To extract text from an element, use either the #text method or the #inner_text method (they are aliases for the same method). The result includes the text of any nested elements. For example, to extract the document title:

title = document.at('title').text

Extracting attribute values

Attributes can be accessed using square bracket notation. For example, here’s how you might extract the description from a meta tag:

description = document.at('meta[name="description"]')['content']

Extracting attribute hashes

Nokogiri nodes include Enumerable and enumerate their attributes as name and value pairs, which means you can get a hash of attribute names and values just by calling the #to_h method. Avoid the #attributes method if you want plain strings, because the values in the hash it returns are Nokogiri::XML::Attr objects. For example:

attributes = element.to_h

Traversing every element

Sometimes it’s necessary to traverse every element in a document to extract the information you need, for example to count the frequencies of various HTML tags. With Nokogiri you can use the #traverse method:

tags = Hash.new(0)

document.traverse do |node|
  next unless node.is_a?(Nokogiri::XML::Element)

  tags[node.name] += 1
end

The #traverse method yields every node in the document to the block, including text nodes and the document node itself, so for this example we need to skip anything that isn’t a Nokogiri::XML::Element object.

Extracting tabular data

Combine searching for elements by CSS selector and extracting text and you can easily extract data tables from your HTML documents. For example, to extract the first table in an HTML document and output the data as comma-separated values:

require 'csv'

document.at('table').search('tr').each do |row|
  cells = row.search('th, td').map { |cell| cell.text.strip }

  puts CSV.generate_line(cells)
end

This is a simplistic example which makes some assumptions and doesn’t support complex table features such as merged cells (colspan and rowspan) or nested tables, but it gives you an idea of what you can achieve with relatively little code.


Looking for an HTML parser that’s easier to install? Try oga.