This guide will show you how to parse and extract data from HTML documents with Ruby and Nokogiri. Nokogiri is a popular Ruby HTML/XML parser which uses libxml2, and is notable for being the default HTML parser for Mechanize.
Parsing an HTML document
Use the Nokogiri::HTML method to parse your HTML input:
require 'nokogiri'
document = Nokogiri::HTML(input)
You can also parse fragments of HTML using the Nokogiri::HTML.fragment method:
fragment = Nokogiri::HTML.fragment('<span>Chunky bacon</span>')
Searching by CSS selector
The easiest way to identify specific elements in a document is to search for them by CSS selector. CSS selectors are familiar to most web developers (more so than XPath), and they insulate your code from simple structural changes to the data, such as an element being nested inside another element.
For example, here’s how you would search for all anchor elements within a document:
links = document.search('a')
To search for a single element, use the #at method instead, which returns the first match (or nil if nothing matches). For example:
title = document.at('title')
Extracting element text
To extract text from an element, use either the #text method or the #inner_text method. For example, to extract the document title:
title = document.at('title').text
Extracting attribute values
Attributes can be accessed using square bracket notation. For example, here’s how you might extract the description from a meta tag:
description = document.at('meta[name="description"]')['content']
Extracting attribute hashes
Nokogiri elements implement Enumerable over their attributes, which means you can get a hash of attribute names and values just by calling the #to_h method. Avoid the #attributes method for this, as its values are Nokogiri::XML::Attr objects rather than strings. For example:
attributes = element.to_h
Traversing every element
Sometimes it’s necessary to traverse every element in a document to extract the information you need, for example to count the frequencies of various HTML tags. With Nokogiri you can use the #traverse method:
tags = Hash.new(0)
document.traverse do |node|
  next unless node.is_a?(Nokogiri::XML::Element)
  tags[node.name] += 1
end
The #traverse method yields every node in the document, including text nodes and comments, so for this example we need to filter out anything that isn’t a Nokogiri::XML::Element object.
Extracting tabular data
Combine searching for elements by CSS selector and extracting text and you can easily extract data tables from your HTML documents. For example, to extract the first table in an HTML document and output the data as comma-separated values:
require 'csv'
document.at('table').search('tr').each do |row|
  cells = row.search('th, td').map { |cell| cell.text.strip }
  puts CSV.generate_line(cells)
end
This is a simplistic example which makes some assumptions and doesn’t support complex table features, but it gives you an idea of what you can achieve with relatively little code.
Looking for an HTML parser that’s easier to install? Try oga.