This guide shows how to parse and extract data from HTML documents with Ruby, which is useful for scraping websites, converting documents to other formats, and analyzing documents.
How to parse HTML with Ruby, 3 ways
The nokogiri gem is a popular Ruby HTML/XML parser which uses libxml2, and is notable for being the default HTML parser for mechanize. Parse HTML with nokogiri using the Nokogiri::HTML method:
require 'nokogiri'
document = Nokogiri::HTML(input)
The oga gem is a Ruby XML/HTML parser with a small native extension. Parse HTML with oga using the Oga.parse_html method:
require 'oga'
document = Oga.parse_html(input)
You might want to use oga if you have difficulties installing nokogiri or problems using nokogiri in alternative Ruby environments like Rubinius.
The nokogumbo gem is a wrapper for gumbo, Google’s pure-C HTML5 parser. Parse HTML with nokogumbo using the Nokogiri::HTML5 method:
require 'nokogumbo'
document = Nokogiri::HTML5(input)
Nokogumbo returns nokogiri data structures, which makes switching from nokogiri relatively straightforward. Most of your existing code will work unchanged, but nokogumbo is not always a drop-in replacement because the two parsers handle some markup differently.
Parsing HTML fragments
You can also parse fragments of HTML instead of complete documents. With nokogiri and nokogumbo, use the fragment class method; with oga, use Oga.parse_html as before:
require 'nokogiri'
fragment = Nokogiri::HTML.fragment('<span>Chunky bacon</span>')
Searching by CSS selector
The easiest way to identify specific elements in a document is to search for them by CSS selector. It’s familiar for web developers (unlike XPath), and isolates code from simple structural changes to the data like nesting an element within another element.
Nokogiri provides the #search method; oga provides the #css method. For example, here’s how you would search for all anchor elements within a document:
document.search('a')
To search for a single element, nokogiri provides the #at method and oga provides the #at_css method. For example, searching for the title element:
document.at('title')
Traversing every element
Sometimes it’s necessary to traverse every element in a document to extract the information you need, for example to count how often each HTML tag occurs. With nokogiri we can use the #traverse method; with oga we can use the #each_node method:
tags = Hash.new(0)
document.traverse do |node|
  next unless node.is_a?(Nokogiri::XML::Element)
  tags[node.name] += 1
end
Both methods return all nodes, so for this example we need to filter out anything that isn’t a Nokogiri::XML::Element object or an Oga::XML::Element object.
Extracting element text
To extract text from an element, use either the #text method or the #inner_text method. For example, to extract the document title:
document.at('title').text
Both methods return the same result with nokogiri (#text is an alias for #inner_text), but with oga they behave differently—#inner_text returns the immediate child text of the subject element, not including the text of any child elements.
Extracting attribute values
With nokogiri, attributes can be accessed using square bracket notation; with oga, they can be accessed with the #get method. For example, here’s how you might extract the description from a meta tag:
document.at('meta[name="description"]')['content']
All three libraries parse boolean attributes differently: nokogiri returns the attribute name, nokogumbo returns an empty string, and oga returns nil. For example, with nokogiri:
# `input` here is a parsed <input disabled> element, not the HTML string
input['disabled'] # => 'disabled'
Extracting attribute hashes
Nokogiri implements Enumerable for attributes, which means you can get a hash of attribute keys and values just by calling the #to_h method (avoid the #attributes method which returns Nokogiri::XML::Attr values). Oga returns an array of attribute objects and therefore requires some extra work to convert to a hash:
element.to_h
Extracting tabular data
Combine searching for elements by CSS selector and extracting text and you can easily extract data tables from your HTML documents. For example, to extract the first table in an HTML document and output the data as comma-separated values:
require 'csv'
document.at('table').search('tr').each do |row|
  cells = row.search('th, td').map { |cell| cell.text.strip }
  puts CSV.generate_line(cells)
end
This is a simplistic example which makes some assumptions and doesn’t support complex table features, but it gives you an idea of what you can achieve with relatively little code.