HOWTO parse HTML tables with Nokogiri

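This guide shows how to extract data from an HTML table with Nokogiri: parse the document, select the table, loop through its rows and cells, and output the cell text.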


Step 1: Parse the document

Use the Nokogiri::HTML method to parse your HTML input:

require 'nokogiri'

document = Nokogiri::HTML(input)

Step 2: Select the target table element

If there is only one table in the document you can select it by the tag name:

table = document.at('table')

If the document contains several tables, you can instead use a CSS selector to identify a specific table by its id or class name:

table = document.at('#example-id')

table = document.at('.example-table-class')

Alternatively, if the table does not have any distinguishing attributes, you can select it by its position relative to the other tables. For example:

tables = document.search('table')

table = tables.last # last table in the document

Step 3: Select the table cell elements

Use the Nokogiri #search method to select descendant table rows, and loop through each row. Then select the descendant table cells in each row. For example:

table.search('tr').each do |tr|
  cells = tr.search('th, td')

  # output cell data
end

Remember that headings can be specified using th elements or td elements. Depending on your project you may want to filter out the headings, or certain rows and columns.

Step 4: Extract and output the cell data

Finally, you can extract and output the data from each cell. For example:

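require 'csv'

# Collect the stripped text of every cell in the row
values = cells.map { |cell| cell.text.strip }

# Output the row as a single line of CSV data
puts CSV.generate_line(values)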

In most cases the data of interest will be the cell text, but you might want to filter or add to the data based on the cell attributes.

This example uses the String#strip method to remove any leading or trailing whitespace from the text, and the CSV.generate_line method to output the cell values as a line of CSV data. The CSV library must be loaded with require 'csv' before it can be used.
