This guide will show you how to parse HTML tables with Nokogiri.
Step 1: Parse the document
Use the Nokogiri::HTML method to parse your HTML input:
require 'nokogiri'
document = Nokogiri::HTML(input)
Step 2: Select the target table element
If there is only one table in the document, you can select it by tag name:
table = document.at('table')
If the document contains several tables, you can instead use a CSS selector to identify a specific table by its id or class name:
table = document.at('#example-id')
table = document.at('.example-table-class')
Alternatively, if the table has no distinguishing attributes, you can select it by its position relative to the other tables. For example:
tables = document.search('table')
table = tables.last # last table in the document
Step 3: Select the table cell elements
Use Nokogiri's #search method to select the descendant table rows and loop through each row, then select the descendant table cells in each row. For example:
table.search('tr').each do |tr|
  cells = tr.search('th, td')
  # output cell data
end
Remember that headings can be specified using th elements or td elements. Depending on your project, you may want to filter out the headings, or certain rows and columns.
Step 4: Extract and output the cell data
Finally, you can extract and output the data from each cell. For example:
require 'csv'

# inside the row loop from Step 3
values = cells.map { |cell| cell.text.strip }
puts CSV.generate_line(values)
In most cases the data of interest will be the cell text, but you might want to filter or add to the data based on the cell attributes. This example uses the String#strip method to remove any leading or trailing whitespace from the text, and the CSV.generate_line method to output the cell values as a line of CSV data. CSV.generate_line comes from Ruby's csv library, which must be loaded with require 'csv'.