HOWTO scrape website links with Ruby Mechanize

Need to get all the links from a page using Mechanize? Getting stuck extracting the href attribute from the links? It’s straightforward once you know a few details about how Mechanize works, and there’s no need to resort to complicated XPath expressions!

Scraping all the links on a page

The simplest way to enumerate all the links on a page is to use the Mechanize::Page#links method, which returns an array of Mechanize::Page::Link objects. For example:

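require 'mechanize'

mechanize = Mechanize.new

# Placeholder URL; substitute the page you want to scrape.
url = 'https://example.com/'

page = mechanize.get(url)

# Each element is a link object wrapping a link found on the page.
page.links.each do |link|
  puts link.href
end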

Mechanize::Page::Link objects have an href attribute method: nice and easy.

Alternatively, if you’re selecting the links with the #search method, you’ll need to look up the href attribute a bit differently:

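require 'mechanize'

mechanize = Mechanize.new

# Placeholder URL; substitute the page you want to scrape.
url = 'https://example.com/'

page = mechanize.get(url)

# Each element is a Nokogiri node, so attributes are read with [].
page.search('a').each do |link|
  puts link['href']
end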

The #search method returns a Nokogiri::XML::NodeSet of generic Nokogiri::XML::Element objects, which don’t have an href attribute method. Instead you can use the square-brackets method to look up href as you would any other attribute.

Resolving relative links to absolute links

When following a hyperlink, user agents resolve the href attribute in order to convert relative references into absolute URLs. Whilst Mechanize does this for you implicitly when following links, you’ll need to do it explicitly when you’re scraping links.
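To make the idea of resolution concrete, here is a minimal sketch that uses only Ruby’s standard URI library rather than Mechanize. The base URL, the href values, and the resolve_hrefs helper are all hypothetical, chosen purely for illustration; the point is just to show how different kinds of relative reference turn into absolute URLs.

```ruby
require 'uri'

# Hypothetical helper: resolve each href against the URL of the page it came from.
# URI.join applies standard relative-reference resolution.
def resolve_hrefs(base_url, hrefs)
  hrefs.map { |href| URI.join(base_url, href) }
end

# Hypothetical URL of the page the links were scraped from.
base_url = 'https://example.com/docs/guide/index.html'

# Hypothetical href values as they might appear in the markup.
hrefs = ['../api/', 'intro.html', '/about', 'https://other.example/x', '#top']

resolve_hrefs(base_url, hrefs).each { |uri| puts uri }
# Prints:
#   https://example.com/docs/api/
#   https://example.com/docs/guide/intro.html
#   https://example.com/about
#   https://other.example/x
#   https://example.com/docs/guide/index.html#top
```

Note that an href which is already absolute passes through unchanged, so you can resolve every link the same way without checking first.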

Use the Mechanize#resolve method to resolve the href attribute into a URI object:

uri = mechanize.resolve(href)

You can then easily compare different URI values from different documents, store them in a database, or pass them to another part of your application for further processing.
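As a rough sketch of that comparison step, the snippet below canonicalises resolved URLs before de-duplicating them, again using only the standard URI library. The canonical_url helper and the sample URLs are hypothetical, and dropping the fragment is an assumption that suits crawling whole pages; keep it if your application cares about in-page anchors.

```ruby
require 'uri'

# Hypothetical helper: reduce a URI (or URL string) to a canonical string form.
def canonical_url(uri)
  copy = URI(uri.to_s).normalize # lowercases scheme and host, turns an empty path into "/"
  copy.fragment = nil            # assumption: fragments don't identify a different page
  copy.to_s
end

# Hypothetical resolved links collected from several pages.
links = [
  'https://Example.com/a#section',
  'https://example.com/a',
  'https://example.com',
  'https://example.com/'
]

unique = links.map { |link| canonical_url(link) }.uniq

puts unique
# Prints:
#   https://example.com/a
#   https://example.com/
```

Comparing canonical strings rather than URI objects keeps things simple if the values are going to end up in a database column anyway.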