Need to get all the links from a page using Mechanize? Getting stuck extracting the href attribute from the links? It’s straightforward once you know a few details about how Mechanize works, and there’s no need to resort to complicated XPath expressions!
Scraping all the links on a page
The simplest way to enumerate all the links on a page is to use the Mechanize::Page#links method, which returns an array of Mechanize::Page::Link objects. For example:
    require 'mechanize'

    mechanize = Mechanize.new

    # url is the address of the page you want to scrape
    page = mechanize.get(url)

    page.links.each do |link|
      puts link.href
    end
Mechanize::Page::Link objects have an href attribute method: nice and easy.
Alternatively, if you’re selecting the links with the #search method, you’ll need to look up the href attribute a bit differently:
    require 'mechanize'

    mechanize = Mechanize.new

    # url is the address of the page you want to scrape
    page = mechanize.get(url)

    page.search('a').each do |link|
      puts link['href']
    end
The #search method returns generic Nokogiri::XML::Element objects, which don’t have an href attribute method. Instead, use the square-brackets method to look up href as you would any other attribute.
Resolving relative links to absolute links
When following a hyperlink, user agents resolve the href attribute in order to convert relative references into absolute URLs. Whilst Mechanize does this for you implicitly when following links, you’ll need to do it explicitly when you’re scraping links.
Use the Mechanize#resolve method to resolve the href attribute into a URI object:
uri = mechanize.resolve(href)
You can then easily compare different URI values from different documents, store them in a database, or pass them to another part of your application for further processing.
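To see what resolution does to different kinds of href values, here is a minimal sketch that uses only Ruby’s standard library (URI.join) rather than Mechanize itself, so it runs without fetching a page. The resolve_links helper, the base URL and the href strings are all made up for illustration; in a real script the base would typically be the URI of the page you fetched, and the hrefs would come from the links you scraped.

```ruby
require 'uri'

# Hypothetical helper: resolves each href against a base URL, drops the
# fragment, and returns the unique absolute URLs as strings.
def resolve_links(base, hrefs)
  hrefs.map do |href|
    uri = URI.join(base, href) # relative reference -> absolute URI
    uri.fragment = nil         # "#top" and friends point at the same document
    uri.to_s
  end.uniq
end

# Made-up example values standing in for a fetched page and its links.
base  = 'https://example.com/articles/index.html'
hrefs = ['/about', 'page2.html', '../contact', 'https://example.com/about', '#top']

resolved = resolve_links(base, hrefs)
resolved.each { |url| puts url }
```

Once every link is absolute, duplicates such as /about and https://example.com/about collapse into a single entry, which is what makes comparing links across documents practical.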