
HOWTO scrape BBC Food recipes with Ruby Mechanize

May 2016, and the BBC has announced plans to cut 11,000+ recipes from its food website (source: www.telegraph.co.uk). Cue suggestions that someone should scrape and download all the recipes.

Speak Ruby? Interested in learning how to scrape the BBC Food recipes using Ruby and the mechanize gem? This tutorial is for you!


Getting started

Firstly, make sure you have the mechanize gem installed:

gem install mechanize

Start your Ruby script by requiring mechanize and creating a new instance:

require 'mechanize'

mechanize = Mechanize.new

mechanize.user_agent_alias = 'Mac Safari'

The user agent alias can be any key from Mechanize::AGENT_ALIASES, so pick your favourite. If you’re not familiar with mechanize, be sure to check out the introductory mechanize tutorial first.

Choosing a scraping strategy

The first thing you have to decide when scraping a website is what strategy you are going to use to get all the data you need. Most datasets of interest will be too large to fit on a single page, and you have to figure out what combination of search forms and navigation links to use in order to surface all the data.

Looking at www.bbc.co.uk/food we can see that all the recipes are indexed by chef. If we download the recipes for each chef then that should give us all the recipes. Alternatively you could search for recipes by programme or ingredient.

Fetching a list of all the chefs

The challenge is now to fetch a list of all the chefs, data which is also spread across multiple pages. Starting at the top level "Chefs" page, we can click on each of the A–Z navigation links and extract the list of chefs on each of those pages to give us the full list. Translating that into Ruby gives us code that looks like this:

chefs = []

chefs_url = 'http://www.bbc.co.uk/food/chefs'

chefs_page = mechanize.get(chefs_url)

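# Visit each A-Z index page and collect the chef identifier from every chef link.
chefs_page.links_with(href: /\/by\/letters\//).each do |letter_link|
  atoz_page = mechanize.click(letter_link)

  atoz_page.links_with(href: /\A\/food\/chefs\/\w+\z/).each do |chef_link|
    chefs << chef_link.href.split('/').last
  end
end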

Notice that instead of clicking each individual chef link we extract the trailing identifier from the URL. This is a simple optimization to avoid an extra request for each chef. If you look at the individual chef pages you can see that the "See all recipes by" links all have a predictable structure, so we don’t need to fetch those intermediary pages.
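To make the URL handling concrete, here is a minimal sketch that runs without any network access. The hrefs are made-up examples (only the URL patterns come from the code above), the uniq call is an extra precaution that is not in the original script, and the sketch assumes the search URL accepts the chef identifier exactly as it appears in the link:

```ruby
# Hypothetical hrefs of the kind an A-Z page might contain.
# Only the URL patterns are taken from the tutorial; the names are made up.
hrefs = [
  '/food/chefs',
  '/food/chefs/by/letters/a',
  '/food/chefs/example_chef',
  '/food/chefs/another_chef',
  '/food/recipes/search?chefs[]=example_chef',
  '/food/chefs/example_chef'
]

chef_href = /\A\/food\/chefs\/\w+\z/
search_url = 'http://www.bbc.co.uk/food/recipes/search?chefs[]='

# Keep only links to individual chef pages, then take the trailing identifier.
# uniq guards against the same chef being linked twice on a page.
chef_ids = hrefs.grep(chef_href).map { |href| href.split('/').last }.uniq

# Build each search URL directly, without fetching the chef page itself.
search_urls = chef_ids.map { |chef_id| search_url + chef_id }

puts search_urls
```

The index page, the A-Z link, and the search link all fail the anchored pattern, so only the two chef identifiers survive.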

Fetching the recipes for each chef

With a list of all the chef identifiers we can now search for recipes from each chef, which collectively should give us the entire dataset. Here’s the code:

require 'fileutils'

search_url = 'http://www.bbc.co.uk/food/recipes/search?chefs[]='

chefs.each do |chef_id|
  results_pages = []

  results_pages << mechanize.get(search_url + chef_id)

  dirname = File.join('bbcfood', chef_id)

  FileUtils.mkdir_p(dirname)

  while results_page = results_pages.shift
    links = results_page.links_with(href: /\A\/food\/recipes\/\w+\z/)

    links.each do |link|
      path = File.join(dirname, File.basename(link.href) + '.html')

      next if File.exist?(path)

      STDERR.puts "+ #{path}"

      mechanize.download(link.href, path)
    end

    if next_link = results_page.links.detect { |link| link.rel?('next') }
      results_pages << mechanize.click(next_link)
    end
  end
end

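Nothing too advanced, but it’s a bit dense, so let’s go through some of the details. Each chef gets their own directory under bbcfood, created with FileUtils.mkdir_p, and each recipe is saved there as an HTML file named after the last segment of its URL. Recipes that already exist on disk are skipped, so the script can be restarted without downloading everything again. The results_pages array acts as a queue: the loop shifts a page off the front, downloads its recipes, and pushes the next page of results onto the queue if the page has a rel="next" link. When there is no next link the queue runs dry and the loop moves on to the next chef.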
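If you want to experiment with the paging and file-naming logic before hitting the real site, the following sketch simulates it offline. Everything in it is a stand-in: the pages hash replaces real search results, the recipe names are invented, and File.write takes the place of mechanize.download.

```ruby
require 'fileutils'
require 'tmpdir'

# Hypothetical stand-in for search results: each page lists recipe hrefs and
# names the page a rel="next" link would lead to (nil on the last page).
pages = {
  page1: { recipe_hrefs: ['/food/recipes/example_soup_12345',
                          '/food/recipes/example_pie_67890'],
           next_page: :page2 },
  page2: { recipe_hrefs: ['/food/recipes/example_cake_24680'],
           next_page: nil }
}

saved = []
skipped = []

Dir.mktmpdir do |tmp|
  dirname = File.join(tmp, 'bbcfood', 'example_chef')
  FileUtils.mkdir_p(dirname)

  # Pretend one recipe was already downloaded on an earlier run.
  FileUtils.touch(File.join(dirname, 'example_pie_67890.html'))

  # Same queue shape as the script: start with the first page, then push
  # each following page until there is no next page.
  results_pages = [pages[:page1]]

  while (results_page = results_pages.shift)
    results_page[:recipe_hrefs].each do |href|
      path = File.join(dirname, File.basename(href) + '.html')

      if File.exist?(path)
        skipped << File.basename(path)
        next
      end

      # Stands in for mechanize.download(href, path).
      File.write(path, "placeholder for #{href}")
      saved << File.basename(path)
    end

    next_key = results_page[:next_page]
    results_pages << pages[next_key] if next_key
  end
end

puts "saved:   #{saved.inspect}"
puts "skipped: #{skipped.inspect}"
```

The pre-existing file is skipped and the other two recipes are written, which is the same behaviour that lets the real script resume after a crash.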


There you have it, a working BBC Food scraper in just 50 lines of code!

If you’ve been following along and running the script yourself you’ll no doubt be getting some 503 Service Unavailable errors which crash your script. 11,000+ requests is a significant amount of traffic in a short space of time.
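One common remedy is to retry a failed request after a pause, doubling the pause after each failure (exponential backoff). The following is only a sketch under stated assumptions: with_backoff and ServiceUnavailable are hypothetical names, and the error class stands in for whatever your HTTP client raises on a 503. With mechanize that is most likely Mechanize::ResponseCodeError with a response_code of '503', but check the documentation for the version you have installed.

```ruby
# Hypothetical stand-in for the error raised on a 503 response.
class ServiceUnavailable < StandardError; end

# Runs the block, retrying on ServiceUnavailable with a doubling delay.
# The sleeper argument is injectable so the sketch can run without waiting.
def with_backoff(max_attempts: 5, base_delay: 1.0, sleeper: Kernel.method(:sleep))
  attempt = 0

  begin
    attempt += 1
    yield attempt
  rescue ServiceUnavailable
    # Give up and re-raise once the attempts are used up.
    raise if attempt >= max_attempts

    # Delays grow as base_delay * 1, 2, 4, 8, ...
    sleeper.call(base_delay * (2**(attempt - 1)))
    retry
  end
end

# Demo: a fake request that fails three times before succeeding.
delays = []
calls = 0

result = with_backoff(base_delay: 0.5, sleeper: ->(seconds) { delays << seconds }) do
  calls += 1
  raise ServiceUnavailable, '503 Service Unavailable' if calls < 4

  :ok
end

puts "result: #{result}, calls: #{calls}, delays: #{delays.inspect}"
```

In the scraper you would wrap each request in the block, for example around the mechanize.get and mechanize.download calls, and rescue the real error class instead of the placeholder.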
