May 2016, and the BBC has announced plans to cut 11,000+ recipes from its food website (source: www.telegraph.co.uk). Cue suggestions that someone should scrape and download all the recipes.
Speak Ruby? Interested in learning how to scrape the BBC Food recipes using Ruby and the mechanize gem? This tutorial is for you!
Getting started
Firstly, make sure you have the mechanize gem installed:
gem install mechanize
Start your Ruby script by requiring mechanize and creating a new instance:
require 'mechanize'
mechanize = Mechanize.new
mechanize.user_agent_alias = 'Mac Safari'
Be sure to check out the introductory mechanize tutorial if you’re not familiar with mechanize. Pick your favourite alias from Mechanize::AGENT_ALIASES.
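If you’re curious which aliases are available, they’re the keys of the Mechanize::AGENT_ALIASES hash, so a quick peek (in irb, say) might look like this:
require 'mechanize'
# Print the names of the available user agent aliases
puts Mechanize::AGENT_ALIASES.keys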
Choosing a scraping strategy
The first thing you have to decide when scraping a website is what strategy you are going to use to get all the data you need. Most datasets of interest will be too large to fit on a single page, and you have to figure out what combination of search forms and navigation links to use in order to surface all the data.
Looking at www.bbc.co.uk/food we can see that all the recipes are indexed by chef. If we download the recipes for each chef then that should give us all the recipes. Alternatively you could search for recipes by programme or ingredient.
Fetching a list of all the chefs
The challenge now is to fetch a list of all the chefs, data that is itself spread across multiple pages. Starting at the top-level "Chefs" page, we can click on each of the A–Z navigation links and extract the list of chefs from each of those pages to build the full list. Translating that into Ruby gives us code that looks like this:
chefs = []
chefs_url = 'http://www.bbc.co.uk/food/chefs'
chefs_page = mechanize.get(chefs_url)

# Follow each of the A-Z navigation links
chefs_page.links_with(href: /\/by\/letters\//).each do |letter_link|
  atoz_page = mechanize.click(letter_link)

  # Keep the trailing identifier from each /food/chefs/<id> link
  atoz_page.links_with(href: /\A\/food\/chefs\/\w+\z/).each do |chef_link|
    chefs << chef_link.href.split('/').last
  end
end
Notice that instead of clicking each individual chef link we extract the trailing identifier from the URL. This is a simple optimisation to avoid an extra request for each chef. If you look at the individual chef pages you can see that the "See all recipes by" links all have a predictable structure, so we don’t need to fetch those intermediary pages.
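To make the extraction concrete, here’s what it does with an illustrative chef link (the identifier is hypothetical, purely for demonstration):
href = '/food/chefs/antonio_carluccio'  # hypothetical chef link href
href.split('/').last                    # => "antonio_carluccio"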
Fetching the recipes for each chef
With a list of all the chef identifiers we can now search for recipes from each chef, which collectively should give us the entire dataset. Here’s the code:
require 'fileutils'

search_url = 'http://www.bbc.co.uk/food/recipes/search?chefs[]='

chefs.each do |chef_id|
  results_pages = []
  results_pages << mechanize.get(search_url + chef_id)

  dirname = File.join('bbcfood', chef_id)
  FileUtils.mkdir_p(dirname)

  # Treat the array as a FIFO queue of search results pages
  while results_page = results_pages.shift
    links = results_page.links_with(href: /\A\/food\/recipes\/\w+\z/)

    links.each do |link|
      path = File.join(dirname, File.basename(link.href) + '.html')
      next if File.exist?(path)  # skip recipes we already have
      STDERR.puts "+ #{path}"
      mechanize.download(link.href, path)
    end

    # Queue up the next page of search results, if there is one
    if next_link = results_page.links.detect { |link| link.rel?('next') }
      results_pages << mechanize.click(next_link)
    end
  end
end
Nothing too advanced but it’s a bit dense, so let’s go through some of the details:
- All we need to search for recipes for a given chef is a simple GET request (see the example just after this list). There’s only a single parameter, and the chef identifiers are already encoded (they were extracted directly from the link URLs), so we don’t need any parameter encoding logic.
- The FileUtils.mkdir_p method is used to make sure the output directory structure exists.
- The while loop is used in combination with an array as a FIFO queue to implement pagination. The BBC website helpfully tags the "next page" links with a rel=next attribute.
- Each download is skipped if the file already exists: a basic form of caching that makes it possible to re-run the script without re-downloading everything.
- Downloading each recipe is simply a case of constructing the local path that we want to save it to, and calling the mechanize download method.
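For example, the search request for a single chef (using the same hypothetical identifier as before) is nothing more than:
# Hypothetical example of the single-parameter search request
mechanize.get('http://www.bbc.co.uk/food/recipes/search?chefs[]=antonio_carluccio')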
There you have it, a working BBC Food scraper in just 50 lines of code!
If you’ve been following along and running the script yourself, you’ll no doubt be getting some 503 Service Unavailable errors that crash your script. 11,000+ requests is a significant amount of traffic in a short space of time.
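As a rough taster of the idea, here’s a minimal sketch of retrying with an increasing delay. It assumes mechanize surfaces the failed responses as Mechanize::ResponseCodeError, and the helper name is made up for illustration:
# A minimal retry-with-backoff sketch (hypothetical helper, not a full solution)
def fetch_with_retries(mechanize, url, max_retries: 5)
  attempts = 0
  begin
    mechanize.get(url)
  rescue Mechanize::ResponseCodeError => error
    raise unless error.response_code == '503' && attempts < max_retries
    attempts += 1
    sleep(2 ** attempts)  # back off for 2, 4, 8, ... seconds
    retry
  end
end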
Want to learn how to handle 503 Service Unavailable errors and make your mechanize scripts more reliable using exponential backoff? Sign up below.