HOWTO decide which Ruby web scraping library to use

If you need to do some web scraping with Ruby, it can be confusing to know where to start. Which Ruby web scraping library should you use? Mechanize, nokogiri, phantomjs, poltergeist, selenium-webdriver, watir, upton, anemone, spidr, scrubyt, or wombat?

As is usually the case, it depends on your project and your requirements, but here are three simple rules you can follow to make the decision easier.

1. Don’t use anything that isn’t maintained

This rules out the poltergeist gem because it depends on phantomjs, which is no longer maintained. It also rules out older Ruby gems like anemone and scrubyt.

2. Use mechanize if you don’t need JavaScript support

The mechanize gem is probably the easiest Ruby web scraping library to get started with, compared to browser-based solutions like selenium-webdriver (both have non-Ruby dependencies, which is where most installation problems come from). It’s been around for over a decade, so it’s relatively battle-tested. It can be used for crawling websites, filling out forms, and getting data out of web pages, so it’s a good 80/20 solution.

Learn how to scrape websites with Mechanize

3. Use watir if you need JavaScript support

If you need to execute JavaScript or take screenshots, then use the watir gem. It’s built on top of the selenium-webdriver gem, which allows you to switch between different browser engines like Headless Firefox and Headless Chrome. You might even have selenium-webdriver installed already, since it’s the default driver for Rails 5 system tests.

Learn how to scrape websites with Watir