Tutorial: Parallel web scraping with CasperJS and GNU Parallel

A few weeks ago, I had to write a web scraper that processed a long list of similar web pages and pulled data from each page into a single text file. I used CasperJS as the scraping engine, which is fairly simple to set up. My first approach was to launch a single instance of Casper and visit each page sequentially. This worked, but took a long time. What I really wanted was for Casper to visit and scrape multiple pages at the same time. Unfortunately, Casper did not officially support this.

My eventual solution was to run multiple instances of CasperJS simultaneously using the popular GNU Parallel tool. Instead of a single instance of a script processing a list of links, each script instance would only handle one link. GNU Parallel would take care of launching multiple simultaneous instances of my script, each with a different link.

Prerequisites

You’ll need to install PhantomJS globally, and then CasperJS, which I recommend installing locally. Then install the GNU Parallel command-line tool (Mac users can install the parallel package with Homebrew).
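
Before going further, it can help to confirm that all three tools are reachable. The following is a minimal sketch rather than part of the original demo: missing_tools is a hypothetical helper name, and the default CasperJS path is assumed to be ./node_modules/.bin/casperjs, the location used later in this tutorial.

```shell
# Hypothetical helper: print the name of each prerequisite that cannot be found.
# The first argument is the path to the locally installed casperjs launcher
# (assumed here to default to ./node_modules/.bin/casperjs).
missing_tools() {
  casper_bin=${1:-./node_modules/.bin/casperjs}
  # PhantomJS and GNU Parallel are expected to be on the PATH
  for tool in phantomjs parallel; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
    fi
  done
  # CasperJS is expected as an executable file at the given path
  if [ ! -x "$casper_bin" ]; then
    echo "casperjs"
  fi
}
```

Calling missing_tools with no arguments should print nothing when everything is in place; any name it prints still needs to be installed.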

This tutorial will scrape the WebElements website, which contains information about every one of the 118 elements of the periodic table. The scraping script itself will be fairly simple, as the point here is to illustrate how to run multiple instances of it in parallel.

The scraper

Our scraper takes a URL as a command-line argument. It scrapes the title of the page and the first informational paragraph, then writes this data to a text file identified by a unique string. We’ll use the casper.cli module to read arguments from the command line, and PhantomJS’s fs module (note: not NodeJS’s fs module) to write the file.

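// PhantomJS's fs module, not Node's
var fs = require('fs');
var casper = require('casper').create();
var pageTitle; // set once the page has loaded, read again in casper.run

// The first positional command-line argument is the URL to scrape
casper.start(casper.cli.get(0));

casper.then(function() {
    // Keep only the part of the title before the first '»' separator
    var rawTitle = this.getTitle();
    pageTitle = rawTitle.substring(0, rawTitle.indexOf('»'));
    var firstParagraphText = this.fetchText('#main .p_first:first-of-type');
    var data = pageTitle + "\n" + firstParagraphText + "\n";
    // One output file per page, so parallel instances never share a file
    fs.write('scraped_data/' + pageTitle + '.txt', data, 'w');
});

casper.run(function() {
    this.echo("Data saved for " + pageTitle);
    this.exit();
});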

For your scraper, put the bulk of the scraping work between the calls to casper.start and casper.run. If you are writing data to a file, make sure the file name is unique for every instance of your scraper launched. In this case, the pageTitle, which is the name of the element, is used as the unique string for the text file.

You can now test out the scraper by calling it with a URL:

casperjs scraper.js https://www.webelements.com/manganese/

Parallelize the scraper

There are 118 links we want to scrape, one for every element on the periodic table. Put these links in a text file; they do not have to be in any particular order.

https://www.webelements.com/hydrogen/
https://www.webelements.com/helium/
https://www.webelements.com/lithium/
https://www.webelements.com/beryllium/
https://www.webelements.com/boron/
https://www.webelements.com/carbon/
https://www.webelements.com/nitrogen/
https://www.webelements.com/oxygen/
https://www.webelements.com/fluorine/
https://www.webelements.com/neon/
https://www.webelements.com/sodium/
https://www.webelements.com/magnesium/
https://www.webelements.com/aluminium/
https://www.webelements.com/silicon/
https://www.webelements.com/phosphorus/
https://www.webelements.com/sulfur/
https://www.webelements.com/chlorine/
https://www.webelements.com/argon/
https://www.webelements.com/potassium/
https://www.webelements.com/calcium/
https://www.webelements.com/scandium/
https://www.webelements.com/titanium/
https://www.webelements.com/vanadium/
https://www.webelements.com/chromium/
https://www.webelements.com/manganese/
https://www.webelements.com/iron/
https://www.webelements.com/cobalt/
https://www.webelements.com/nickel/
https://www.webelements.com/copper/
https://www.webelements.com/zinc/
https://www.webelements.com/gallium/
https://www.webelements.com/germanium/
https://www.webelements.com/arsenic/
https://www.webelements.com/selenium/
https://www.webelements.com/bromine/
https://www.webelements.com/krypton/
https://www.webelements.com/rubidium/
https://www.webelements.com/strontium/
https://www.webelements.com/yttrium/
https://www.webelements.com/zirconium/
https://www.webelements.com/niobium/
https://www.webelements.com/molybdenum/
https://www.webelements.com/technetium/
https://www.webelements.com/ruthenium/
https://www.webelements.com/rhodium/
https://www.webelements.com/palladium/
https://www.webelements.com/silver/
https://www.webelements.com/cadmium/
https://www.webelements.com/indium/
https://www.webelements.com/tin/
https://www.webelements.com/antimony/
https://www.webelements.com/tellurium/
https://www.webelements.com/iodine/
https://www.webelements.com/xenon/
https://www.webelements.com/caesium/
https://www.webelements.com/barium/
https://www.webelements.com/lutetium/
https://www.webelements.com/hafnium/
https://www.webelements.com/tantalum/
https://www.webelements.com/tungsten/
https://www.webelements.com/rhenium/
https://www.webelements.com/osmium/
https://www.webelements.com/iridium/
https://www.webelements.com/platinum/
https://www.webelements.com/gold/
https://www.webelements.com/mercury/
https://www.webelements.com/thallium/
https://www.webelements.com/lead/
https://www.webelements.com/bismuth/
https://www.webelements.com/polonium/
https://www.webelements.com/astatine/
https://www.webelements.com/radon/
https://www.webelements.com/francium/
https://www.webelements.com/radium/
https://www.webelements.com/lawrencium/
https://www.webelements.com/rutherfordium/
https://www.webelements.com/dubnium/
https://www.webelements.com/seaborgium/
https://www.webelements.com/bohrium/
https://www.webelements.com/hassium/
https://www.webelements.com/meitnerium/
https://www.webelements.com/darmstadtium/
https://www.webelements.com/roentgenium/
https://www.webelements.com/copernicium/
https://www.webelements.com/nihonium/
https://www.webelements.com/flerovium/
https://www.webelements.com/moscovium/
https://www.webelements.com/livermorium/
https://www.webelements.com/tennessine/
https://www.webelements.com/oganesson/
https://www.webelements.com/lanthanum/
https://www.webelements.com/cerium/
https://www.webelements.com/praseodymium/
https://www.webelements.com/neodymium/
https://www.webelements.com/promethium/
https://www.webelements.com/samarium/
https://www.webelements.com/europium/
https://www.webelements.com/gadolinium/
https://www.webelements.com/terbium/
https://www.webelements.com/dysprosium/
https://www.webelements.com/holmium/
https://www.webelements.com/erbium/
https://www.webelements.com/thulium/
https://www.webelements.com/ytterbium/
https://www.webelements.com/actinium/
https://www.webelements.com/thorium/
https://www.webelements.com/protactinium/
https://www.webelements.com/uranium/
https://www.webelements.com/neptunium/
https://www.webelements.com/plutonium/
https://www.webelements.com/americium/
https://www.webelements.com/curium/
https://www.webelements.com/berkelium/
https://www.webelements.com/californium/
https://www.webelements.com/einsteinium/
https://www.webelements.com/fermium/
https://www.webelements.com/mendelevium/
https://www.webelements.com/nobelium/
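
Typing 118 URLs by hand is error-prone. Since every link above follows the same pattern, the file can be generated from a list of element names. Here is a rough sketch, assuming a hypothetical names.txt with one element name per line; make_links is an illustrative helper and not part of the original demo.

```shell
# Hypothetical helper: read element names (one per line) on stdin and
# print the matching WebElements URL for each, skipping blank lines.
make_links() {
  # The extra test keeps a final line that has no trailing newline
  while IFS= read -r name || [ -n "$name" ]; do
    if [ -n "$name" ]; then
      # The URLs in the list use lowercase names
      slug=$(printf '%s' "$name" | tr '[:upper:]' '[:lower:]')
      printf 'https://www.webelements.com/%s/\n' "$slug"
    fi
  done
}
```

For example, make_links < names.txt > links.txt writes the link file, and sort links.txt | uniq -d should then print nothing if no link appears twice.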

The idea is to use Parallel to read the text file line by line, and for each line, launch an instance of the scraper, passing it a URL. This is done with one line in Bash:

parallel -a links.txt ./node_modules/.bin/casperjs scraper.js

The -a flag tells Parallel to read the given text file line by line and append each line as the last argument to ./node_modules/.bin/casperjs scraper.js, so every job scrapes exactly one URL. You can see a visual progress bar by adding --bar as a flag to parallel.
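
When scraping a real site, you may want to cap how many instances run at once and pause briefly between launches. The sketch below wraps the command in a function. The name scrape_all and its argument order are hypothetical, while -j (maximum simultaneous jobs), --delay (seconds to wait between job starts) and --bar are GNU Parallel options. The defaults of 4 jobs and 0.5 seconds are arbitrary starting points, not values from the original demo.

```shell
# Hypothetical wrapper around the one-line parallel command.
# Usage: scrape_all LINKS_FILE [MAX_JOBS] [DELAY_SECONDS]
scrape_all() {
  links_file=$1
  max_jobs=${2:-4}
  delay=${3:-0.5}
  # Creating the output directory up front is harmless if the scraper
  # would have created it anyway
  mkdir -p scraped_data
  parallel -j "$max_jobs" --delay "$delay" --bar -a "$links_file" \
    ./node_modules/.bin/casperjs scraper.js
}
```

Calling scrape_all links.txt 8 0.2 would then run up to eight scraper instances at a time, starting them 0.2 seconds apart.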

Gather the files

Each instance of the scraper produces its own output file. To join all these output files into one (the shell expands the glob in alphabetical order by file name, so the result will not follow periodic-table order):

cat scraped_data/*.txt > elements.txt
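
If an instance crashed before writing its file, or two pages happened to share a title, there will be fewer output files than links. A minimal sketch of a check to run before combining them follows; gather is a hypothetical helper, and it assumes every link should produce exactly one .txt file directly inside the output directory.

```shell
# Hypothetical helper: warn if the file count looks wrong, then concatenate.
# Usage: gather LINKS_FILE OUTPUT_DIR COMBINED_FILE
gather() {
  links_file=$1
  out_dir=$2
  combined=$3
  # Count non-empty lines in the link list
  expected=$(grep -c . "$links_file")
  # Count the .txt files directly inside the output directory
  actual=$(find "$out_dir" -maxdepth 1 -name '*.txt' | wc -l | tr -d ' ')
  if [ "$actual" -ne "$expected" ]; then
    echo "warning: expected $expected files, found $actual" >&2
  fi
  # The glob expands in sorted file-name order
  cat "$out_dir"/*.txt > "$combined"
}
```

Running gather links.txt scraped_data elements.txt produces the same combined file as the cat command, with a warning on standard error when the counts differ.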

Full source code

A full, working demo of the parallel scraping technique discussed in this post is available at https://github.com/g-liu/parallel-scraping.