Ghost in the Web: Scraping with Phantom and Casper

Getting involved in startup projects and the open data movement in Montreal, I began playing with scraping technologies to crawl the web and to sanitize and structure data. In this article, we'll see how to use CasperJS to fetch and save data. We'll also walk through an example script that uses Tor for anonymity.

Scraping

Before we plunge into Casper, let's talk a bit about scraping in general. Web scraping is the technique of crawling a website or web service's content in order to extract data from it. It can be as simple as using wget to crawl a particular domain and download all the PDFs (ah, the undergrad days), or it can involve making multiple requests to a lookup engine and saving the relevant data displayed on each page. Be aware that web scraping can be harmful: done carelessly, it can overload the target with requests and effectively become a denial-of-service attack. Furthermore, certain web services prohibit the use of automated crawlers on their website.
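For the curious, that PDF crawl could look something like this (a sketch using standard wget options; example.com stands in for the real domain):

wget -r -l 2 -A pdf -nd -np http://example.com/

Here -r and -l 2 recurse two levels deep, -A pdf keeps only PDF files, -nd saves everything into the current directory, and -np keeps wget from climbing above the starting path.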

Uses of scraping

Scraping is all about amassing data (sometimes in large amounts) from one or more sources in order to present it in another fashion. As an example, PadMapper is a web application that crawls Craigslist and displays places for rent on a convenient map. In the case of my startup project, I'm using multiple sources of information in order to aggregate relevant data. In public / civic projects, the focus is often on analytics: we herd all the necessary data into a database and perform statistical analysis on it to answer questions such as "which areas of the city have had the most bedbugs?" or "what are the top 10 restaurants that were fined the most by the health department in the last 5 years?". Analysis is often done using MapReduce, which is also a fascinating topic that we'll tackle another time.

The problem with JavaScript

Many scraping technologies exist. Most notably, I remember using BeautifulSoup, and I've heard Scrapy is awesome too (if you're a Pythonista looking for scraping tricks, Montreal Python is a vibrant community and they're getting involved in very cool scraping projects). You can also use wget to download whole pages that you can parse with scripts later. The issue with most of those tools is that some pages are JavaScript-heavy and expect a full browser to be present. Without a JavaScript engine, those pages will simply not render correctly and you can't get to the data you need.

Enter PhantomJS

PhantomJS is a headless WebKit browser that you can control via JavaScript. It behaves like most browsers and lets you deal with JavaScript, CSS, the DOM, and SVG.

You install it globally via npm:

npm install -g phantomjs

You then write a script, something like this (adapted from the PhantomJS page; we define the url variable and exit explicitly so the script terminates):

var url = 'http://phantomjs.org/';
var page = require('webpage').create();
page.onConsoleMessage = function (msg) {
    console.log('Page title is ' + msg);
};
page.open(url, function (status) {
    page.evaluate(function () {
        console.log(document.title);
    });
    phantom.exit();
});

And then you run it like this:

phantomjs script.js

Presto! You have the result in the console.

That’s nice, but let’s check out CasperJS

CasperJS is another tool that adds a couple of nice things on top of either PhantomJS or SlimerJS (SlimerJS is a project similar to PhantomJS, but built on Gecko, Firefox's engine, instead of WebKit).
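As an aside, if you would rather scrape with Gecko, newer versions of CasperJS (1.1 and up) let you pick the engine from the command line, assuming SlimerJS is installed:

casperjs --engine=slimerjs script.js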

Let's install the latest version from source (note that you need phantomjs or slimerjs to be installed first):

$ git clone git://github.com/n1k0/casperjs.git
$ cd casperjs
$ ln -sf `pwd`/bin/casperjs /usr/local/bin/casperjs
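You can check that the symlink works by asking Casper for its version (a quick sanity check; the output depends on the revision you cloned):

$ casperjs --version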

The first thing that's interesting about CasperJS (for me at least) is that it supports CoffeeScript out of the box. It also bundles all the cool things about PhantomJS, such as CSS selectors, running JavaScript and taking screenshots, into a very friendly DSL.

Behold (straight from the documentation):

getLinks = ->
  links = document.querySelectorAll "h3.r a"
  Array::map.call links, (e) -> e.getAttribute "href"

links = []
casper = require('casper').create()

casper.start "http://google.fr/", ->
  # search for 'casperjs' from google form
  @fill "form[action='/search']", q: "casperjs", true

casper.then ->
  # aggregate results for the 'casperjs' search
  links = @evaluate getLinks
  # search for 'phantomjs' from google form
  @fill "form[action='/search']", q: "phantomjs", true

casper.then ->
  # concat results for the 'phantomjs' search
  links = links.concat @evaluate(getLinks)

casper.run ->
  # display results
  @echo links.length + " links found:"
  @echo(" - " + links.join("\n - ")).exit()

That’s a good start, but let’s push it a little further.

Using CasperJS

CasperJS is a remarkable tool for testing, and I recommend reading its documentation. In this article, however, we'll focus on common scraping tasks and patterns, since those are not as well documented.

Taking screenshots

One of the cool features of PhantomJS and CasperJS is taking pictures. Although the browser itself is headless, you can ask PhantomJS to render a page and save it as an image file. This is often used for testing CSS rendering, but it can also be a good way to capture images or maps generated on the fly.

Here’s how it works:

casper = require('casper').create()
casper.start 'http://www.google.com/images?q=ghost', ->
  @capture 'google-ghosts.png'
casper.run()

So we do a quick Google Images search for a picture of a ghost, then we render the page and take a snapshot. Here's what you get:

[Screenshot: the full Google Images results page, as captured by Casper]

Neat, but what if you wanted to capture just a portion of the page, say, the images? Well, CasperJS allows you to limit your capture using CSS selectors:

casper = require('casper').create()
casper.start 'http://www.google.com/images?q=ghost', ->
  @captureSelector 'google-ghosts.png', '#search'
casper.run()

Much better: we limit our capture to Google's div with id 'search' (note the CSS selector syntax).

[Screenshot: just the search results, no search bar]
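One thing to keep in mind: the headless browser starts with a small default viewport, which can make full-page captures look narrow. If that happens, you can set the viewport size yourself when creating the Casper instance (a minimal sketch using the documented viewportSize option; the dimensions here are arbitrary):

casper = require('casper').create
  viewportSize:
    width: 1280
    height: 1024

casper.start 'http://www.google.com/images?q=ghost', ->
  @capture 'google-ghosts-wide.png'
casper.run()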

Downloading files

CasperJS can also download files, which can be very useful when you're scheduling scripts to fetch the latest data from a given source. Here's a quick example using the recently released SEAO data set.

casper = require('casper').create()
url = 'http://www.donnees.gouv.qc.ca/?node=/donnees-details&id=542483bf-3ea2-4074-b33c-34828f783995'
casper.start url, ->
  # let's say we only want the 2013 data links
  links = @getElementsAttribute "a[href*='2013']", 'href'
  # we use a regex match to grab the date in each file's URL and name the download after it
  @download(link, "#{link.match(/2013\d+/)[0]}.zip") for link in links
casper.run()

Granted, we could have just used wget for that one, but I figured it was a simple enough example.
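For reference, a wget equivalent would look roughly like this (a sketch only; --accept-regex requires wget 1.14+, and the URL is the same data set page as in the script):

wget -r -l 1 -nd -A zip --accept-regex 2013 'http://www.donnees.gouv.qc.ca/?node=/donnees-details&id=542483bf-3ea2-4074-b33c-34828f783995'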

Extracting data

And last but certainly not least, CasperJS allows you to extract text from the page using CSS selectors, which is the most common way to scrape data. Let's take the example of scraping a public LinkedIn profile.

casper = require('casper').create()
url = "http://au.linkedin.com/pub/saul-goodman/42/382/563"
casper.start url, ->
  # the regex strips surrounding whitespace
  firstName = @getHTML('span.given-name').replace /^\s+|\s+$/g, ""
  lastName = @getHTML('span.family-name').replace /^\s+|\s+$/g, ""
  location = @getHTML('span.locality').replace /^\s+|\s+$/g, ""
  industry = @getHTML('dd.industry').replace /^\s+|\s+$/g, ""
  console.log "We got #{firstName} #{lastName}, from #{location}, who works in #{industry}"
casper.run()

Got him!
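As an aside, that whitespace-stripping regex gets repeated four times; if that bothers you, it can be factored into a tiny helper (a hypothetical clean function, sketched here with just the first two fields):

# hypothetical helper: strip leading/trailing whitespace from a scraped string
clean = (s) -> s.replace /^\s+|\s+$/g, ""

casper = require('casper').create()
casper.start "http://au.linkedin.com/pub/saul-goodman/42/382/563", ->
  firstName = clean @getHTML('span.given-name')
  lastName = clean @getHTML('span.family-name')
  console.log "We got #{firstName} #{lastName}"
casper.run()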

Full scraper example

Now let's make a more complete scraper example. Let's say we want to look up a LinkedIn profile again, and crawl through a few of the person's possible associates while gathering data.

casper = require('casper').create()
fs = require('fs')

url = 'http://www.linkedin.com/pub/gustavo-fring/61/6a/281'
suspects = []

# we start with Gus and grab the links to his possible associates
casper.start url, ->
  links = @getElementsAttribute "strong > a", 'href'
  getAssociate(link, @) for link in links

# we run the job, then write the report to a file
casper.run ->
  html = '<table><tr><td>Mugshot</td><td>Name</td><td>Location</td><td>Industry</td></tr>'
  html += generateRow(suspect) for suspect in suspects
  html += '</table>'
  fs.write 'suspect.html', html, 'w'
  @echo("\nexecution terminated\n").exit()

# helper function to generate a table row for one suspect
generateRow = (suspect) ->
  """
    <tr>
      <td>
        <img src="#{suspect.lastName}.png">
      </td>
      <td>#{suspect.firstName} #{suspect.lastName}</td>
      <td>#{suspect.location}</td>
      <td>#{suspect.industry}</td>
    </tr>
  """

# helper function: visits an associate's profile and records their info
getAssociate = (link, doc) ->
  casper.thenOpen link, ->
    associate = getContactInfo(doc)
    suspects.push associate

# helper function: takes a picture and creates a JS object with the extracted data
getContactInfo = (doc) ->
  contact =
    firstName: doc.getHTML('span.given-name').replace /^\s+|\s+$/g, ""
    lastName: doc.getHTML('span.family-name').replace /^\s+|\s+$/g, ""
    location: doc.getHTML('span.locality').replace /^\s+|\s+$/g, ""
    industry: doc.getHTML('dd.industry').replace /^\s+|\s+$/g, ""
  doc.captureSelector "#{contact.lastName}.png", 'div.image.zoomable > img.photo'
  contact

Let’s step through.

casper.start url, ->
  links = @getElementsAttribute "strong > a", 'href'
  getAssociate(link, @) for link in links

First we start Casper by giving it the main LinkedIn page to look at, and we extract all the relevant links. In our case, links to other profiles are usually within a strong tag, so that’s what we use as a selector.

Then, we call getAssociate, which looks like this:

getAssociate = (link, doc) ->
  casper.thenOpen link, ->
    associate = getContactInfo(doc)
    suspects.push associate

A simple function that takes a link and visits it with Casper, before extracting the contact info and pushing the corresponding object onto our list of suspects. Here's getContactInfo:

getContactInfo = (doc) ->
  contact =
    firstName: doc.getHTML('span.given-name').replace /^\s+|\s+$/g, ""
    lastName: doc.getHTML('span.family-name').replace /^\s+|\s+$/g, ""
    location: doc.getHTML('span.locality').replace /^\s+|\s+$/g, ""
    industry: doc.getHTML('dd.industry').replace /^\s+|\s+$/g, ""
  doc.captureSelector "#{contact.lastName}.png", 'div.image.zoomable > img.photo'
  contact

The only notable difference from the previous example is that we also use captureSelector to grab a snapshot of the profile picture.

Finally, we supply a callback to casper.run in order to do post-processing steps.

casper.run ->
  html = '<table><tr><td>Mugshot</td><td>Name</td><td>Location</td><td>Industry</td></tr>'
  html += generateRow(suspect) for suspect in suspects
  html += '</table>'
  fs.write 'suspect.html', html, 'w'
  @echo("\nexecution terminated\n").exit()

In this case, we generate a little HTML file with our gathered data, which looks like this:

[Screenshot: the generated suspect.html table, with mugshots, names, locations and industries]

Using Tor

Hold on, crawling all those profiles of suspicious characters may not be a good idea after all. Perhaps we should at least cover our tracks a little by running our script through Tor. Tor is an interesting service that relays your requests through a network of proxies in order to anonymize traffic. It certainly has its pitfalls, but those are not the focus of this article.

On a Mac, you can install Tor via Homebrew (no sudo needed; Homebrew is designed to run as your own user):

brew install tor

You then start Tor with the tor command. You should see a line that looks like this:

[notice] Opening Socks listener on 127.0.0.1:9050

This is the proxy you want to use. All you need to do is pass it to your casperjs call, along with the proxy type, and you're good:

casperjs --proxy=127.0.0.1:9050 --proxy-type=socks5 scraper-hermanos.coffee
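To convince yourself that requests really go through Tor, a quick check is to point a trivial Casper script at the Tor Project's check page and read back the verdict (a sketch that assumes the page still reports its status in an h1 heading):

casper = require('casper').create()
casper.start 'https://check.torproject.org/', ->
  # should print something like "Congratulations. This browser is configured to use Tor."
  @echo @fetchText('h1')
casper.run()

Run it once with the --proxy flags above and once without, and compare the output.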

There you have it! Have fun scraping the web and creating interesting data mashups, and make sure to check the terms of service of the data sources you'll be using!
