Ghost in the Web: Scraping with Phantom and Casper

Getting involved in startup projects and the open data movement in Montreal, I began playing with scraping technologies to crawl the web and to sanitize and structure data. In this article, we'll see how to use CasperJS to fetch and save data. We'll also demo a script that uses Tor for anonymity.

Scraping

Before we plunge into Casper, let's talk a bit about scraping in general. Web scraping is the technique of crawling a website or web service in order to extract data from it. It can be as simple as using wget to crawl a particular domain and download all the PDFs (ah, the undergrad days), or it can involve making multiple requests to a lookup engine and saving the relevant data displayed on each page. Be aware that web scraping can be harmful: done carelessly, it can overload the target with requests and effectively become a denial-of-service attack. Furthermore, certain web services prohibit the use of automated crawlers on their website.

Uses of scraping

Scraping is all about amassing data (sometimes a large amount), from one or several sources, in order to present it in another fashion. As an example, Padmapper is a web application that crawls Craigslist and displays places for rent on a convenient map. In the case of my startup project, I'm using multiple sources of information in order to aggregate relevant data. In public / civic projects, the focus is often on analytics: we herd all the necessary data into a database and perform statistical analysis on it to answer questions such as "which areas of the city have had the most bedbugs?" or "which 10 restaurants were fined the most by the health department in the last 5 years?". Analysis is often done using map reduce, which is also a fascinating topic that we'll tackle another time.

The problem with Javascript

Many scraping technologies exist. Most notably, I remember using BeautifulSoup, and I've heard Scrapy is awesome too (if you're a Pythonista looking for scraping tricks, Montreal Python is a vibrant community and they're getting involved in very cool scraping projects). You can also use wget to download whole pages that you can parse with scripts later. The issue with most of those tools is that some pages are javascript-heavy and expect a full browser to be present. Without a javascript engine, those pages will simply not render correctly and you can't get to the data you need.

Enter PhantomJS

PhantomJS is a headless WebKit browser that you can control via Javascript. It behaves like most browsers and lets you deal with Javascript, CSS, the DOM and SVG.

You install it globally via npm:

npm install -g phantomjs

You write a script, something like this (adapted from the PhantomJS site; note that you need to supply the url yourself and call phantom.exit() when you're done):

var page = require('webpage').create();
var url = 'http://phantomjs.org/';
page.onConsoleMessage = function (msg) {
    console.log('Page title is ' + msg);
};
page.open(url, function (status) {
    page.evaluate(function () {
        console.log(document.title);
    });
    phantom.exit();
});

And then you run it like this:

phantomjs script.js

Presto! You have the result in the console.

That’s nice, but let’s check out CasperJS

CasperJS is another tool that adds a couple of nice things on top of either PhantomJS or SlimerJS (a similar project to PhantomJS, but one that uses the Gecko engine).

Let’s install the latest version (please note that you need phantomjs or slimerjs to be installed):

$ git clone git://github.com/n1k0/casperjs.git
$ cd casperjs
$ ln -sf `pwd`/bin/casperjs /usr/local/bin/casperjs

The first thing that's interesting about CasperJS (for me at least) is that it supports CoffeeScript out of the box. It also bundles all the cool things about PhantomJS, such as CSS selectors, running javascript and taking screenshots, into a very friendly DSL.

Behold (straight from their documentation).

getLinks = ->
  links = document.querySelectorAll "h3.r a"
  Array::map.call links, (e) -> e.getAttribute "href"

links = []
casper = require('casper').create()

casper.start "http://google.fr/", ->
  # search for 'casperjs' from google form
  @fill "form[action='/search']", q: "casperjs", true

casper.then ->
  # aggregate results for the 'casperjs' search
  links = @evaluate getLinks
  # search for 'phantomjs' from google form
  @fill "form[action='/search']", q: "phantomjs", true

casper.then ->
  # concat results for the 'phantomjs' search
  links = links.concat @evaluate(getLinks)

casper.run ->
  # display results
  @echo links.length + " links found:"
  @echo(" - " + links.join("\n - ")).exit()

That’s a good start, but let’s push it a little further.

Using CasperJS

CasperJS is a remarkable tool for testing, and I recommend reading their documentation. In this article, however, we'll focus on common scraping tasks and patterns, since those are not as well documented.

Taking screenshots

One of the cool features of PhantomJS and CasperJS is taking screenshots. Although the browser itself is headless, it is possible to ask PhantomJS to render a page and save it as an image file. This is often used for testing CSS rendering, but it can also be a good way to capture images or maps generated on the fly.

Here’s how it works:

casper = require('casper').create()
casper.start 'http://www.google.com/images?q=ghost', ()->
    @.capture 'google-ghosts.png'
casper.run()

So we do a quick google image search for a picture of a ghost, then we render and take a snapshot. Here’s what you get:

screenshot captured by Casper

Neat, but what if you wanted to get just a portion of the page, say, the images? Well, CasperJS allows you to limit your capture using CSS selectors:

casper = require('casper').create()
casper.start 'http://www.google.com/images?q=ghost', ()->
    @.captureSelector 'google-ghosts.png', '#search'
casper.run()

Much better: we limit our capture to Google's div with id 'search' (note the CSS selector syntax).

no searchbar!

Downloading files

CasperJS can also download files, which is very useful when you're scheduling scripts to fetch the latest data from a given source. Here's a quick example using the recently released SEAO data set.

casper = require('casper').create()
url ='http://www.donnees.gouv.qc.ca/?node=/donnees-details&id=542483bf-3ea2-4074-b33c-34828f783995'
casper.start url, ()->
  #let's say we only want 2013 data links
  links = @.getElementsAttribute("a[href*='2013']", 'href')
  #we also use a coffeescript matcher to grab the date of the file to name it
  @download(link, "#{link.match(/2013\d+/)[0]}.zip") for link in links
casper.run()

Granted, we could have just used wget for that one, but I figured it was a simple enough example.

Extracting data

And last but certainly not least, CasperJS allows you to capture text on the page using CSS selectors, which is the most common way to scrape data. Let's take the example of scraping a public LinkedIn profile.

casper = require('casper').create()
url = "http://au.linkedin.com/pub/saul-goodman/42/382/563"
casper.start url, ()->
  #The regex is used to remove whitespace
  firstName = @.getHTML('span.given-name').replace /^\s+|\s+$/g, ""
  lastName = @.getHTML('span.family-name').replace /^\s+|\s+$/g, ""
  location = @.getHTML('span.locality').replace /^\s+|\s+$/g, ""
  industry = @.getHTML('dd.industry').replace /^\s+|\s+$/g, ""
  console.log "We got #{firstName} #{lastName}, from #{location} that works in #{industry}"
casper.run()

Got him!
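Incidentally, that whitespace-stripping regex is nothing more than a trim; plain javascript's String.prototype.trim does the same job, which is easy to verify:

```javascript
// The /^\s+|\s+$/g replace used above is equivalent to the native trim().
var raw = "  \n  Saul Goodman  \n";
var viaRegex = raw.replace(/^\s+|\s+$/g, "");
var viaTrim = raw.trim();
console.log(viaRegex === viaTrim); // true
```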

Full scraper example

Now let’s make a more complete scraper example. Let’s say we want to look up a profile on LinkedIn again, and crawl through a few of their possible associates while gathering data.

casper = require('casper').create()
fs = require('fs')

url = 'http://www.linkedin.com/pub/gustavo-fring/61/6a/281'
suspects = []

#We start with Gus
casper.start url, ()->
  links = @.getElementsAttribute("strong > a", 'href')
  getAssociate(link, @) for link in links

#We run the job, then we write to file
casper.run ->
  html = '<table><tr><td>Mugshot</td><td>Name</td><td>Location</td><td>Industry</td></tr>'
  html = (html + generateRow(suspect)) for suspect in suspects
  html += '</table>'
  fs.write 'suspect.html', html, 'w'
  @echo("\nexecution terminated\n").exit()

#helper function to generate a row
generateRow = (suspect)->
  data = """
    <tr>
      <td>
        <img src="#{suspect.lastName}.png">
      </td>
      <td>#{suspect.firstName} #{suspect.lastName}</td>
      <td>#{suspect.location}</td>
      <td>#{suspect.industry}</td>
    </tr>
  """

#helper function, grabs all the associate links in the page
getAssociate = (link, doc) ->
  casper.thenOpen link, ->
    associate = getContactInfo(doc)
    suspects.push associate

#helper function, takes a picture and creates a JS object with extracted data
getContactInfo = (doc)->
  contact = 
    firstName: doc.getHTML('span.given-name').replace /^\s+|\s+$/g, ""
    lastName: doc.getHTML('span.family-name').replace /^\s+|\s+$/g, ""
    location: doc.getHTML('span.locality').replace /^\s+|\s+$/g, ""
    industry: doc.getHTML('dd.industry').replace /^\s+|\s+$/g, ""
  doc.captureSelector "#{contact.lastName}.png", 'div.image.zoomable > img.photo'
  contact

Let’s step through.

casper.start url, ()->
  links = @.getElementsAttribute("strong > a", 'href')
  getAssociate(link, @) for link in links

First we start Casper by giving it the main LinkedIn page to look at, and we extract all the relevant links. In our case, links to other profiles are usually within a strong tag, so that’s what we use as a selector.

Then, we call getAssociate which looks like…

getAssociate = (link, doc) ->
  casper.thenOpen link, ->
    associate = getContactInfo(doc)
    suspects.push associate

A simple function that takes a link and visits it with casper, before extracting contact info and pushing the corresponding object in our list of suspects. Here’s getContactInfo:

getContactInfo = (doc)->
  contact = 
    firstName: doc.getHTML('span.given-name').replace /^\s+|\s+$/g, ""
    lastName: doc.getHTML('span.family-name').replace /^\s+|\s+$/g, ""
    location: doc.getHTML('span.locality').replace /^\s+|\s+$/g, ""
    industry: doc.getHTML('dd.industry').replace /^\s+|\s+$/g, ""
  doc.captureSelector "#{contact.lastName}.png", 'div.image.zoomable > img.photo'
  contact

The only notable difference with the previous example is that we also use captureSelector to go grab a snapshot of the profile picture.

Finally, we supply a callback to casper.run in order to do post-processing steps.

casper.run ->
  html = '<table><tr><td>Mugshot</td><td>Name</td><td>Location</td><td>Industry</td></tr>'
  html = (html + generateRow(suspect)) for suspect in suspects
  html += '</table>'
  fs.write 'suspect.html', html, 'w'
  @echo("\nexecution terminated\n").exit()

In this case, we generate a little HTML file with our gathered data, which looks like this:

screenshot of the generated suspects table

Using Tor

Hold on, maybe crawling all those profiles of suspicious characters isn't such a good idea after all. Perhaps we should at least cover our tracks a little by running our script through Tor. Tor is an interesting service that relays your requests through a network of proxies in order to anonymize traffic. It certainly has its pitfalls, but that's not the focus of this article.

On a Mac, you can install Tor via Homebrew:

brew install tor

You then start Tor with the tor command. You should see a line that looks like this:

[notice] Opening Socks listener on 127.0.0.1:9050

This is the proxy you want to use. All you need to do is pass it in your casper call, along with the proxy type, and you're good:

casperjs --proxy=127.0.0.1:9050 --proxy-type=socks5 scraper-hermanos.coffee

There you have it! Have fun scraping the web and creating interesting data mashups. Make sure to check the terms of service of the data sources you'll be using!

A date with Jasmine

Working with AngularJS has been a great opportunity to really dive into Test-Driven Development. Testing with Karma (formerly Testacular) allows you to do two types of tests: end-to-end (e2e) tests using the Angular Scenario Runner, and unit tests using Jasmine. Today's article is about a little challenge I faced while running unit tests on an Angular controller.

Jasmine

Jasmine is a behaviour-driven framework for testing javascript. It plays nicely with Karma, which also has a CoffeeScript preprocessor. What's really fun with Jasmine is that it lets you write your tests in very natural language, which is very helpful in a test-driven / behaviour-driven approach. Think of it as writing rough specifications for your software. Say you're writing down something like:

My Calculator

  • it should be able to add, so I expect adding 1 and 1 to give me 2.
  • it should be able to subtract, so I expect subtracting 1 from 2 to give me 1.

In Jasmine, you’d end up with:

#Just a dummy calculator class
Calculator = 
    add: (num1, num2) -> num1 + num2
    sub: (num1, num2) -> num1 - num2

#Actual jasmine specs below

describe 'My Calculator', -> 

    it "should be able to add", -> 
        expect(Calculator.add(1, 1)).toEqual(2)

    it "should be able to subtract", ->
        expect(Calculator.sub(2, 1)).toEqual(1)

Here’s the JSFiddle in case you want to experiment.

Whenever I sit down to code, I find it really helpful to scribble down my specs to figure out exactly what I’m trying to accomplish. It’s a good mental exercise and if I use jasmine to do it, it gives me unit tests to boot!

Unit testing

Now that we know how to use Jasmine, we can do some unit testing! When you unit-test, you attempt to isolate small functionalities of your code and test them. That’s really easy when you’re dealing with code like our Calculator above, but when you’re interacting with multiple systems like a typical single-page app, things get more complicated. Thankfully, AngularJS is really helpful with mock objects that allow you to fake http requests, which will be the most common thing you’ll need to mock. Here’s what it looks like:

$httpBackend.whenGET('/api/users/1').respond(username: 'alex')

Calling external services is not your only problem, however: there are other contextual system calls that also need to be mocked. Consider the following:

AgeCounter = 
    getAge: (birthDay) ->
        birthDay = new Date(birthDay)
        currentDate = new Date()
        #Naive implementation
        currentDate.getYear() - birthDay.getYear()

#Specs

describe "My naive age counter", -> 

    it "it should be able to get me the age", -> 
        expect(AgeCounter.getAge('1980-01-02')).toEqual(33)

Since getAge uses the current time to calculate a person's age, you have to hard-code the expected value, which won't be good anymore next year. Of course, you could do the following:

    it "it should be able to get me the age", -> 
        age = (new Date()).getYear() - (new Date('1980-01-02')).getYear()
        expect(AgeCounter.getAge('1980-01-02')).toEqual(age)

But then you’re replicating your AgeCounter logic in your test. What’s a codemonkey to do?
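As an aside, getYear() is deprecated: it returns the year minus 1900, with getFullYear() as the modern replacement. The subtraction in getAge still works only because the 1900 offset cancels out:

```javascript
// getYear() is a deprecated Annex B method that returns (year - 1900);
// in a difference of two getYear() calls, the offsets cancel out.
var d = new Date(2013, 0, 2); // January 2nd, 2013
console.log(d.getYear());     // 113
console.log(d.getFullYear()); // 2013
```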

Trying out monkeypatching

My first thought was to try monkeypatching Date. Monkeypatching consists of redefining or extending objects at runtime, which you can do in dynamic, prototype-based languages like Javascript. While it's common to extend objects for mocking or to get extra functionality, it is a little less common to redefine base objects like Date. It can, however, be very useful for testing. Behold:

fixedDate = new Date('1999-01-02');
Date = function () {
    return fixedDate;
};
today = new Date();
console.log(today);

And in our JSFiddle:

the console prints the fixed 1999 date

Party like it’s 1999!

And in coffeescript

The same snippet in coffeescript, however, behaves quite differently: the monkeypatch never touches the native Date. Wait a minute, what's going on here? Let's see what the coffeescript compiles to:

// Generated by CoffeeScript 1.5.0
(function() {
  var Date, fixedDate, today;
  fixedDate = new Date('1999-01-02');
  Date = function() {
    return fixedDate;
  };
  today = new Date();
  console.log(today);
}).call(this);

Hold up, it declares a local Date variable instead of overriding the native object? And since the var declaration is hoisted to the top of the wrapper, even the initial new Date('1999-01-02') call no longer refers to the native constructor. While I could have pressed on, I felt like taking a step back to look for a cleaner way to handle stubbing Date.
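For the record, the monkeypatch can be forced to work from coffeescript by assigning to the global object explicitly (window in a browser, global or globalThis in node), since that bypasses the local variable declaration. A javascript sketch of the idea:

```javascript
// CoffeeScript wraps everything in a function and declares Date locally,
// so a bare `Date = ...` shadows the global instead of replacing it.
// Assigning through the global object sidesteps the local declaration.
var RealDate = Date;
var fixedDate = new RealDate('1999-01-02');
globalThis.Date = function () { return fixedDate; }; // window.Date in a browser
var today = new Date(); // `new` on a function returning an object yields that object
console.log(today === fixedDate); // true
globalThis.Date = RealDate; // always restore afterwards
```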

What about some spying?

Turns out that Jasmine has a really handy method called spyOn, which lets you listen in on any function call. You can then test that it's been called, let it go through, or return your own mocked data. It is often used to stub services such as http calls when you don't have the fancy Angular mock objects, but it can also work with constructors of common javascript objects like Date (in that case, just bind yourself to the window object and listen for Date).

So here’s what it looks like:

AgeCounter = 
    getAge: (birthDay) ->
        birthDay = new Date(birthDay)
        currentDate = new Date()
        #Naive implementation
        currentDate.getYear() - birthDay.getYear()

#Specs

describe "My naive age counter", -> 

    it "it should be able to get me the age", ->
        oldDate = Date

        spyOn(window, 'Date').andCallFake (params)->
            if params? then new oldDate(params)
            else 
                new oldDate('1999-01-02')

        expect(AgeCounter.getAge('1980-01-02')).toEqual(19)

And here’s the fiddle in case you wanna test it.

Essentially, we listen in on the 'Date' method of the window object and substitute our own function when it's called. Our fake then checks whether there's a parameter, in which case it calls the old Date constructor; otherwise it returns a fixed date.

Some notes

  • we have to save the reference to the old Date constructor, otherwise the code in our fake call would trigger our spy itself,
  • in retrospect, things would have been much simpler if I spied on Date.now().
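To expand on that second note: if getAge were written against Date.now() instead of new Date(), the stub becomes a one-liner. The variant below is a hypothetical rewrite, not the code from the article:

```javascript
// Hypothetical getAge built on Date.now(), which is trivial to stub in tests.
var AgeCounter = {
  getAge: function (birthDay) {
    var ms = Date.now() - new Date(birthDay).getTime();
    return Math.floor(ms / (365.25 * 24 * 3600 * 1000)); // average year length
  }
};

var realNow = Date.now;
Date.now = function () { return new Date('1999-01-02').getTime(); };
console.log(AgeCounter.getAge('1980-01-02')); // 19
Date.now = realNow; // restore
```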

So mock on, and if you’re new to Jasmine, don’t be afraid to try it out! Have some cool testing methods to share? Are you using a different framework? Leave a comment and we can talk about it!

building crypto-chrome at hackmtl

Last week was #hackmtl, a javascript/chrome-extension focused hackathon in Montreal. Having heard of it from #MTLStartupTalent, I immediately signed up. My friend @jpcaissy joined right after, and we badgered Louis B. Varin (a fellow UQAMite) into tagging along for his first hackathon.

About #hackmtl

#hackmtl was the second edition of the event, organised by #MTLStartupTalent with the help of PasswordBox. The central themes of the day were Chrome extensions, javascript and security. The hackathon lasted 9 hours (from 9 am to 6 pm on Saturday), but people were allowed to start their project on Friday after announcing what they were going to work on. Projects were to be webapps or extensions, built in teams of 4 max. Most of Saturday's hacking happened on the very top floors of the iconic Olympic Stadium Tower in Montreal.

credit goes to @jpcaissy

Epic.

Here's a more in-depth article about the event.

Brainstorming the idea

Teams got together on Friday to brainstorm ideas for projects to develop during the hackathon. Since we already had our 3-man team, we wasted no time brainstorming. Given the current state of affairs with the NSA, PRISM, and secure email providers like Lavabit and Silent Circle closing their doors, we thought it would be useful to focus on web security and privacy. @jpcaissy thought it would be a very opportune time to revisit my past efforts with in-browser cryptography (which I had abandoned due to the state of crypto in javascript at the time) and perhaps build an extension to allow PGP right in GMail. Good enough: we had some beers and figured that'd be a good starting point for Saturday.

Setting up the project

Saturday morning, before we were allowed up the tower, I started setting up the project structure. Reasoning that a well-structured project with a good build process would give us good momentum and a healthy workflow, I set up the following:

Project architecture

Here’s what the project structure looks like on the filesystem:

project layout

We have separated our source files, our vendor files, and the finished product files, which are saved in the dist folder.

Chrome extension architecture

Chrome extensions can have multiple components, such as background pages, content scripts, options pages, browser actions and page actions. In our case, we wanted to store our encryption core in the background page so that it would be available all the time; an options page to handle key management; a browser action providing a popup window to access encryption functionalities on the fly; and finally multiple content scripts, loaded depending on which page you are on.

whiteboard doodle of crypto-chrome's early architecture

Let’s dive in a little deeper in each part.

Background script

The background script is where most of the magic happens. It is a persistent (-ish, if you use event pages) piece of javascript code that lives in the background of your browser. We figured it would be the perfect place for our crypto core, so that it can be accessed by our other components.

Content scripts

Content scripts are sandboxed javascript files that are loaded when you visit a matching page and that can interact with its DOM. They can't communicate with the page's existing javascript, and for security reasons they can't directly contact the other extension components either. They can, however, send messages back to the background page. In our case, we wanted a content script for GMail that would be loaded whenever you visit the site, fetch the contents of an email you're reading for signature verification or decryption, and inject into the compose textarea for encryption. So the content script roughly needed simple methods like:

  • getEmailText
  • injectEmailText

It would then make calls to the background script for encryption, decryption, signature and verification.
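Sketched as plain javascript, those two helpers could look something like this (the selectors are purely illustrative; GMail's real, generated DOM is far messier):

```javascript
// Hypothetical shape of the two content-script helpers. `doc` stands for the
// page's document; the selectors are made up for illustration.
function getEmailText(doc) {
  var body = doc.querySelector('div.email-body');
  return body ? body.textContent : null;
}

function injectEmailText(doc, text) {
  var compose = doc.querySelector('textarea.compose');
  if (compose) compose.value = text;
}
```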

Options page

The options page is a standalone page that is typically used to configure your extension. It seemed like the proper area to put all the key management features. Key management and storage are handled by the background script, but the interface itself was to live in an options page.

Browser action

Finally, browser actions are the little icons that live in the toolbar in the upper right corner of Chrome. They are buttons you can press at any time. For crypto-chrome, we wanted to have a little popup that is accessible at all times, with a simple text area allowing the user to take advantage of all the functionalities of the extension’s core, without depending on the content scripts.

Communicating together

As mentioned before, content scripts are sandboxed in their own little area and can't communicate directly with other components, including our crypto-chrome core. The only way to communicate is message passing through the chrome API (here's the doc: http://developer.chrome.com/extensions/messaging.html). So I built a little proof of concept to make sure things were working:

# content script (runs in the GMail tab)
$ ->
  console.log "I'm in GMail, and I have JQuery ho ho ho."

  chrome.runtime.sendMessage {status: 'ready'}, (response) ->
    console.log "Server replied with response: #{response.message}"

# background script (listens for messages)
chrome.runtime.onMessage.addListener (request, sender, sendResponse)->
  if request.status is 'ready'
    sendResponse {message: 'ready, gold leader!'}

Sweet! The GMail content script was able to contact the background, where all our awesome crypto would happen.

Source files

Chrome extensions are built only with HTML, javascript and CSS. However, I'm a big fan of languages with syntactic sugar and of technologies that favour a certain DRYness. In the context of a hackathon, development speed and programming in a friendly, expressive language are key. So I decided to use the following technologies:

  • coffeescript: generates javascript, with a much cleaner, whitespace-sensitive syntax
  • jade: generates HTML, cutting down on a lot of boilerplate; whitespace-sensitive too, so no messing with closing tags
  • stylus: similar to jade but for CSS, and also allows mixins and functions

External dependencies

External dependencies would be managed with Bower when possible, and otherwise stashed in a vendor folder. Bower is a package manager for the web: it lets you manage your front-end dependencies in a way that is very familiar to node developers, using a bower.json that is very similar to your usual package.json.
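For reference, a minimal bower.json could look something like this (the dependency and versions are illustrative, not our actual manifest):

```json
{
  "name": "crypto-chrome",
  "version": "0.0.1",
  "dependencies": {
    "jquery": "~2.0.3"
  }
}
```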

Tests

I installed Karma as a test runner and started drafting some tests. While I knew my colleagues would probably not be too keen on TDD'ing the whole project, it seemed like a good idea to put the structure in place and write some preliminary tests to use as a rough roadmap of the functionalities we wanted to build during the hackathon. It won't hurt the project to have tests post-hackathon anyway!

describe 'In the browser popup', ->

  it 'should allow you to encrypt a text area', ->
    message = 'This is a secret.'
    publicKey = 'Some Key'
    cipher = textEncrypt(message, publicKey)
    expect(cipher).toNotEqual(message)

  it 'should allow you to decrypt a text area', ->
    message = 'This is a secret.'
    publicKey = 'Some Key'
    privateKey = 'My Key'
    cipher = textEncrypt(message, publicKey)
    expect(cipher).toNotEqual(message)
    expect(textDecrypt(cipher, privateKey)).toEqual(message)

  it 'should allow you to sign a message and verify the signature', ->
    message = 'This is a secret.'
    publicKey = 'Some Key'
    privateKey = 'My Key'
    signature = textSign(message, privateKey)
    expect(signature).toBeTruthy()
    expect(verifySignature(signature, publicKey)).toBeTruthy()

Grunt

Finally, Grunt is the glue that holds it all together. Grunt is a javascript task runner, similar to make, with a lot of interesting modules to make your build easier. In my case, I installed the following modules:

    "grunt-contrib-jade": "~0.8.0",
    "grunt-contrib-coffee": "~0.7.0",
    "grunt-contrib-stylus": "~0.7.0",
    "grunt-contrib-watch": "~0.5.2",
    "grunt-contrib-copy": "~0.4.1",
    "grunt-contrib-concat": "~0.3.0",
    "grunt-bower-task": "~0.3.1",
    "grunt-karma": "~0.6.1",
    "grunt-contrib-clean": "~0.5.0"

So, we have modules to compile our Jade, Coffeescript and Stylus to HTML, javascript and CSS. We also have the copy and concat modules, allowing us to combine files and copy them over to the dist folder. Finally, we have a few utility modules to launch bower for web dependencies, run karma for our tests and clean up our project.

Here are some interesting segments of our Gruntfile. The watch module allows you to watch certain filepaths for changes and run tasks when changes are detected (just type grunt watch). In our case, we configured our src folders to trigger the compile tasks for our Coffeescript, Jade and Stylus: as soon as we edited source files, the extension files would be updated in near real time.

    watch:
      coffee:
        files: 'src/**/*.coffee'
        tasks: 'coffee:compile'
      jade:
        files: 'src/**/*.jade'
        tasks: 'jade:html'
      stylus:
        files: 'src/**/*.styl'
        tasks: 'stylus:compile'

The grunt tasks we exposed were the following:

 grunt.registerTask 'default', ['bower', 'compile', 'copy-resources']
 grunt.registerTask 'compile', ['coffee:compile', 'stylus:compile', 'jade:html']
 grunt.registerTask 'copy-resources',  ['concat', 'copy:img', 'copy:manifest']
 grunt.registerTask 'test', ['karma:unit']

Running "grunt" would fetch all dependencies and build the whole project, whereas typing just "grunt compile" would only compile our code, without slowing us down with dependencies that were already there.

Git

Since we'd all be working on the same project very fast, we needed a source control system. Thankfully, we were all familiar with git, so I created the repository. The night before, at Louis's insistence, we decided we would make sure to use the --rebase flag on each pull in order to keep a cleaner history.

Hackathon-day

At 9 am, we showed up at the base of the tower and we rode the lift to the top. After a bit of setup time, we pulled out our laptops and got ready to work. We quickly went over the project structure that I had set up in the morning and assigned general tasks for everyone: @jpcaissy would tackle the encryption core, Louis would start on the GMail content script and I would work on the extension pages.

Early issues

Some of the early hurdles we encountered were:

Node issues in Ubuntu

Turns out that running apt-get install node would be quite the mistake. In the Ubuntu repos, the Node.js package is called nodejs, and it ships quite an old version at that. I believe there's a PPA you can add to get the latest versions, but since time was of the essence, we got Louis to build node from the github source. Much better!

Unfamiliarity with Coffeescript

Not everyone is familiar with Coffeescript. Louis had no experience with it before, and @jpcaissy was a bit more at ease with javascript. No problem. Since we were using Grunt to build our extension, all we had to do is add a new javascript folder in our sources and let Louis work there, and simply copy the files over with a Grunt task. 5 minutes setup, and everyone was happy!

Issues with Bower

The grunt bower task has been a bit wonky ever since the Bower 1.0 release. It stopped working fairly early on, even though the globally installed bower still worked: the task seemed to hang and kept the other build steps from completing. We ended up creating a build task that does everything except the bower bit. Minor setback, but I'll have to look into it later.

Giving up tests

Since we were not all comfortable with a TDD approach, we ended up dropping the tests for the duration of the hackathon, agreeing to restore them later. While it allowed us to gain some speed, we ran into some regression issues during the day that tests could have mitigated.

Change in direction

During the competition, we learned that another team was working on something very similar: forking an existing project, they were building a user-friendly way to integrate crypto into GMail. This made us pivot a bit: we shifted the focus away from GMail and toward refining a background API that can easily be used by content scripts. This would let us bring privacy to practically any website, provided someone contributes a content script. To build our proof of concept for the day, we got Louis to tackle the Outlook and Facebook content scripts.

Challenges and solutions

After that, the day went on at a frantic speed. Beers were had, and delicious meals were provided. Admittedly, eating ribs was a poor decision in the middle of a hackathon but they were so delicious that it was worth it. Here’s a sample of some technical challenges we bumped into and that we had to overcome:

Beating GMail’s auto-saving drafts

One issue we ran into first was that GMail auto-saves your drafts. That's really useful normally, but we wouldn't want a cleartext version of the message to be saved by Google before it's encrypted! Our quick and dirty solution was to put a text area in the popup, letting users write messages there and click 'insert into page' (or just press the 'Encrypt' button) to inject the text directly into GMail's message box.

Not the best solution, but it worked in a pinch. Other solutions could be:

  • Overlaying our own text area on top of GMail's text area
  • Somehow blocking the draft save (not sure how to do this from a sandboxed content script)
  • Hijacking the keyboard functionality directly (thanks @ramisayar)

Saving the keys

Another challenge was figuring out how to persist private and public keys. Local storage was the easiest client-side mechanism, and we figured it would be reasonably secure if encrypted with the Stanford JavaScript crypto library. All keys are saved in one encrypted bundle that must be decrypted for every read/write access. We’ll add more customization for that later.

Preserving key integrity

Considering that Google Chrome is potentially an unsafe environment for a targeted attack, we looked at mechanisms to preserve key integrity and help users detect when keys might have been tampered with. The easiest approach we found was to hash the entire key and use Gravatar’s identicons to generate a coloured visual pattern representing it. A user can then visually recognize when the key has changed.

Facebook formatting

Perhaps just an hour before pitching our extension to the judges, we realized that importing a ciphertext from Facebook messages and trying to decrypt it was failing, even though the engine was doing a fine job in GMail. After much fussing around, we realized that Facebook strips empty lines, breaking the ciphertext format specified in RFC 2440. That’s just one extra gotcha we’ll have to consider when sanitizing content script input.
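The RFC 2440 armor format requires a blank line between the armor headers (e.g. “Version: …”) and the base64 body, so one workaround is to re-insert that line before handing the text to the decryption engine. A minimal sketch of the idea (the `repairArmor` name and the heuristic are mine, not the extension’s actual sanitizer):

```javascript
// Re-insert the blank line RFC 2440 requires between the armor headers
// and the base64 body, which Facebook strips along with all other
// empty lines. Idempotent: text that already has the blank line is
// returned unchanged.
function repairArmor(text) {
  const out = [];
  let inHeaders = false;
  for (const line of text.split('\n')) {
    if (/^-----BEGIN PGP/.test(line)) {
      out.push(line);
      inHeaders = true;
      continue;
    }
    if (inHeaders) {
      if (line === '') {
        // Blank separator already present; nothing to repair.
        out.push(line);
        inHeaders = false;
        continue;
      }
      if (!/^[A-Za-z-]+: /.test(line)) {
        // First non-header line: restore the missing blank separator.
        out.push('');
        inHeaders = false;
      }
    }
    out.push(line);
  }
  return out.join('\n');
}
```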

Results!

9 hours, 70+ commits, and we got an award for best team! We all worked like madmen and I’m really happy with the experience. We got to build something completely new to us (extensions), tested out a work methodology, and ended up with something pretty darn cool.

Key management screen

popup

And here’s the project on GitHub.

Roadmap

We took a week to relax and celebrate a bit, but fear not, the adventure has just begun! Developing crypto-chrome and talking to others made us realize there really is a need for it. With our current use of cloud services and the recent demands by the US government that those services hand over private data, we need to find solutions to protect our privacy. We believe this solution cannot come from a new service provider: anyone can be corrupted, bribed or coerced into giving away their data, and the best outcome in that situation is to do what Lavabit and Silent Circle did and simply shut the service down. Instead, we believe the way forward is to democratize client-side encryption through open source solutions that can be easily peer-reviewed.

Please, go to our project page and contribute. Make a content script for your favourite website, test out the extension, report issues, and send us pull requests for improvements!

Here’s a sneak peek of what we’re planning in the near future before officially launching in the Chrome Web Store:

  • finish signature verification,
  • better popups to ask for local storage access,
  • iron out bugs in the Facebook and GMail content script import/export features.

And of course a little bit later:

  • integrate with other websites (Reddit, GitHub, and find a clever way to integrate Twitter),
  • private key generation,
  • make a first-time user wizard.

 

I’m back!

It’s been a solid 2 years since I’ve updated my old blog, and even longer since I’ve added anything new to my bushibytes.com website. Things have changed a lot since then, and I thought a fresh start was in order, so here we are!

My name is Alexandre Rimthong (hence the odd domain name) and I’m an IT consultant, a security enthusiast, a frequent hackathon participant, a fledgling startup founder and a bit of a fitness nerd from Montreal, Canada. I love to tackle new challenges, whether that’s developing an app in 24 hours or running a half marathon in the mud. I figured I could use this blog to practice my writing and to share stories and articles from my various experiences, successes and, most importantly, failures!

More specifically, I’ll be sharing:

  • my experiences developing with a team in Node.js, JavaScript, CoffeeScript and AngularJS using test-driven and agile methodologies,
  • my trials and tribulations in entrepreneurship with creating a lean startup,
  • the challenges faced while developing very rapidly all kinds of software during hackathon events,
  • my dabbling in security competitions,
  • my involvement in open-data initiatives,
  • and my new obsession with physical challenges like the Spartan Race and triathlons.

Hopefully there will be some useful articles here and there that can help fellow devs and geeks out!