Scraping Coles Online

January 28, 2014

Over a year ago I wrote a scraper for Woolworths.

Today, I managed to scrape Coles Online.

Everything is up on my GitHub repo:

I’ll try to keep this post short.

First up, Coles did a great job revamping their online shopping experience. It’s actually navigable!

The only challenge I faced scraping their website is that they use AJAX requests to load the ‘Next’ page. To get around it, I opened up a PhantomJS webdriver and used that to click through the links. This does make processing significantly slower, but it means I can scrape the entire catalogue with ease.
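Here is a minimal sketch of that click-through idea. It is not the spider from the repo. It assumes the pager is a link whose visible text is exactly "Next", and that a short pause is enough for the AJAX response to arrive. `scrape_all_pages`, `parse_page`, `delay_range` and `max_pages` are hypothetical names made up for illustration. The `driver` argument is any Selenium-style WebDriver object.

```python
import random
import time


def scrape_all_pages(driver, start_url, parse_page,
                     delay_range=(1.0, 3.0), max_pages=1000):
    """Load start_url, then keep clicking 'Next' until the link disappears.

    driver     -- Selenium-style WebDriver: needs get(), page_source and
                  find_elements(by, value)
    parse_page -- hypothetical callback: takes the page HTML, returns a list
    """
    items = []
    pages_seen = 0
    driver.get(start_url)
    while True:
        # Parse whatever the browser has rendered for the current page.
        items.extend(parse_page(driver.page_source))
        pages_seen += 1

        # "link text" is the locator strategy Selenium exposes as By.LINK_TEXT.
        # Assumption: the pager link's visible text is exactly "Next".
        next_links = driver.find_elements("link text", "Next")
        if not next_links or pages_seen >= max_pages:
            break  # last page reached, or safety cap hit

        next_links[0].click()
        # Crude wait for the AJAX content to load; the random pause also
        # spaces the requests out a little.
        time.sleep(random.uniform(*delay_range))
    return items
```

With the Selenium releases of the time, the driver would come from `webdriver.PhantomJS()`. Selenium 4 dropped PhantomJS support, so on a current install a headless Chrome or Firefox driver would have to stand in. A fixed sleep is the simplest possible wait, and an explicit wait on some element of the new page would be more robust.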

The spider can be found at:

It’s only 47 lines all up, including blank lines, imports, class definitions, etc.

Just so you know, this is a very rushed attempt. I didn’t bother cleaning up the data, and only the price and name of each product are scraped.

Have fun!

One Comment
  1. Thanks for this. I tried it today, and after spidering through a few pages the Coles website started returning 404 errors for every page. Did you encounter this? I think it’s detecting too many requests from a single IP address in a short time: when I switched from my home WiFi to tethering off my phone it worked again, but only briefly. Now I have to find a third IP address and try again with a random 30-60 second sleep between pages, to see whether they still block it at that rate.

