Scraping Coles Online

January 28, 2014

Over a year ago I wrote a scraper for Woolworths.

Today, I managed to scrape Coles online.

Everything is up on my GitHub repo: https://github.com/johnjiang/trolley

I’ll try to keep this post short.

First up, Coles did a great job revamping their online shopping experience. It’s actually navigable!

The only challenge I faced scraping their website is that they use AJAX requests to load the ‘Next’ page. To get around it, I opened up a PhantomJS webdriver and used it to click the links. Processing is significantly slower as a result, but it means I can scrape the entire catalogue with ease.
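For anyone curious what the click-through approach looks like, here is a minimal sketch. It is not the code from the repo. The helper names (`iter_pages`, `make_driver`), the link text "Next", the page cap and the fixed delay are all assumptions made for illustration. It also assumes a recent Selenium release: PhantomJS support has been removed from current versions, so the sketch swaps in headless Firefox.

```python
import time


def iter_pages(driver, next_link_text="Next", max_pages=50, delay=1.0):
    """Yield the HTML of each results page, clicking 'Next' until it is gone.

    `driver` is any object with a `page_source` attribute and a
    `find_elements(by, value)` method, such as a Selenium WebDriver.
    """
    for _ in range(max_pages):
        yield driver.page_source
        # "link text" is the locator string behind selenium's By.LINK_TEXT.
        links = driver.find_elements("link text", next_link_text)
        if not links:
            break  # no 'Next' link left, so this was the last page
        links[0].click()
        # Crude fixed wait for the AJAX request to finish (an assumption;
        # an explicit wait on a page element would be more robust).
        time.sleep(delay)


def make_driver():
    """Hypothetical helper: start a headless browser in place of PhantomJS."""
    # Imported lazily so iter_pages can be tried without selenium installed.
    from selenium import webdriver

    options = webdriver.FirefoxOptions()
    options.add_argument("-headless")
    return webdriver.Firefox(options=options)
```

To use it, create a driver with `make_driver()`, call `driver.get(...)` on a category page, then loop over `iter_pages(driver)` and hand each page's HTML to whatever parser you like. Every page costs a real browser round trip plus the delay, which is where the slowdown comes from.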

The spider can be found at: https://github.com/johnjiang/trolley/blob/master/trolley/spiders/coles.py

It’s only 47 lines all up, including blank lines, imports, class definitions, etc.

Just so you know, this is a very rushed attempt. I didn’t bother cleaning up the data, and only the name and price of each product are scraped.

Have fun!

One Comment
  1. Thanks for this. I tried it today, and after spidering through a few pages the Coles website started returning 404 errors for every page. Did you encounter this? I think it’s detecting too many requests from a single IP address in a short time: when I switched from my home WiFi to tethering off my phone it worked again, but only briefly. Now I have to find a third IP address and try a random 30-60 second sleep between each page, to see if they will block it at that rate.