Scraping Coles Online
Over a year ago I wrote a scraper for Woolworths.
Today, I managed to scrape Coles Online.
Everything is up on my GitHub repo: https://github.com/johnjiang/trolley
I’ll try and keep this blog short.
First up, Coles did a great job revamping their online shopping experience. It’s actually navigable!
The only challenge I faced scraping their website is that they use AJAX requests to load the ‘Next’ page. To get around it, I opened up a PhantomJS webdriver and used that to click through the links. This makes processing significantly slower, but it means I can scrape the entire catalogue with ease.
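To give a rough idea of the click-through approach, here is a minimal sketch rather than the actual spider code. It assumes `driver` is any object with Selenium's WebDriver interface (`page_source` and `find_elements`), and `next_selector` is a hypothetical CSS selector for the ‘Next’ link that you would have to read off the real page markup. The fixed wait is also an assumption; a proper explicit wait would be more robust.

```python
import time


def iter_pages(driver, next_selector, max_pages=100, wait_seconds=2.0, sleep=time.sleep):
    """Yield the HTML of each page, clicking 'Next' until the link disappears.

    `driver` only needs Selenium's WebDriver interface (page_source and
    find_elements). `next_selector` is a hypothetical CSS selector for the
    'Next' link, not one taken from the Coles site.
    """
    for _ in range(max_pages):
        yield driver.page_source
        # "css selector" is the string value behind Selenium's By.CSS_SELECTOR
        links = driver.find_elements("css selector", next_selector)
        if not links:
            break  # no 'Next' link left, so this was the last page
        links[0].click()
        sleep(wait_seconds)  # crude fixed wait for the AJAX content to load
```

Each yielded HTML string can then be handed to whatever parser pulls out the product name and price. The `max_pages` cap is just a safety net so a mismatched selector cannot loop forever.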
The spider can be found at: https://github.com/johnjiang/trolley/blob/master/trolley/spiders/coles.py
It’s only 47 lines all up, including blank lines, imports, class definitions, etc.
Just so you know, this is a very rushed attempt. I didn’t bother cleaning up the data, and only the price and name of each product are scraped.
Have fun!
Thanks for this. I tried it today, and after spidering through a few pages the Coles website started returning 404 errors for every page. Did you encounter this? I think it’s detecting too many requests from a single IP address in a short time: when I switched from my home WiFi to tethering off my phone, it worked again, but only briefly. Now I have to find a third IP address and try again with a random 30-60 second sleep between each page, to see if they will still block it at that rate.
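The random pause between pages could look something like the sketch below. It is plain Python, not part of the trolley repo; the function name `polite_sleep` and its parameters are made up for illustration, and there is no guarantee that this rate is slow enough to avoid being blocked.

```python
import random
import time


def polite_sleep(min_seconds=30.0, max_seconds=60.0, sleep=time.sleep):
    """Pause for a random time between min_seconds and max_seconds.

    Returns the delay that was used so it can be logged. The `sleep`
    argument exists only so the pause can be swapped out when testing.
    """
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay
```

Calling `polite_sleep()` once after each page is fetched would spread the requests out to roughly one every 30 to 60 seconds.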