
Scraping Woolworths

December 23, 2012

So I decided to scrape the Woolworths online store for fun. The entire process turned out to be remarkably straightforward.

I decided to use Scrapy as the framework, mainly because I'd heard a lot of good things about it but had never actually used it. I've always used BeautifulSoup for everything.

My usual first step is to find a good entry page for the site I want to scrape, which is generally just its first page.

So I started off at:

http://www2.woolworthsonline.com.au/Shop/Department/2?name=bakery

or more simply:

http://www2.woolworthsonline.com.au/Shop/Department/2

Hmm, that's interesting: notice the 2 at the end of the URL. If I change it to, say, 1, I get the baby department. This makes things simple.

That means we can change departments just by URL hackery! (In my implementation I discover the real department IDs rather than guessing.)
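That URL hackery can be sketched as a tiny helper. This is illustrative only: the base URL comes from the post, but the spider in the repo discovers the real department IDs instead of hard-coding them.

```python
# Sketch: build Woolworths department URLs from numeric IDs.
# The ID-to-name pairs below are just the two mentioned in the post;
# in practice you'd scrape the real department IDs from the site.
BASE = "http://www2.woolworthsonline.com.au/Shop/Department/{}"

def department_url(dept_id):
    """Return the shop URL for a numeric department ID."""
    return BASE.format(dept_id)

urls = [department_url(i) for i in (1, 2)]  # 1 = baby, 2 = bakery (per the post)
```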

Looking at the source, we can see that each product sits inside a div with the class “product-stamp-middle”. I'm more of a CSS-selector type of guy, and since there's no CSS selector support in Scrapy, I resorted to BeautifulSoup with lxml as the parser. This let me fetch all the products with a simple selector.

products = soup.select('div.product-stamp-middle')

Name:
product.select('div.details-container span.description')[0].text.strip()

Price:
product.select('div.price-container span.price')[0].text.strip()

Cup Info (e.g. $1.32/100g, $1 each):
product.select('div.price-container div.cup-price')[0].text.strip()
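Putting those three selectors together, a minimal extraction function might look like this. The sample HTML is made up to mirror the class names above (the real pages are obviously much bigger), and I'm using Python's built-in html.parser here so the snippet runs without lxml:

```python
from bs4 import BeautifulSoup

# Illustrative fragment matching the class structure described above.
SAMPLE = """
<div class="product-stamp-middle">
  <div class="details-container"><span class="description">White Bread 700g</span></div>
  <div class="price-container">
    <span class="price">$2.50</span>
    <div class="cup-price">$0.36 / 100g</div>
  </div>
</div>
"""

def extract_products(html):
    """Yield (name, price, cup_info) for each product stamp in the page."""
    soup = BeautifulSoup(html, 'html.parser')
    for product in soup.select('div.product-stamp-middle'):
        name = product.select('div.details-container span.description')[0].text.strip()
        price = product.select('div.price-container span.price')[0].text.strip()
        cup = product.select('div.price-container div.cup-price')[0].text.strip()
        yield name, price, cup

print(list(extract_products(SAMPLE)))
# → [('White Bread 700g', '$2.50', '$0.36 / 100g')]
```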

All the information is in a separate div! Woolworths have made this way too easy.

The full code can be found on my github: https://github.com/johnjiang/trolley

When I tested it initially, it took around 3 minutes to parse about 20,000 items. Obviously this depends on your internet connection and machine, but not bad, I thought! What makes it great is that Woolworths Online doesn't do any throttling!

To run it, just git clone the repo, cd into the directory and type:

scrapy crawl woolies -o results.json -t json

It will dump everything into a JSON file and also display the time spent.
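The JSON output is then easy to consume from Python. A sketch, with a caveat: the field names below ("price" etc.) are hypothetical, since they depend on the spider's Item definition — check the repo for the real ones.

```python
import json

def load_results(path="results.json"):
    """Load the crawl output; Scrapy's JSON exporter writes one list of items."""
    with open(path) as f:
        return json.load(f)

def price_value(item):
    """Turn a price string like '$2.50' into a float, for e.g. cheapest-first sorting.
    Assumes a hypothetical 'price' field on each exported item."""
    return float(item["price"].lstrip("$"))
```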

I learnt the following things doing this:

  • Scrapy offers an extremely good workflow for scraping
  • BeautifulSoup is still the winner in my books for parsing
  • Woolworths’ web devs are awesome

Coles on the other hand is terrible…



6 Comments
  1. Mike permalink

    After cloning the project and running the command I get the following error 😦
    root@server:~/trolley# time scrapy crawl woolies -o items2.json -t json --nolog
    Usage
    =====
    scrapy crawl [options] …

    crawl: error: no such option: -o

    • I just tried and it’s working fine. Try just doing:

      scrapy crawl woolies -o results.json -t json

      Or try typing out the command instead of copy-pasting it.

      • Mike permalink

        Very weird. Just re-cloned the repo and typed out the command but I still received the same error. Is there anything extra that I may have to install?

      • Mike permalink

        Using the following command seems to have worked, which is weird since I'm using the most current version of Scrapy, which should support -o.
        --set FEED_URI=items.json --set FEED_FORMAT=json

      • Hmmm, not sure. I tested on 0.20.2, which is the latest. Do a pip list | grep Scrapy to find the version.

