
Scraping Coles Online

So over a year ago I wrote a scraper for Woolworths.

Today, I managed to scrape Coles online.

Everything is up on my github repo: https://github.com/johnjiang/trolley

I’ll try and keep this blog short.

First up, Coles did a great job revamping their online shopping experience. It’s actually navigable!

The only challenge I faced scraping their website is the fact that they use AJAX requests to load the ‘Next’ page. To get around it, I opened up a PhantomJS webdriver and used it to click through the links. This makes processing significantly slower, but it means I can scrape the entire catalogue with ease.
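That click-through approach can be sketched roughly like this (a hypothetical outline, not the actual spider — the function name and link text are made up, and the selenium import is deferred so the snippet reads standalone):

```python
def crawl_all_pages(start_url, next_link_text="Next"):
    """Yield each page's HTML, clicking the AJAX 'Next' link until it disappears."""
    # Deferred import so the sketch can be loaded without selenium installed.
    from selenium import webdriver
    from selenium.common.exceptions import NoSuchElementException

    driver = webdriver.PhantomJS()
    try:
        driver.get(start_url)
        while True:
            yield driver.page_source  # hand the rendered page off for parsing
            try:
                # The 'Next' link triggers an AJAX reload rather than a plain href.
                driver.find_element_by_link_text(next_link_text).click()
            except NoSuchElementException:
                break  # no more pages
    finally:
        driver.quit()
```

Each yielded page can then be fed to whatever parser the spider uses.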

The spider can be found at: https://github.com/johnjiang/trolley/blob/master/trolley/spiders/coles.py

Only 47 lines all up, including blank lines, imports, class definitions etc…

Just so you know, this is a very rushed attempt. I didn’t bother cleaning up the data, and only the price and name of each product are scraped.

Have fun!

iCloud sucks

Conceptually I like iCloud. It’s this magical service that devs utilise so that your data is stored safely and securely. You don’t really need to worry about how your data is stored, or where it’s stored, as long as it is stored and you can access it.

My problem with iCloud is that it just doesn’t work in real life.

I can’t SEE my data. How do you access your files outside of the application? You can’t.

Here’s my anecdotal story of how iCloud is a piece of crap.

I use an app called 1SE. You take videos and store 1 second video clips against a particular day. I’ve been doing this for the past 8 months so as you can imagine this takes up quite a bit of space. But thankfully I have iCloud right? It backs it up! Great! Now I don’t have to worry about losing all my files!

But wait…what happens if I need to swap phones?

Using iCloud restore, 1SE just refused to restore…it was stuck on “loading”. After rebooting it was stuck on “waiting”. Great.

Being impatient, I downloaded iExplorer, navigated to the app directory and did a copy-paste of the 1SE app directory. Thankfully that restored all my settings/videos just fine.

Now it makes me wonder: the vast majority of users would have been FUCKED if this happened to them.

Instead of restoring the app AND the data together, iOS should first restore the app and, once it’s opened, allow users to choose to restore from backup. Currently, if you accidentally delete your app and reinstall it, all your data is gone unless you restore your entire phone. This is a serious design flaw.

I also thought about restoring from an iTunes backup, but for some reason it didn’t restore any apps :S I remember there used to be a “sync apps” checkbox in the Apps tab but I couldn’t find it in the new iTunes. Ah well…

Building a microserver

A couple of months ago I bought a HP Proliant Microserver N54L. I had contemplated buying the original model, the N40L, when it was on sale for $200 two years ago but could never justify it. I had plenty of storage at the time. However, as time passed, my storage needs were getting out of hand. All my movies were stored on a 2TB Time Capsule, a 1.5TB external drive held a bunch of TV shows, and then there was random stuff stored locally and/or on other networked computers.

It was a mess.

It also meant that my Apple TV running XBMC could not easily connect to the various shared drives.

So when HP announced the N54L and it was on sale for $300, I jumped at the opportunity.

N54L

N54L inside

 

After having it for around one month, I finally decided to buy the harddrives. I settled on 4x3TB WD Reds. I didn’t want to worry about the potential for Greens to fail on me, so I bit the bullet and paid the premium.

I also got myself 16GB of RAM. You can find the model number in the pics below.

N54L and parts

 

16gb G-Skill Ram

Getting the mobo out was a bit tricky; it meant being a bit forceful with the cables. Eventually I got it out and plugged in my RAM.

N54L mobo

 

The harddrives were really easy to install. The Microserver also comes with an internal USB port, so I bought a high-speed USB stick and booted FreeNAS off it.

A pretty great setup if I say so myself.

So now my microserver hosts all my media files and does all the torrenting too!

It has an additional 5.25″ bay that I could potentially fit 2 more harddrives in. As of now, I still have 4.55TB remaining out of the usable 9TB (3TB goes to parity), which I hope should last me a few years.
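The capacity arithmetic, as a quick sketch (ignoring filesystem overhead and the TB vs TiB distinction):

```python
# 4 drives of 3TB each, with one drive's worth of space consumed by parity
# (single-parity RAID-Z style).
drives = 4
size_tb = 3
raw_tb = drives * size_tb            # 12TB raw
usable_tb = (drives - 1) * size_tb   # parity eats one drive's worth
print(raw_tb, usable_tb)  # 12 9
```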

 

Citcon

So I attended a conference over the weekend: CITCON (www.citconf.com), the Continuous Integration and Testing Conference. I found out about it through the SyPy mailing list. It was free, it was at Atlassian, and I thought I could get away with taking a couple of days off work.

Then I realised that it was on a Friday and Saturday so I thought “Ah well, one day off work is still good.”

So I opened up Outlook and decided to email my managers.

One of them replied with something along the lines of:

It’s on a Friday night and Saturday…and it’s free…so I would ask your girlfriend

I never checked the schedule and just assumed it was a full day on Friday.

Extremely embarrassed, I didn’t even bother trying to dig myself out of that one, so I just replied with a “Good point haha”.

I’ve actually never attended a conference before. I only just started attending meetups last year, so I guess this is the year I also start going to conferences. I didn’t know what to expect from CITCON; my expectations weren’t that high.

It was an open space conference, meaning there was no set agenda: the attendees come up with the topics and vote on what they want to listen to. Friday night mainly involved introducing one another. I should have told them my story above, but I get nervous speaking to big crowds and the thought never entered my mind.

After introductions, people began putting up agendas on the whiteboard whilst others went into the kitchen for some food. They had this kind of Peking-duck-like finger food where the pancake was really thick but soft, and it was just delicious! I think I had 3.

I spent the rest of the night just socialising.

So far so good.

I rocked up Saturday morning and had a bit of breakfast at the conference. Instant coffee and banana bread haha.

The talks themselves weren’t exactly talks. Most of them were just discussions around a central topic or problem. It’s interesting because everybody has similar problems but nobody has a solution that fixes all of them. There’s no specific path to take, just steps in the right direction.

In our first talk, people probably spent around 10 minutes just debating the difference between “Delivery” and “Deployment”. Personally, I found it extremely entertaining; however, it really felt like they were just bikeshedding. To me, the definition is what you make of it.

I found one woman’s method extremely interesting. She noted that she had sacked entire test teams for their ineptness. She commented that you only need one good tester per team to do a good job, and that you’re not doing your job if you need to run the same manual test twice.

Her methodology was to get the test teams to write the acceptance tests in a DSL like Cucumber before any actual code has been written. Testers then manage the entire development process and conduct the deployment themselves. This is similar to what we do at work, actually, where my manager writes up the tests in Concordion. However, I still manage the deployments.

I learnt a fair bit about what an “acceptance test” is. Acceptance tests should not be exhaustive; they focus on basic functionality. A separate job can be run to be more exhaustive. The tests must have a business case, e.g. validating a date format would not be an acceptance test, while validating whether you can withdraw a negative amount would be. Acceptance tests should not be about code coverage; that should be done via unit tests.
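As a sketch of that withdrawal example (the `Account` class and test here are hypothetical, purely to illustrate the “business case” idea):

```python
class Account(object):
    """Toy account with just enough behaviour to express the business rule."""

    def __init__(self, balance=0):
        self.balance = balance

    def withdraw(self, amount):
        if amount <= 0:
            raise ValueError("withdrawal amount must be positive")
        if amount > self.balance:
            raise ValueError("insufficient funds")
        self.balance -= amount


def test_cannot_withdraw_negative_amount():
    """Acceptance-style test: a negative withdrawal must be rejected outright."""
    account = Account(balance=100)
    try:
        account.withdraw(-50)
    except ValueError:
        assert account.balance == 100  # the business rule held: money untouched
    else:
        assert False, "negative withdrawal should have been rejected"
```

The point is that the test encodes one business rule, not an exhaustive validation of every possible input.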

I also attended a Selenium best practices discussion. In short: don’t use XPath, use CSS selectors, make your tests atomic, don’t bother with HtmlUnit or PhantomJS, and test on VMs using Selenium Grid.

I might try converting some of the XPath selectors at work to CSS selectors. Hopefully our tests will run slightly faster.

Overall, CITCON was an awesome experience. I learnt a fair bit about testing and its community. The people are truly great and passionate about what they’re doing. What I’d always imagined to be a boring and mundane role has opened my eyes to the value of testers. I might even pursue a role in testing if I feel that I can’t be bothered being a code monkey.

Now to find the next conference to attend.

The Big Python Migration – Mysteries of HTML2.py

At work we’re currently migrating our webapp from webware to flask. As part of this migration, I’ll be documenting parts of my endeavor. This will be the first post of the series.

In our Python web app, we now officially have 3 ways to generate HTML code.

The original method, which was in use before my time, involved crafting the HTML manually. The previous developer probably thought that this was a waste of time, so he made a bunch of helper methods to do it. Unfortunately, it was still terrible, since you got stuff like:

def td(self, content):
    self.write("<td>")
    self.write(content)

Yes, it doesn’t even bother closing the tag. The actual method is longer but I shortened it for brevity.

So someone decided to make it “better”, and so HTML2.py was born.

Here’s what it looks like:

class TAG(object):
    _attr_quote = '"'

    def __init__(self, tag='TAG', contents=None, no_contents=False, **attributes):
        self.no_contents = no_contents
        contents = contents or []
        if not isseq(contents):
            contents = [contents]
        self.tag = tag
        self.contents = contents
        self.attributes = attributes

    def __add__(self, other):
        self.contents.append(other)
        return self

    def addAttrs(self, **attrs):
        self.attributes.update(attrs)

    def __str__(self):
        try:
            return ''.join(self.str_g())
        except:
            log.exception(str(type(self)) + ' ' + str(self.contents))
            if self.contents:
                log.exception(str(self.contents[0]))
            raise

    def str_g(self):
        yield '<' + self.tag
        for (a, v) in self.attr_g():
            yield ' ' + a + '=' + self._attr_quote + str2(v) + self._attr_quote
        if self.no_contents:
            yield '/>'
        else:
            yield '>'
            for c in self.contents:
                if isinstance(c, TAG):
                    for i in c.str_g():
                        yield i
                else:
                    yield str2(c)
            yield '</' + self.tag + '>'

    def attr_g(self):
        for a, v in self.attributes.iteritems():
            if isseq(v):
                for _v in v:
                    yield a, _v
            else:
                yield a, v

Again…shortened for brevity.

So basically you have an object that represents every tag:

class SPAN(TAG):
    def __init__(self, contents=None, **attributes):
        TAG.__init__(self, 'SPAN', contents, **attributes)
class DIV(TAG):
    def __init__(self, contents=None, **attributes):
        TAG.__init__(self, 'DIV', contents, **attributes)
class PRE(TAG):
    def __init__(self, contents=None, **attributes):
        TAG.__init__(self, 'PRE', contents, **attributes)
class A(TAG):
    def __init__(self, contents=None, **attributes):
        TAG.__init__(self, 'A', contents, **attributes)
class P(TAG):
    def __init__(self, contents=None, **attributes):
        TAG.__init__(self, 'P', contents, **attributes)

Here’s some code to demonstrate how you would construct a table:

table = TABLE()
tr = TR()
tr += TD("Some value")
tr += TD("Some value")
table += tr

Considering that the folks at web2py do a “similar” job, I guess this design isn’t THAT bad. But our underlying implementation is surely terrible.

I’ve gotten used to this implementation, and since it’s a blackbox of sorts, it never occurred to me to ever fix it. However, recently I was working on a page that for some reason took significantly longer to load.

This page took a whole 10 seconds to load on my machine. The weird thing was that this page did not have any complicated business logic nor was it doing any intensive queries.

I isolated the code to one specific function which involved rendering drop downs. The drop downs themselves actually contained around 5000 elements. Hmmm. I added some code around this to time how long it took and I ran the report.

1.6 seconds.

1.6 seconds to generate a drop down with 5000 elements.

The page had 3 of these elements so that’s at least 4.8 seconds.

This is where I decided to try and optimise HTML2.py. I should have done it a long time ago but for whatever reason I chose not to.

The first thing I thought to do was use lxml. I had actually used lxml to replace a couple of our XML-related APIs. It managed to turn a 35-second request into a mere 12-second request, but that’s a different story. So I thought “I’ll just use lxml then”. After spending 10 minutes on it, it was looking hopeful, but I could immediately see problems.

I was constructing etree elements within the to_string method, so for every tag element I was duplicating it with an etree element. I’d definitely need to fix this. Another problem was that lxml escapes HTML automatically and there’s no way to NOT escape it. Unfortunately, due to the way our web app was designed, NOT escaping HTML is a feature…not a vulnerability. So I quickly reverted my code.

Being the lazy programmer that I am, and also incredibly hungry at the time, I posted up a Stack Overflow question in the hope that somebody could help me come up with a solution to this problem.

I came back from lunch, refreshed the page, and was surprised to see two solutions.

The first solution I read happened to use lxml as well and looked quite similar to mine; however, since I knew this wasn’t going to work out, I read the next solution.

It was such an elegant solution that I couldn’t believe I never came up with it myself. It was simple and obvious. I think it was one of those situations where I’d been looking at it for so long that over time I’d become accustomed to the style. Or perhaps I’m just creating excuses for my own ineptness.

So after implementing it, the code became this:

class TAG(object):
    __slots__ = ["contents", "attributes"]
    tag = "TAG"

    def write(self, writer=Writer()):
        writer.write(self.to_string())
    def __init__(self, contents=None, **attributes):
        contents = contents or []
        if not isseq(contents):
            contents = [contents]
        self.contents = contents
        self.attributes = attributes
    def __add__(self, other):
        self.contents.append(other)
        return self
    def addAttrs(self, **attrs):
        self.attributes.update(attrs)
    def __unicode__(self):
        return self.to_string()
    def to_string(self):
        return """<{tag}{attributes}>{content}</{tag}>""".format(
            tag=self.tag,
            attributes=''.join(' %s="%s"' % (attr, _unicode(val)) for attr, val in self.attr_g()),
            content=''.join(_unicode(n) for n in self.contents)
        )
    def attr_g(self):
        for a, v in self.attributes.iteritems():
            if isseq(v):
                for _v in v:
                    yield a, _v
            else:
                yield a, v

I ran my ad-hoc benchmark again. 0.6 seconds.

0.6 seconds to generate the same drop down box with 5000 elements.

60% performance improvement.

I timed the entire page and it took around 4 seconds. I couldn’t believe my eyes. Yes, I know 4 seconds is still incredibly terrible, but compared to what it was before it’s still a reason for celebration.

So. Fast forward 1 month and I’m writing this blog post. I was going to write up a more scientific benchmark; however, something screwed up. Well…I screwed up, more precisely.

Here’s my simple benchmark:

import time
from Lib.HTML.HTML2 import *

start_time = time.time()

table = TABLE()
for _ in xrange(100000):
    table += TR(TD("TEST", some_attribute="foo"))

html_str = str(table)
print time.time() - start_time

I ran it against the original code. 8.84 seconds. So I thought…I should be getting around 4 seconds with the new version. Can you guess what happened?

10.24 seconds

HOW DID THIS HAPPEN?!

HOW DID I GET THE ORIGINAL BENCHMARK?

I was extremely worried, so I began analysing everything again. I compared the outputted HTML from both versions and they matched character for character.

I even pulled up an older version of our app and tested it again, and for some mysterious reason it was performing just as well after the changes. I felt somewhat defeated. I had spent time working on something that arguably provided very little value. The fact that the code is A LOT more readable may not be enough to convince people that it’s a worthwhile change. It would have been too late to change it back anyway, but I was sad knowing that my code had no performance benefits.

Not giving up, I commented out pieces of code to find out if there were any inefficiencies that might hint at where I went wrong. What I found was that when I commented out:

if isseq(v):
    for _v in v:
        yield a, _v
else:
My code ran in 6.7 seconds! I repeated the test over and over again, and my eyes did not fool me. So I delved deeper.

I found this:

def isseq(i):
    if not '__contains__' in dir(i):
        return False

    if isinstance(i, basestring) or isinstance(i, dict):
        return False

    return True

Do you see it?

I facepalmed. It was checking whether the object “i” responded to the message “__contains__”, but it was doing so in an extremely inefficient manner. Quoting the official documentation for the “dir” function:

Without arguments, return the list of names in the current local scope. With an argument, attempt to return a list of valid attributes for that object.

I changed this to:

def isseq(i):
    if not hasattr(i, '__contains__'):
        return False

    if isinstance(i, basestring) or isinstance(i, dict):
        return False

    return True

and ran my test again.

2.4 seconds

Holy mother of crap. isseq was being called on object creation and also when generating the HTML, and it took up the vast majority of processing time. I had done some Apache Bench tests, and on one particular page it would take 37.89 seconds to process 500 requests with 16 threads. After the change it takes just 8.007 seconds. A speedup of nearly 5x just by optimising ONE line of code.
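The difference is easy to reproduce in isolation: dir() builds (and sorts) a list of every attribute name just to check for one, while hasattr() does a single attribute lookup. A quick sketch (timings vary by machine, so I won’t quote numbers):

```python
import timeit

setup = "x = [1, 2, 3]"

# The slow check: materialise every attribute name, then search the list.
t_dir = timeit.timeit("'__contains__' in dir(x)", setup=setup, number=100000)

# The fast check: a direct attribute lookup.
t_hasattr = timeit.timeit("hasattr(x, '__contains__')", setup=setup, number=100000)

print("dir():     %.3fs" % t_dir)
print("hasattr(): %.3fs" % t_hasattr)
```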

How about that page that was taking 4 seconds? 0.45 seconds.

So in the end, whilst the main aim of the code change achieved little, the end result was quite drastic, simply because I went exploring around the codebase. What annoys me is that if I had noticed this earlier, our system would have been a lot faster a lot sooner.

I still have no idea how I managed to get those initial benchmarks…but I swear to god I was not dreaming or making those numbers up.

Well, that’s it for now. I’ll probably post another series on the actual Flask migration and discuss further peculiarities of our system.

Setting up Selenium Grid on Windows 2008 as a service

Doing this was such a pain in the ass that I thought I should blog about it just in case somebody else on the internet runs into the same situation.

Doing Google searches yields a handful of tools for this, but most of them don’t really offer a step-by-step solution; they only provide you with the tool. I found these tools confusing, or maybe I’m just too stupid, so I gave up on using them. Instead I came up with my own solution.

Our server runs Windows 2008, and in the past I had used srvany.exe, part of the Windows Resource Kit, to manage services. However, that kit is limited to Windows 2003, and there doesn’t seem to be an equivalent kit for 2008 (as of this post).

But after googling around, I discovered that all you have to do is copy the exe from the Windows 2003 Resource Kit and it should work.

So that’s exactly what I did!

Here is a step-by-step guide on how to set up Selenium Grid as a Windows 2008 service.

  1. Download Selenium Server (http://seleniumhq.org/download/)
  2. Fetch srvany.exe from a 2003 server with the Resource Kit installed, or try to find it online.
  3. For my setup I moved everything to D:\Selenium, but you can use another directory.
  4. Open up cmd.
  5. Run sc create SeleniumHub binPath= "D:\Selenium\srvany.exe" to create the service.
  6. Open up regedit.
  7. Navigate to “HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\services\SeleniumHub”.
  8. Create the following keys/strings:
    regedit
  9. Start the service!

If you want to set up a Selenium Grid node, do the same thing.
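For reference, srvany reads its configuration from a Parameters subkey of the service: Application points at the executable to run and AppParameters holds its arguments. Something along these lines (the Java and jar paths here are assumptions — adjust them for your setup):

```
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\services\SeleniumHub\Parameters]
"Application"="C:\\Program Files\\Java\\jre7\\bin\\java.exe"
"AppParameters"="-jar D:\\Selenium\\selenium-server-standalone.jar -role hub"
```

A node service is the same, with -role node and the hub’s registration URL added to AppParameters.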

Scraping Woolworths

So I decided to scrape the Woolworths online site for fun. It turns out the entire process was extremely straightforward and easy.

I decided to use Scrapy as the framework, mainly because I’ve heard a lot of good things about it but never actually used it. I’ve always used BeautifulSoup for everything.

What I usually do is find the starting page of any website that I want to scrape. This would usually be the first page.

So I started off at:

http://www2.woolworthsonline.com.au/Shop/Department/2?name=bakery

or more simply:

http://www2.woolworthsonline.com.au/Shop/Department/2

Hmm, that’s interesting. Notice the 2 at the end of the URL. If I change it to, say, 1, I get the baby department. This makes things simple.

That means we can change departments just by URL hackery! (In my implementation I actually find the real department IDs and don’t resort to guessing.)

Looking at the source, we can see that each product is inside a div with class “product-stamp-middle”. I’m more of a CSS selector type of guy, and since there’s no CSS selector parser in Scrapy, I’ve resorted to BeautifulSoup with lxml as the parser. This allows me to fetch all the products with a simple selector:

products = soup.select('div.product-stamp-middle')

Name:
product.select('div.details-container span.description')[0].text.strip()

Price:
product.select('div.price-container span.price')[0].text.strip()

Cup Info (e.g. $1.32/100g, $1 each):
product.select('div.price-container div.cup-price')[0].text.strip()

Each piece of information is in its own element! Woolworths have made this way too easy.
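Put together against a mocked-up product, the selectors above look like this (the HTML below is a hypothetical fragment echoing those class names, not Woolworths’ actual markup, and I’ve used the stdlib html.parser here to keep the snippet dependency-light, where the spider itself uses lxml):

```python
from bs4 import BeautifulSoup

# Hypothetical product markup, mimicking the class structure described above.
html = """
<div class="product-stamp-middle">
  <div class="details-container"><span class="description"> Bread White Loaf </span></div>
  <div class="price-container">
    <span class="price"> $3.50 </span>
    <div class="cup-price"> $0.50 / 100g </div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
for product in soup.select("div.product-stamp-middle"):
    name = product.select("div.details-container span.description")[0].text.strip()
    price = product.select("div.price-container span.price")[0].text.strip()
    cup = product.select("div.price-container div.cup-price")[0].text.strip()
    print(name, price, cup)
```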

The full code can be found on my github: https://github.com/johnjiang/trolley

When I tested it initially, it took around 3 minutes to parse 20k items. Obviously this is dependent on internet connection and machine, but not bad, I thought! What makes it great is that Woolworths online doesn’t do any throttling!

To run it, just git clone the repo, cd into the directory and type this:

scrapy crawl woolies -o results.json -t json

It will dump everything into a JSON file and also display the time spent.

I learnt the following things doing this:

  • Scrapy offers an extremely good workflow for scraping
  • BeautifulSoup is still the winner in my books for parsing
  • Woolworths’ web devs are awesome

Coles on the other hand is terrible…