
Scraping Coles Online

So over a year ago I wrote a scraper for Woolworths.

Today, I managed to scrape Coles online.

Everything is up on my github repo: https://github.com/johnjiang/trolley

I’ll try and keep this blog short.

First up, Coles did a great job revamping their online shopping experience. It’s actually navigable!

The only challenge I faced scraping their website is the fact that they use AJAX requests to load the ‘Next’ page. To get around it, I opened up a PhantomJS webdriver and used it to click through the links. This makes processing significantly slower, but it means I can scrape the entire catalogue with ease.
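That click-through approach can be sketched roughly like this (a hypothetical outline, not the actual spider — the function name and link text are made up, and the selenium import is deferred so the snippet reads standalone):

```python
def crawl_all_pages(start_url, next_link_text="Next"):
    """Yield each page's HTML, clicking the AJAX 'Next' link until it disappears."""
    # Deferred import so the sketch can be loaded without selenium installed.
    from selenium import webdriver
    from selenium.common.exceptions import NoSuchElementException

    driver = webdriver.PhantomJS()
    try:
        driver.get(start_url)
        while True:
            yield driver.page_source  # hand the rendered page off for parsing
            try:
                # The 'Next' link triggers an AJAX reload rather than a plain href.
                driver.find_element_by_link_text(next_link_text).click()
            except NoSuchElementException:
                break  # no more pages
    finally:
        driver.quit()
```

Each yielded page can then be fed to whatever parser the spider uses.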

The spider can be found at: https://github.com/johnjiang/trolley/blob/master/trolley/spiders/coles.py

Only 47 lines all up, including blank lines, imports, class definitions etc…

Just so you know, this is a very rushed attempt. I didn’t bother cleaning up the data, and only the price and name of each product are scraped.

Have fun!

iCloud sucks

Conceptually I like iCloud. It’s this magical service that devs utilise so that your data is stored safely and securely. You don’t really need to worry about how your data is stored, or where it’s stored, as long as it is stored and you can access it.

My problem with iCloud is that it just doesn’t work in real life.

I can’t SEE my data. How do you access your files outside of the application? You can’t.

Here’s my anecdotal story of how iCloud is a piece of crap.

I use an app called 1SE. You take videos and store 1 second video clips against a particular day. I’ve been doing this for the past 8 months so as you can imagine this takes up quite a bit of space. But thankfully I have iCloud right? It backs it up! Great! Now I don’t have to worry about losing all my files!

But wait…what happens if I need to swap phones?

Using iCloud restore, 1SE just refused to restore…it was stuck on “loading”. After rebooting it was stuck on “waiting”. Great.

Being impatient, I downloaded iExplorer, navigated to the app directory and did a copy-paste of the 1SE app directory. Thankfully that restored all my settings/videos just fine.

Now it makes me wonder: the vast majority of users would have been FUCKED if this happened to them.

Instead of restoring the app AND the data together, iOS should first restore the app and, once it’s opened, allow users to choose to restore from backup. Currently, if you accidentally delete your app and reinstall it, all your data is gone unless you restore your entire phone. This is a serious design flaw.

I also thought about restoring from an iTunes backup, but for some reason it didn’t restore any apps :S I remember there used to be a “sync apps” checkbox in the Apps tab but I couldn’t find it in the new iTunes. Ah well…

Building a microserver

A couple of months ago I bought a HP Proliant Microserver N54L. I had contemplated buying the original model, the N40L, when it was on sale for $200 two years ago but could never justify it. I had plenty of storage at the time. However, as time passed, my storage needs were getting out of hand. All my movies were stored on a 2TB Time Capsule, a 1.5TB external drive held a bunch of TV shows, and then there was random stuff stored locally and/or on other networked computers.

It was a mess.

It also meant that my Apple TV running XBMC could not easily connect to the various shared drives.

So when HP announced the N54L and it was on sale for $300, I jumped at the opportunity.

N54L

N54L inside

 

After having it for around one month, I finally decided to buy the harddrives. I settled on 4x3TB WD Reds. I didn’t want to worry about the potential for Greens to fail on me, so I bit the bullet and paid the premium.

I also got myself 16GB of RAM. You can find the model number in the pics below.

N54L and parts

 

16gb G-Skill Ram

Getting the mobo out was a bit tricky; it meant being a bit forceful with the cables. Eventually I got it out and plugged in my RAM.

N54L mobo

 

The harddrives were really easy to install. The Microserver also comes with an internal USB port, so I bought a high-speed USB stick and booted FreeNAS off it.

A pretty great setup if I say so myself.

So now my microserver hosts all my media files and does all the torrenting too!

It has an additional 5.25″ bay that I could potentially fit 2 more harddrives in. As of now, I still have 4.55TB remaining out of the usable 9TB (3TB goes to parity), which I hope should last me a few years.
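The capacity arithmetic, as a quick sketch (ignoring filesystem overhead and the TB vs TiB distinction):

```python
# 4 drives of 3TB each, with one drive's worth of space consumed by parity
# (single-parity RAID-Z style).
drives = 4
size_tb = 3
raw_tb = drives * size_tb            # 12TB raw
usable_tb = (drives - 1) * size_tb   # parity eats one drive's worth
print(raw_tb, usable_tb)  # 12 9
```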

 

Citcon

So I attended a conference over the weekend: CITCON (www.citconf.com), the Continuous Integration and Testing Conference. I found out about it through the SyPy mailing list. It was free, it was at Atlassian, and I thought I could get away with taking a couple of days off work.

Then I realised that it was on a Friday and Saturday so I thought “Ah well, one day off work is still good.”

So I opened up Outlook and decided to email my managers.

One of them replied with something along the lines of:

It’s on a Friday night and Saturday…and it’s free…so I would ask your girlfriend

I never checked the schedule and just assumed it was a full day on Friday.

Extremely embarrassed, I didn’t even bother trying to dig myself out of that one, so I just replied with a “Good point haha”.

I’ve actually never attended a conference before. I only just started attending meetups last year, so I guess this is the year I also start going to conferences. I didn’t know what to expect from CITCON; my expectations weren’t that high.

It was an open space conference, meaning there was no set agenda: the attendees come up with the topics and vote on what they want to listen to. Friday night mainly involved introducing one another. I should have told them my story above, but I get nervous speaking to big crowds and the thought never entered my mind.

After introductions, people began putting up agendas on the whiteboard whilst others went into the kitchen for some food. They had this kind of Peking-duck-like finger food where the pancake was really thick but soft, and it was just delicious! I think I had 3.

I spent the rest of the night just socialising.

So far so good.

I rocked up Saturday morning and had a bit of breakfast at the conference. Instant coffee and banana bread haha.

The talks themselves weren’t exactly talks. Most of them were just discussions around a central topic or problem. It’s interesting because everybody has similar problems but nobody has a solution that fixes all of them. There’s no specific path to take, just steps in the right direction.

In our first talk, people probably spent around 10 minutes just debating the difference between “Delivery” and “Deployment”. Personally, I found it extremely entertaining; however, it really felt like they were just bikeshedding. To me, the definition is what you make of it.

I found one woman’s method extremely interesting. She noted that she had sacked entire test teams for their ineptness. She commented that you only need one good tester per team to do a good job, and that you’re not doing your job if you need to run the same manual test twice.

Her methodology was to get the test teams to write the acceptance tests in a DSL like Cucumber before any actual code has been written. Testers then manage the entire development process and conduct the deployment themselves. This is similar to what we do at work, actually, where my manager writes up the tests in Concordion. However, I still manage the deployments.

I learnt a fair bit about what an “acceptance test” is. Acceptance tests should not be exhaustive; they focus on basic functionality. A separate job can be run to be more exhaustive. The tests must have a business case, e.g. validating a date format would not be an acceptance test, while validating whether you can withdraw a negative amount would be. Acceptance tests should not be about code coverage; that should be done via unit tests.
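As a sketch of that withdrawal example (the `Account` class and test here are hypothetical, purely to illustrate the “business case” idea):

```python
class Account(object):
    """Toy account with just enough behaviour to express the business rule."""

    def __init__(self, balance=0):
        self.balance = balance

    def withdraw(self, amount):
        if amount <= 0:
            raise ValueError("withdrawal amount must be positive")
        if amount > self.balance:
            raise ValueError("insufficient funds")
        self.balance -= amount


def test_cannot_withdraw_negative_amount():
    """Acceptance-style test: a negative withdrawal must be rejected outright."""
    account = Account(balance=100)
    try:
        account.withdraw(-50)
    except ValueError:
        assert account.balance == 100  # the business rule held: money untouched
    else:
        assert False, "negative withdrawal should have been rejected"
```

The point is that the test encodes one business rule, not an exhaustive validation of every possible input.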

I also attended a Selenium best practices discussion. In short: don’t use XPath, use CSS selectors, make your tests atomic, don’t bother with HtmlUnit or PhantomJS, and test on VMs using Selenium Grid.

I might try converting some of the XPath selectors at work to CSS selectors. Hopefully our tests will run slightly faster.

Overall, CITCON was an awesome experience. I learnt a fair bit about testing and its community. The people are truly great and passionate about what they’re doing. What I’d always imagined to be a boring and mundane role has opened my eyes to the value of testers. I might even pursue a role in testing if I feel that I can’t be bothered being a code monkey.

Now to find the next conference to attend.

The Big Python Migration – Mysteries of HTML2.py

At work we’re currently migrating our webapp from webware to flask. As part of this migration, I’ll be documenting parts of my endeavor. This will be the first post of the series.

In our Python web app, we now officially have 3 ways to generate HTML code.

The original method, which was in use before my time, involved crafting the HTML manually. The previous developer probably thought that this was a waste of time, so he made a bunch of helper methods to do it. Unfortunately, it was still terrible, since you got stuff like:

def td(self, content):
    self.write("<td>")
    self.write(content)

Yes, it doesn’t even bother closing the tag. The actual method is longer but I shortened it for brevity.

So someone decided to make it “better”, and so HTML2.py was born.

Here’s what it looks like:

class TAG(object):
    _attr_quote = '"'

    def __init__(self, tag='TAG', contents=None, no_contents=False, **attributes):
        self.no_contents = no_contents
        contents = contents or []
        if not isseq(contents):
            contents = [contents]
        self.tag = tag
        self.contents = contents
        self.attributes = attributes

    def __add__(self, other):
        self.contents.append(other)
        return self

    def addAttrs(self, **attrs):
        self.attributes.update(attrs)

    def __str__(self):
        try:
            return ''.join(self.str_g())
        except:
            log.exception(str(type(self)) + ' ' + str(self.contents))
            if self.contents:
                log.exception(str(self.contents[0]))
            raise

    def str_g(self):
        yield '<' + self.tag
        for (a, v) in self.attr_g():
            yield ' ' + a + '=' + self._attr_quote + str2(v) + self._attr_quote
        if self.no_contents:
            yield '/>'
        else:
            yield '>'
            for c in self.contents:
                if isinstance(c, TAG):
                    for i in c.str_g():
                        yield i
                else:
                    yield str2(c)
            yield '</' + self.tag + '>'

    def attr_g(self):
        for a, v in self.attributes.iteritems():
            if isseq(v):
                for _v in v:
                    yield a, _v
            else:
                yield a, v

Again…shortened for brevity.

So basically you have an object that represents every tag:

class SPAN(TAG):
    def __init__(self, contents=None, **attributes):
        TAG.__init__(self, 'SPAN', contents, **attributes)
class DIV(TAG):
    def __init__(self, contents=None, **attributes):
        TAG.__init__(self, 'DIV', contents, **attributes)
class PRE(TAG):
    def __init__(self, contents=None, **attributes):
        TAG.__init__(self, 'PRE', contents, **attributes)
class A(TAG):
    def __init__(self, contents=None, **attributes):
        TAG.__init__(self, 'A', contents, **attributes)
class P(TAG):
    def __init__(self, contents=None, **attributes):
        TAG.__init__(self, 'P', contents, **attributes)

Here’s some code to demonstrate how you would construct a table:

table = TABLE()
tr = TR()
tr += TD("Some value")
tr += TD("Some value")
table += tr

Considering that the folks at web2py do a “similar” job, I guess this design isn’t THAT bad. But our underlying implementation is surely terrible.

I’ve gotten used to this implementation, and since it’s a blackbox of sorts, it never occurred to me to ever fix it. However, recently I was working on a page that for some reason took significantly longer to load.

This page took a whole 10 seconds to load on my machine. The weird thing was that this page did not have any complicated business logic nor was it doing any intensive queries.

I isolated the code to one specific function which involved rendering drop downs. The drop downs themselves actually contained around 5000 elements. Hmmm. I added some code around this to time how long it took and I ran the report.

1.6 seconds.

1.6 seconds to generate a drop down with 5000 elements.

The page had 3 of these elements so that’s at least 4.8 seconds.

This is where I decided to try and optimise HTML2.py. I should have done it a long time ago but for whatever reason I chose not to.

The first thing I thought to do was use lxml. I had actually used lxml to replace a couple of our XML-related APIs. It managed to turn a 35-second request into a mere 12-second request, but that’s a different story. So I thought “I’ll just use lxml then”. After spending 10 minutes on it, it was looking hopeful, but I could immediately see problems.

I was constructing etree elements within the to_string method, so for every tag element I was duplicating it with an etree element. I’d definitely need to fix this. Another problem was that lxml escapes HTML automatically and there’s no way to NOT escape it. Unfortunately, due to the way our web app was designed, NOT escaping HTML is a feature…not a vulnerability. So I quickly reverted my code.

Being the lazy programmer that I am, and also incredibly hungry at the time, I posted up a Stack Overflow question in the hope that somebody could help me come up with a solution to this problem.

I came back from lunch, refreshed the page, and was surprised to see two solutions.

The first solution I read happened to use lxml as well and looked quite similar to mine; however, since I knew this wasn’t going to work out, I read the next solution.

It was such an elegant solution that I couldn’t believe I never came up with it myself. It was simple and obvious. I think it was one of those situations where I’d been looking at it for so long that over time I’d become accustomed to the style. Or perhaps I’m just creating excuses for my own ineptness.

So after implementing it, the code became this:

class TAG(object):
    __slots__ = ["contents", "attributes"]
    tag = "TAG"

    def write(self, writer=Writer()):
        writer.write(self.to_string())
    def __init__(self, contents=None, **attributes):
        contents = contents or []
        if not isseq(contents):
            contents = [contents]
        self.contents = contents
        self.attributes = attributes
    def __add__(self, other):
        self.contents.append(other)
        return self
    def addAttrs(self, **attrs):
        self.attributes.update(attrs)
    def __unicode__(self):
        return self.to_string()
    def to_string(self):
        return """<{tag}{attributes}>{content}</{tag}>""".format(
            tag=self.tag,
            attributes=''.join(' %s="%s"' % (attr, _unicode(val)) for attr, val in self.attr_g()),
            content=''.join(_unicode(n) for n in self.contents)
        )
    def attr_g(self):
        for a, v in self.attributes.iteritems():
            if isseq(v):
                for _v in v:
                    yield a, _v
            else:
                yield a, v

I ran my ad-hoc benchmark again. 0.6 seconds.

0.6 seconds to generate the same drop down box with 5000 elements.

60% performance improvement.

I timed the entire page and it took around 4 seconds. I couldn’t believe my eyes. Yes, I know 4 seconds is still incredibly terrible, but compared to what it was before it’s still a reason for celebration.

So. Fast forward 1 month and I’m writing this blog post. I was going to write up a more scientific benchmark; however, something screwed up. Well…I screwed up, more precisely.

Here’s my simple benchmark:

import time
from Lib.HTML.HTML2 import *

start_time = time.time()

table = TABLE()
for _ in xrange(100000):
    table += TR(TD("TEST", some_attribute="foo"))

html_str = str(table)
print time.time() - start_time

I ran it against the original code. 8.84 seconds. So I thought…I should be getting around 4 seconds with the new version. Can you guess what happened?

10.24 seconds

HOW DID THIS HAPPEN?!

HOW DID I GET THE ORIGINAL BENCHMARK?

I was extremely worried, so I began analysing everything again. I compared the outputted HTML from both versions and they matched character for character.

I even pulled up an older version of our app and tested it again, and for some mysterious reason it was performing just as well after the changes. I felt somewhat defeated. I had spent time working on something that arguably provided very little value. The fact that the code is A LOT more readable may not be enough to convince people that it’s a worthwhile change. It would have been too late to change it back anyway, but I was sad knowing that my code had no performance benefits.

Not giving up, I commented out pieces of code to find out if there were any inefficiencies that might hint at where I went wrong. What I found was that when I commented out:

if isseq(v):
    for _v in v:
        yield a, _v
else:
My code ran in 6.7 seconds! I repeated the test over and over again, and my eyes did not fool me. So I delved deeper.

I found this:

def isseq(i):
    if not '__contains__' in dir(i):
        return False

    if isinstance(i, basestring) or isinstance(i, dict):
        return False

    return True

Do you see it?

I facepalmed. It was checking whether the object “i” responded to the message “__contains__”, but it was doing so in an extremely inefficient manner. Quoting the official documentation for the “dir” function:

Without arguments, return the list of names in the current local scope. With an argument, attempt to return a list of valid attributes for that object.

I changed this to:

def isseq(i):
    if not hasattr(i, '__contains__'):
        return False

    if isinstance(i, basestring) or isinstance(i, dict):
        return False

    return True

and ran my test again.

2.4 seconds

Holy mother of crap. isseq was being called on object creation and also when generating the HTML, and it took up the vast majority of processing time. I had done some Apache Bench tests, and on one particular page it would take 37.89 seconds to process 500 requests with 16 threads. After the change it takes just 8.007 seconds. A speedup of nearly 5x just by optimising ONE line of code.
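The difference is easy to reproduce in isolation: dir() builds (and sorts) a list of every attribute name just to check for one, while hasattr() does a single attribute lookup. A quick sketch (timings vary by machine, so I won’t quote numbers):

```python
import timeit

setup = "x = [1, 2, 3]"

# The slow check: materialise every attribute name, then search the list.
t_dir = timeit.timeit("'__contains__' in dir(x)", setup=setup, number=100000)

# The fast check: a direct attribute lookup.
t_hasattr = timeit.timeit("hasattr(x, '__contains__')", setup=setup, number=100000)

print("dir():     %.3fs" % t_dir)
print("hasattr(): %.3fs" % t_hasattr)
```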

How about that page that was taking 4 seconds? 0.45 seconds.

So in the end, whilst the main aim of the code change achieved little, the end result was quite drastic, simply because I went exploring around the codebase. What annoys me is that if I had noticed this earlier, our system would have been a lot faster a lot sooner.

I still have no idea how I managed to get those initial benchmarks…but I swear to god I was not dreaming or making those numbers up.

Well, that’s it for now. I’ll probably post another series on the actual Flask migration and discuss further peculiarities of our system.

Setting up Selenium Grid on Windows 2008 as a service

Doing this was such a pain in the ass that I thought I should blog about it just in case somebody else on the internet runs into the same situation.

Doing Google searches yields a handful of tools for this, but most of them don’t really offer a step-by-step solution; they only provide you with the tool. I found these tools confusing, or maybe I’m just too stupid, so I gave up on using them. Instead I came up with my own solution.

Our server runs Windows 2008, and in the past I had used srvany.exe, part of the Windows Resource Kit, to manage services. However, that kit is limited to Windows 2003, and there doesn’t seem to be an equivalent kit for 2008 (as of this post).

But after googling around, I discovered that all you have to do is copy the exe from the Windows 2003 Resource Kit and it should work.

So that’s exactly what I did!

Here is a step-by-step guide on how to set up Selenium Grid as a Windows 2008 service.

  1. Download Selenium Server (http://seleniumhq.org/download/)
  2. Fetch srvany.exe from a 2003 server with the Resource Kit installed, or try to find it online.
  3. For my setup I moved everything to D:\Selenium, but you can use another directory.
  4. Open up cmd.
  5. Run sc create SeleniumHub binPath= "D:\Selenium\srvany.exe" to create the service.
  6. Open up regedit.
  7. Navigate to “HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\services\SeleniumHub”.
  8. Create the following keys/strings:
    regedit
  9. Start the service!

If you want to set up a Selenium Grid node, do the same thing.
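For reference, srvany reads its configuration from a Parameters subkey of the service: Application points at the executable to run and AppParameters holds its arguments. Something along these lines (the Java and jar paths here are assumptions — adjust them for your setup):

```
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\services\SeleniumHub\Parameters]
"Application"="C:\\Program Files\\Java\\jre7\\bin\\java.exe"
"AppParameters"="-jar D:\\Selenium\\selenium-server-standalone.jar -role hub"
```

A node service is the same, with -role node and the hub’s registration URL added to AppParameters.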

Scraping Woolworths

So I decided to scrape the Woolworths online site for fun. It turns out the entire process was extremely straightforward and easy.

I decided to use Scrapy as the framework, mainly because I’ve heard a lot of good things about it but never actually used it. I’ve always used BeautifulSoup for everything.

What I usually do is find the starting page of any website that I want to scrape. This would usually be the first page.

So I started off at:

http://www2.woolworthsonline.com.au/Shop/Department/2?name=bakery

or more simply:

http://www2.woolworthsonline.com.au/Shop/Department/2

Hmm, that’s interesting. Notice the 2 at the end of the URL. If I change it to, say, 1, I get the baby department. This makes things simple.

That means we can change departments just by URL hackery! (In my implementation I actually find the real department IDs and don’t resort to guessing.)

Looking at the source, we can see that each product is inside a div with class “product-stamp-middle”. I’m more of a CSS selector type of guy, and since there’s no CSS selector parser in Scrapy, I’ve resorted to BeautifulSoup with lxml as the parser. This allows me to fetch all the products with a simple selector:

products = soup.select('div.product-stamp-middle')

Name:
product.select('div.details-container span.description')[0].text.strip()

Price:
product.select('div.price-container span.price')[0].text.strip()

Cup Info (e.g. $1.32/100g, $1 each):
product.select('div.price-container div.cup-price')[0].text.strip()

Each piece of information is in its own element! Woolworths have made this way too easy.
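Put together against a mocked-up product, the selectors above look like this (the HTML below is a hypothetical fragment echoing those class names, not Woolworths’ actual markup, and I’ve used the stdlib html.parser here to keep the snippet dependency-light, where the spider itself uses lxml):

```python
from bs4 import BeautifulSoup

# Hypothetical product markup, mimicking the class structure described above.
html = """
<div class="product-stamp-middle">
  <div class="details-container"><span class="description"> Bread White Loaf </span></div>
  <div class="price-container">
    <span class="price"> $3.50 </span>
    <div class="cup-price"> $0.50 / 100g </div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
for product in soup.select("div.product-stamp-middle"):
    name = product.select("div.details-container span.description")[0].text.strip()
    price = product.select("div.price-container span.price")[0].text.strip()
    cup = product.select("div.price-container div.cup-price")[0].text.strip()
    print(name, price, cup)
```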

The full code can be found on my github: https://github.com/johnjiang/trolley

When I tested it initially, it took around 3 minutes to parse 20k items. Obviously this is dependent on internet connection and machine, but not bad, I thought! What makes it great is that Woolworths online doesn’t do any throttling!

To run it, just git clone the repo, cd into the directory and type this:

scrapy crawl woolies -o results.json -t json

It will dump everything into a JSON file and also display the time spent.

I learnt the following things doing this:

  • Scrapy offers an extremely good workflow for scraping
  • BeautifulSoup is still the winner in my books for parsing
  • Woolworths’ web devs are awesome

Coles on the other hand is terrible…