The Big Python Migration – Mysteries of HTML2.py
At work we’re currently migrating our webapp from webware to flask. As part of this migration, I’ll be documenting parts of my endeavor. This will be the first post of the series.
In our Python web app, we now officially have 3 ways to generate HTML code.
The original method, which was used before my time, involved crafting the html manually. The previous developer probably thought that this was a waste of time so he had made a bunch of helper methods to do this. Unfortunately, it was still terrible since you got stuff like:
def td(self, content): self.write("<td>") self.write(content)
Yes, it doesn’t even bother closing the tag. The actual method is longer but I shortened it for brevity.
So someone decided to make it “better”, and so HTML2.py was born.
Here’s what it looks like:
class TAG(object): def __init__(self, tag='TAG', contents=None, no_contents=False, **attributes): self.no_contents = no_contents contents = contents or [] if not isseq(contents): contents=[contents] self.tag = tag self.contents = contents self.attributes = attributes def __add__(self, other): self.contents.append(other) return self def addAttrs(self, **attrs): self.attributes.update(attrs) def __str__(self): try: return ''.join(self.str_g()) except: log.exception(str(type(self)) + ' ' + str(self.contents)) if self.contents: log.exception(str(self.contents[0])) raise def str_g(self): yield '<' + self.tag for (a, v) in self.attr_g(): yield ' ' + a + '=' + self._attr_quote + str2(v) + self._attr_quote if self.NO_CONTENTS: yield '/>' else : yield '>' for c in self.contents: if isinstance(c, TAG): for i in c.str_g(): yield i else: yield str2(c) yield '</' + self.tag + '>' def attr_g(self): for a, v in self.attributes.iteritems(): if isseq(v): for _v in v: yield a, _v else: yield a, v
Again…shortened for brevity.
So basically you have an object that represent every tag:
class SPAN(TAG): def __init__(self, contents=None, **attributes): TAG.__init__(self, 'SPAN', contents, **attributes) class DIV(TAG): def __init__(self, contents=None, **attributes): TAG.__init__(self, 'DIV', contents, **attributes) class PRE(TAG): def __init__(self, contents=None, **attributes): TAG.__init__(self, 'PRE', contents, **attributes) class A(TAG): def __init__(self, contents=None, **attributes): TAG.__init__(self, 'A', contents, **attributes) class P(TAG): def __init__(self, contents=None, **attributes): TAG.__init__(self, 'P', contents, **attributes)
Here’s some code as to demonstrate how you would construct a table.
table = TABLE() tr = TR() tr += TD("Some value") tr += TD("Some value") table += tr()
Considering that the folks at web2py does a “similar” job, I guess this design isn’t THAT bad. But our underlying implementation is surely terrible.
I’ve gotten used to this implementation, and since it’s a blackbox of sorts, it never occurred to me ever fix it. However, recently I was working on a page that for some reason took significantly longer to load.
This page took a whole 10 seconds to load on my machine. The weird thing was that this page did not have any complicated business logic nor was it doing any intensive queries.
I isolated the code to one specific function which involved rendering drop downs. The drop downs themselves actually contained around 5000 elements. Hmmm. I added some code around this to time how long it took and I ran the report.
1.6 seconds.
1.6 seconds to generate a drop down with 5000 elements.
The page had 3 of these elements so that’s at least 4.8 seconds.
This is where I decided to try and optimise HTML2.py. I should have done it a long time ago but for whatever reason I chose not to.
The first thing I thought to do was use lxml. I had actually used lxml to replace couple of our XML related APIs. It managed to turn a 35 second request into a mere 12 second request, but that’s a different story. So I thought “I’ll just use lxml then”. After spending 10 minutes on it, it was looking hopeful, but immediately I could see problems.
I was constructing etree elements within the to_string method. So for ever tag element I was duplicating it with an etree element. I’ll definitely need to fix this. Another problem was that lxml escapes html automatically and there’s no way to NOT escape the html. Unfortunately, due to the way our web app was designed, NOT escaping html is a feature…not a vulnerability. So I quickly reverted my code.
Being the lazy programmer that I am, and also the fact that I was incredibly hungry at the time. I posted up a stackoverflow question in hope that somebody can help me come up with a solution to this problem.
I came back from lunch. Refreshed the page and was surprised to see two solutions.
The first solution that I read happened to use lxml as well and looked quite similar to my solution, however, since I knew this wasn’t going to work out I read the next solution.
It was such an elegant solution that I couldn’t believe why I never came up with it myself. It was simple and obvious. I think it was one of those situations where I’ve been looking at it for so long that over time I’ve become accustomed to the style. Or perhaps I’m just creating excuses for my own ineptness.
So after implementing it and the code became this:
class TAG(object): __slots__ = ["contents", "attributes"] tag = "TAG" def write(self, writer=Writer()): writer.write(self.to_string()) def __init__(self, contents=None, **attributes): contents = contents or [] if not isseq(contents): contents = [contents] self.contents = contents self.attributes = attributes def __add__(self, other): self.contents.append(other) return self def addAttrs(self, **attrs): self.attributes.update(attrs) def __unicode__(self): return self.to_string() def to_string(self): return """<{tag}{attributes}>{content}</{tag}>""".format( tag=self.tag, attributes=''.join(' %s="%s"' % (attr, _unicode(val)) for attr, val in self.attr_g()), content=''.join(_unicode(n) for n in self.contents) ) def attr_g(self): for a, v in self.attributes.iteritems(): if isseq(v): for _v in v: yield a, _v else: yield a, v
I ran my adhoc benchmark again. 0.6 seconds.
0.6 seconds to generate the same drop down box with 5000 elements.
60% performance improvement.
I timed the entire page and it took around 4 seconds. I couldn’t believe my eyes. Yes, I know 4 seconds is still incredibly terrible…compared to what it was before it’s still a reason for celebration.
So. Fast forward 1 month and I’m writing this blog. I was going to write up a more scientific benchmark, however, something screwed up. Well…I screwed up more precisely.
Here’s my simple benchmark
import time from Lib.HTML.HTML2 import * start_time = time.time() table = TABLE() for _ in xrange(100000): table += TR(TD("TEST", some_attribute="foo")) html_str = str(table) print time.time() - start_time
I ran it against the original code. 8.84 seconds. So I thought…I should be getting around 4 seconds with the new version. Can you guess what happened?
10.24 seconds
HOW DID THIS HAPPEN?!
HOW DID I GET THE ORIGINAL BENCHMARK?
I was extremely worried so I began analysing everything again. I compared the outputted html from both version and they matched character for character.
I even pulled up an older version of our app and tested it again and for some mysterious reason it was performing just as well after the changes. I felt somewhat defeated. I had spent time working on something that provided arguably very little value. The fact that the code is A LOT more readable may not be enough to convince people that it’s a worthwhile change. It would have been too late to change it back anyway but I was sad knowing that my code had no performance benefits.
Not giving up I commented out pieces of code to find out if there’s any inefficiencies that may hint at where I went wrong. What I found out was that when I commented out:
if isseq(v): for _v in v: yield a, _v else:
My code ran at 6.7 seconds! I repeated this test over and over again and my eyes did not fool me. So I delved deeper.
I found this
def isseq(i): if not '__contains__' in dir(i): return False if isinstance(i, basestring) or isinstance(i, dict): return False return True
Do you see it?
I facepalmed. It was checking to see if the object “i” responded to the message “__contains__” however it was doing it in an extremely inefficient manner. Quoting the official documentation for the “dir” function:
Without arguments, return the list of names in the current local scope. With an argument, attempt to return a list of valid attributes for that object.
I changed this to:
def isseq(i): if not hasattr(i, '__contains__'): return False if isinstance(i, basestring) or isinstance(i, dict): return False return True
and ran my test again.
2.4 seconds
Holy mother of crap. isseq was being called on object creation and also when generating the html. This process took up the vast majority of processing. I had done some Apache benchmark tests and on one particular page it would take 37.89 seconds to process 500 requests with 16 threads. After the change it takes just 8.007 seconds. A speed up of nearly 5x just by optimising ONE line of code.
How about that page that was taking 4 seconds? 0.45 seconds.
So in the end, whilst the main aim of the code change did little, the end result was quite drastic simply due to me exploring around in the codebase. What annoys me is that if I had noticed this earlier, our system would have behaved a lot faster a lot earlier.
I still have no idea how I managed to get those initial benchmarks…but I swear to god I was not dreaming or making those numbers up.
Well this is it for now. I’ll probably be posting up another series on the actual Flask migration and discuss further peculiarities of our system.
No opportunity to switch to a HTML templating language such as Jinja2?
Not for the existing codebase. We’re are using Jinja2 when it makes sense though. I created a base.html template that has a content block where the content block is the html generated from the above monstrosity. Baby steps.