Amador Pahim


The Price of Code Golf

apahim

I still remember the feeling. Years ago, like, many years ago, I was sitting at a Free Software conference, watching a presentation called “Shell Script OneLiners” (or something like that). The whole thing revolved around command line sorcery and the magic things we can achieve with one line of Shell script.

– “Two lines?! Why so big? Let’s improve that!”

That was fun and challenging, and I learned a lot from that exercise for my daily work as a Sysadmin (if you’re too young, that’s what we used to call SREs before we called them DevOps. Well, not to oversimplify, first we had to trick them into coding, then to teach them SLO, SLI and Error Budget).

In the System Administration world, Code Golf, you know, making an effort to write the code in fewer lines, usually makes sense. You have a specific problem to solve, and the code that solves it won’t be reviewed or maintained. Actually, it will hardly be saved anywhere other than in your Bash history. You need it again? Just recreate it from scratch.

In that world of |, >, $(), sed, awk and xargs, the limit is the imagination. You want to tail -f a file while converting the timestamp within the square brackets to a human-readable time? Not a problem:

$ tail -f /var/log/foobar.log | awk -F '[][]' '{system("echo -n $(date -d @\"" $2 "\")") ; $2=""; print $0 }'
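For comparison, here is roughly what the same transformation could look like in Python. This is only a sketch, not a translation of the one-liner: the name humanize_line and the sample log line are made up for illustration, it assumes the bracketed field is an integer Unix epoch, and it formats the time in UTC rather than in the local time zone that date would use.

```python
import re
from datetime import datetime, timezone

# Matches the first [digits] group in a line; assumed to be a Unix epoch.
EPOCH_IN_BRACKETS = re.compile(r'\[(\d+)\]')


def humanize_line(line):
    """Replace the first [epoch] in a log line with a readable UTC timestamp."""
    def to_readable(match):
        stamp = datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)
        return stamp.strftime('%Y-%m-%d %H:%M:%S')
    # count=1 leaves any later bracketed numbers in the line untouched.
    return EPOCH_IN_BRACKETS.sub(to_readable, line, count=1)


print(humanize_line('[0] service started'))
```

It is several times longer than the shell version, which is pretty much the trade-off this post is about.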

Of course I’d use that mindset for whatever programming language I would put my hands on. I was taking my first steps with Python, and the fact that Python can be developed interactively is an invitation for Code Golf. What about the comprehension syntax? The lambda function, map(), max(), min(), all(), any(), sort()… I couldn’t resist!

It wasn’t until I joined a Software Engineering team that I started being concerned about the price of being short. I was developing a Python testing framework, the Avocado Project. When developing a testing framework, a new mindset has to take over.

That new mindset is not meant to make you optimize the code in advance. Rather, it makes you think about performance and error paths on a per-line basis. At the beginning, there’s a lot of thinking and profiling. After some time, it becomes automatic to write code that is fast, safe and readable. And that code is usually not short.

Enough talking, let’s use an example. Say you have a Python dictionary and one of the keys has a JSON payload as the value:

data = {'code': 401,
        'message': '{"error": "authorization required"}'}

To get that message, you would:

import json

def get_message_v1(data):
    message_json = data['message']
    return json.loads(message_json)

But the message might not be there. In that case, a KeyError will be raised. If that happens, I want the function to return an empty dictionary:

import json

def get_message_v1(data):
    try:
        message_json = data['message']
        return json.loads(message_json)
    except KeyError:
        return {}

That’s readable code. And it’s fairly fast for the case when the message is present in the data dictionary:

>>> import json
>>> 
>>> from timeit import timeit
>>> 
>>> 
>>> def get_message_v1(data):
...     try:
...         message_json = data['message']
...         return json.loads(message_json)
...     except KeyError:
...         return {}
... 
>>> 
>>> response = {'code': 401,
...             'message': '{"error": "authorization required"}'}
>>>
>>> print(timeit('get_message_v1(response)', globals=globals()))
1.7839420250093099

On the other hand, exception handling is not the cheapest operation in Python:

>>> import json
>>> 
>>> from timeit import timeit
>>> 
>>> 
>>> def get_message_v1(data):
...     try:
...         message_json = data['message']
...         return json.loads(message_json)
...     except KeyError:
...         return {}
... 
>>> 
>>> response = {'code': 401}
>>>
>>> print(timeit('get_message_v1(response)', globals=globals()))
0.30391961097484455

And because I’d expect that exception to happen quite often, maybe I should not look at it as an exception, but as a normal case. To make it faster, we can try using the get() method of the dictionary. It’s supposed to be fast for the case when the key does not exist, because instead of handling the KeyError exception, it returns a default when the key is not present:

import json

def get_message_v2(data):
    message_json = data.get('message', None)
    if message_json is None:
        return {}
    return json.loads(message_json)

Shall we profile it? 🙂

>>> import json
>>> 
>>> from timeit import timeit
>>> 
>>> 
>>> def get_message_v2(data):
...     message_json = data.get('message', None)
...     if message_json is None:
...         return {}
...     return json.loads(message_json)
... 
>>> 
>>> response = {'code': 401,
...             'message': '{"error": "authorization required"}'}
>>> print(timeit('get_message_v2(response)', globals=globals()))
1.8193615709792357
>>> 
>>> response = {'code': 401}
>>> print(timeit('get_message_v2(response)', globals=globals()))
0.1505286100055091

So, our tests are showing that the first case, when the key is there, didn’t change significantly. But the second case, when the key is not present, went from 0.303s to 0.150s for one million executions. How cool is that?
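If you want to repeat this comparison without retyping the session, a small harness along these lines might help. It is a sketch, not the setup used for the numbers above: the compare() helper and the PRESENT/MISSING names are my own, the default run count is lower than one million, and wrapping each call in a lambda adds a constant overhead to every measurement, so only the relative differences are meaningful.

```python
import json
from timeit import timeit


def get_message_v1(data):
    try:
        message_json = data['message']
        return json.loads(message_json)
    except KeyError:
        return {}


def get_message_v2(data):
    message_json = data.get('message', None)
    if message_json is None:
        return {}
    return json.loads(message_json)


# The two inputs used throughout the post (names are hypothetical).
PRESENT = {'code': 401, 'message': '{"error": "authorization required"}'}
MISSING = {'code': 401}


def compare(functions, number=100_000):
    """Time each function on both inputs; returns {(name, case): seconds}."""
    results = {}
    for func in functions:
        for case, payload in (('present', PRESENT), ('missing', MISSING)):
            results[(func.__name__, case)] = timeit(
                lambda: func(payload), number=number)
    return results


if __name__ == '__main__':
    timings = compare([get_message_v1, get_message_v2])
    for (name, case), seconds in timings.items():
        print(f'{name:16} {case:8} {seconds:.4f}s')
```

Absolute numbers will vary with the machine and the Python version, so treat the output as a relative ranking only.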

We should stop there. But you know, OneLiners, right? Let’s try something shorter:

import json

def get_message_v3(data):
    return json.loads(data.get('message', '{}'))

That code looks really neat! When the key is not there, the default will be a JSON payload that, when loaded, becomes an empty dictionary.

Ready for the price of Code Golf?

>>> import json
>>> 
>>> from timeit import timeit
>>> 
>>> 
>>> def get_message_v3(data):
...     return json.loads(data.get('message', '{}'))
... 
>>> 
>>> response = {'code': 401,
...             'message': '{"error": "authorization required"}'}
>>> print(timeit('get_message_v3(response)', globals=globals()))
1.8497470380179584
>>>
>>> response = {'code': 401}
>>> print(timeit('get_message_v3(response)', globals=globals()))
1.7090334280219395

Again, not much difference when the key is there. However, when the key is not there, that implementation is about 10x slower! That’s just because we are calling json.loads() when we don’t really need it.
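One way to see that extra call, rather than infer it from the timings, is to count how often json.loads() runs. The sketch below uses unittest.mock to wrap json.loads with a spy; the count_loads_calls() helper is a name I made up for this illustration.

```python
import json
from unittest import mock


def get_message_v2(data):
    message_json = data.get('message', None)
    if message_json is None:
        return {}
    return json.loads(message_json)


def get_message_v3(data):
    return json.loads(data.get('message', '{}'))


def count_loads_calls(func, data):
    """Return how many times func calls json.loads for the given input."""
    # wraps= keeps the real behaviour; the mock only records the calls.
    with mock.patch('json.loads', wraps=json.loads) as spy:
        func(data)
    return spy.call_count


missing = {'code': 401}
print(count_loads_calls(get_message_v2, missing))  # 0: parser never runs
print(count_loads_calls(get_message_v3, missing))  # 1: parses the '{}' default
```

Both versions return the same empty dictionary, but only the short one pays for a parse to get it.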

“You’re being dramatic”, you might be thinking. Just use a ternary operator and you have your cheap one-liner:

import json

def get_message_v4(data):
    return json.loads(data['message']) if 'message' in data else {}

Before I comment, let’s benchmark it:

>>> import json
>>> 
>>> from timeit import timeit
>>> 
>>> 
>>> 
>>> def get_message_v4(data):
...     return json.loads(data['message']) if 'message' in data else {}
... 
>>> 
>>> response = {'code': 401,
...             'message': '{"error": "authorization required"}'}
>>> print(timeit('get_message_v4(response)', globals=globals()))
1.6530073219910264
>>> 
>>> response = {'code': 401}
>>> print(timeit('get_message_v4(response)', globals=globals()))
0.09722235893644392

Ok, it’s one line, and it’s super cheap. The cheapest so far in every aspect. My problem with ternaries is that nested conditions are hard to read. PEP 20 knows it. This one is quite straightforward, but once you get used to them, it’s hard to avoid using them in more complex expressions.
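To show what I mean by nested ternaries getting hard to read, here is a made-up example (the describe_status functions are hypothetical, not part of the benchmarks in this post). Both do exactly the same thing; the second follows the “Flat is better than nested” line from PEP 20.

```python
def describe_status_nested(code):
    # Chained ternaries: correct, but the reader has to unpack the nesting.
    return 'ok' if code < 400 else 'client error' if code < 500 else 'server error'


def describe_status_flat(code):
    # Same logic with one condition per line.
    if code < 400:
        return 'ok'
    if code < 500:
        return 'client error'
    return 'server error'


for code in (200, 401, 503):
    print(code, describe_status_nested(code), describe_status_flat(code))
```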

Let’s check the flat equivalent to that ternary:

import json

def get_message_v5(data):
    if 'message' in data:
        return json.loads(data['message'])
    return {}

Timing it:

>>> import json
>>> 
>>> from timeit import timeit
>>> 
>>> 
>>> def get_message_v5(data):
...     if 'message' in data:
...         return json.loads(data['message'])
...     return {}
... 
>>> 
>>> response = {'code': 401,
...             'message': '{"error": "authorization required"}'}
>>> print(timeit('get_message_v5(response)', globals=globals()))
1.6680084750941023
>>> 
>>> response = {'code': 401}
>>> print(timeit('get_message_v5(response)', globals=globals()))
0.08939649996906519

Fast and readable. My kind of code 🙂

I would be wary when using that implementation though. There are two dictionary lookups already: one in the if condition and another in the json.loads() call. That implementation could lead you to use the dictionary lookup as the normal way of getting that value, instead of assigning it to a variable. Dictionary lookups are quite cheap, but accessing a variable is cheaper. The rule of thumb is: if you’re going to use that value more than once inside your function, first put it in a variable.
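As a sketch of that rule of thumb, assume a hypothetical helper, get_error_text(), that needs the looked-up value in two places. The dictionary is consulted once and everything after that works on the local variable:

```python
import json


def get_error_text(data):
    """Hypothetical helper: return the 'error' text of the message, or ''."""
    message_json = data.get('message')  # the only dictionary lookup
    if message_json is None:            # first use of the local variable
        return ''
    payload = json.loads(message_json)  # second use, no new lookup
    return payload.get('error', '')


print(get_error_text({'code': 401,
                      'message': '{"error": "authorization required"}'}))
print(get_error_text({'code': 401}))
```

I did not benchmark this one, so take it as an illustration of the pattern rather than a measured improvement.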

Anyway, I could go on and on with examples here, but I guess you got my point by now. A couple of learnings:

Short code is fun to make. Just make sure you’re not paying too much for it. But writing fast code that is easy to read is, in my opinion, a form of art.

Happy hacking!

:wq!
