An Optimization Anecdote

April 07, 2002 | Fredrik Lundh

This was originally posted to the “Is it Python or is it C” thread on comp.lang.python, in March 2000.

Q. If there is a way to optimize Python, short of memorizing it’s source code and/or years of trial and error, someone should have their next book project! (I’ll buy it!) And I say this as someone who doesn’t have enough Python experience to know if this is a true statement or not.

A. How about a sentence instead of a book?

The more time your program spends at the Python level, the more chances you have to restructure your code so that more work is done by compiled C.

To figure out how your program spends its time, use the profiler.

How about an example?

Assume you need a function which calculates a histogram over a greyscale image. This function should return a list, containing the number of pixels in the image having the same value as the index into that list.

Now, the standard way to count things is to use a dictionary. Maybe something like this will work:

def histogram1(image):
    # use a dictionary
    xsize, ysize = image.size
    data = {}
    for y in range(ysize):
        for x in range(xsize):
            v = image.getpixel((x, y))
            data[v] = data.get(v, 0) + 1
    # convert to list
    result = []
    for i in range(256):
        result.append(data.get(i, 0))
    return result

(here, image is a PIL Image object).

Running this on a small image works fine, but when the image gets larger, things will quickly go out of hand. For a 512x512 image, this takes just over 11 seconds on my machine.

Looking at the code, it’s pretty clear that we might speed things up by using a list instead of a dictionary. After all, an 8-bit grayscale image can only contain pixel values from 0 to 255. List indexing should be much faster than dictionary lookups, right?

def histogram2(image):
    # wait, use a list instead
    xsize, ysize = image.size
    result = [0] * 256
    for y in range(ysize):
        for x in range(xsize):
            v = image.getpixel((x, y))
            result[v] = result[v] + 1
    return result

Less code. But it doesn’t run much faster: just under 9 seconds, on my machine.

:::

When intuition fails, it’s time to get help. Let’s run this under the Python profiler, and see if we can figure out what’s going on here:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   17.463   17.463   58.874   58.874 <script>:26(histogram2)
   262144   29.388    0.000   41.412    0.000 Image.py:501(getpixel)

As you can see, most of the time (41 seconds when running under the profiler) is spent down in the Image module. Note that we’re doing 262144 Python-level calls to getpixel (one for each pixel, for obvious reasons).

What if we could get rid of some of these?

in fact, PIL contains a getdata method that returns a “flattened” one-dimensional view of the 2D pixel array. This lets us process all pixels inside a single for-in loop:

def histogram3(image):
    # wait, use a list, but get rid of that getpixel call
    result = [0] * 256
    for v in image.getdata():
        result[v] = result[v] + 1
    return result

Running this takes just over one (1) second on my machine. about 10 times faster than our first attempt. That’s pretty good, don’t you think?

Lets use the profiler again:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    1.234    1.234    1.234    1.234 <script>:36(histogram3)
        1    0.000    0.000    0.000    0.000 Image.py:487(getdata)

This time, there’s only one call down to Image.py — and that call doesn’t take any time at all! (The getdata method returns a reference to an already existing sequence object. A Python method cannot be much faster…)

:::

Now, assuming that one second is still too much, is there anything else we can do? Well, that “x=x+1” stuff looks fishy. Maybe we can count things a bit faster? In fact, the list object contains a method called count that does exactly this — it tells us how many times a given value is found in a list.

As it turns out, doing something like im.getdata().count(1) doesn’t work — the getdata method returns a sequence object, but that object doesn’t implement the full list interface. We can turn it into a real list using the list function:

def histogram4(image):
    # wait, use the list.count operator
    data = list(image.getdata())
    result = []
    for i in range(256):
        result.append(data.count(i))
    return result

Unfortunately, running this takes over 23 seconds. more than twice as slow as the original implementation. And some additional tests reveal that 22 of those are spent in the count method. What’s wrong here? Isn’t count implemented in C, or what?

(if you haven’t realized what’s wrong here, why not spend a minute or so and see if you can figure it out by yourself?)

:::

The problem here is that we’ve replaced 2x512x512 list accesses with 256x512x512 integer comparisions.

Even though everything’s done at the C level, doing 16 million calls to the integer object takes a while…

:::

So is one second per image really the best we can do?

Nope — a little more reading of the PIL handbook reveals a method that does all this for us, using a tightly written C loop.

def histogram5(image):
    return image.histogram()

This script takes slightly less than 0.05 seconds on my machine. That’s over 200 times faster than the first version, and more than 450 times faster than the slowest one…

Summary

Summing up, the more you do in C, the faster your program runs. But C doesn’t help if you use a lousy algorithm…

(one early PIL adopter told me they’d rewritten a C++ program in Python, using PIL, and was rather surprised to find that the Python code ran nearly three times faster ;-)

Or in other words, if your Python program doesn’t run fast enough, use the profiler.