A Simple BitTorrent "bencode" Decoder

Fredrik Lundh | August 2007

Here’s a simple decoder for BitTorrent’s ~~torrent encoding format~~ (dead link), bencode. This uses the iterator-based approach from Simple Iterator-based Parsing, which results in readable and pretty efficient code.

import re

def tokenize(text, match=re.compile("([idel])|(\d+):|(-?\d+)").match):
    i = 0
    while i < len(text):
        m = match(text, i)
        s = m.group(m.lastindex)
        i = m.end()
        if m.lastindex == 2:
            yield "s"
            yield text[i:i+int(s)]
            i = i + int(s)
        else:
            yield s

def decode_item(next, token):
    if token == "i":
        # integer: "i" value "e"
        data = int(next())
        if next() != "e":
            raise ValueError
    elif token == "s":
        # string: "s" value (virtual tokens)
        data = next()
    elif token == "l" or token == "d":
        # container: "l" (or "d") values "e"
        data = []
        tok = next()
        while tok != "e":
            data.append(decode_item(next, tok))
            tok = next()
        if token == "d":
            data = dict(zip(data[0::2], data[1::2]))
    else:
        raise ValueError
    return data

def decode(text):
    try:
        src = tokenize(text)
        data = decode_item(src.next, src.next())
        for token in src: # look for more tokens
            raise SyntaxError("trailing junk")
    except (AttributeError, ValueError, StopIteration):
        raise SyntaxError("syntax error")
    return data

Most encoded objects consist of a type code followed by the object value and an end code; the exception is encoded strings that are prefixed with their length. To simplify the parser, the tokenize function returns strings as two “virtual” tokens; a type code followed by the string value. Some examples:

>>> list(tokenize("4:spam"))
['s', 'spam']
>>> list(tokenize("i3e")) # int
['i', '3', 'e']
>>> list(tokenize("i-3e"))
['i', '-3', 'e']
>>> list(tokenize("l4:spam4:eggse")) # list
['l', 's', 'spam', 's', 'eggs', 'e']
>>> list(tokenize("d3:cow3:moo4:spam4:eggse")) # dict
['d', 's', 'cow', 's', 'moo', 's', 'spam', 's', 'eggs', 'e']

Note that simpler formats can usually use the finditer method to split things up (or even findall).

The decode_item function uses the type information to convert the data to a suitable Python object. It calls itself to deal with nested container types.

Usage #

Here’s a snippet that lists all files included in a torrent file:

data = open("test.torrent", "rb").read()

torrent = decode(data)

for file in torrent["info"]["files"]:
    print "%r - %d bytes" % ("/".join(file["path"]), file["length"])

Running this will produce something like:

'Little Earthquakes/01 Crucify.m4a' - 4845721 bytes
'Little Earthquakes/02 Girl.m4a' - 4012517 bytes
'Little Earthquakes/03 Silent All These Years.m4a' - 4076790 bytes
'Little Earthquakes/04 Precious Things.m4a' - 4328948 bytes
'Little Earthquakes/05 Winter.m4a' - 5538530 bytes
'Little Earthquakes/06 Happy Phantom.m4a' - 3204091 bytes
'Little Earthquakes/07 China.m4a' - 4859246 bytes
'Little Earthquakes/08 Leather.m4a' - 3125716 bytes
'Little Earthquakes/09 Mother.m4a' - 6785591 bytes
'Little Earthquakes/10 Tear In Your Hand.m4a' - 4515482 bytes
'Little Earthquakes/11 Me And A Gun.m4a' - 3649914 bytes
'Little Earthquakes/12 Little Earthquakes.m4a' - 6663794 bytes