A Simple BitTorrent "bencode" Decoder
Fredrik Lundh | August 2007
Here’s a simple decoder for BitTorrent’s torrent encoding format (dead link), bencode. This uses the iterator-based approach from Simple Iterator-based Parsing, which results in readable and pretty efficient code.
import re def tokenize(text, match=re.compile("([idel])|(\d+):|(-?\d+)").match): i = 0 while i < len(text): m = match(text, i) s = m.group(m.lastindex) i = m.end() if m.lastindex == 2: yield "s" yield text[i:i+int(s)] i = i + int(s) else: yield s def decode_item(next, token): if token == "i": # integer: "i" value "e" data = int(next()) if next() != "e": raise ValueError elif token == "s": # string: "s" value (virtual tokens) data = next() elif token == "l" or token == "d": # container: "l" (or "d") values "e" data = [] tok = next() while tok != "e": data.append(decode_item(next, tok)) tok = next() if token == "d": data = dict(zip(data[0::2], data[1::2])) else: raise ValueError return data def decode(text): try: src = tokenize(text) data = decode_item(src.next, src.next()) for token in src: # look for more tokens raise SyntaxError("trailing junk") except (AttributeError, ValueError, StopIteration): raise SyntaxError("syntax error") return data
Most encoded objects consist of a type code followed by the object value and an end code; the exception is encoded strings that are prefixed with their length. To simplify the parser, the tokenize function returns strings as two “virtual” tokens; a type code followed by the string value. Some examples:
>>> list(tokenize("4:spam")) ['s', 'spam'] >>> list(tokenize("i3e")) # int ['i', '3', 'e'] >>> list(tokenize("i-3e")) ['i', '-3', 'e'] >>> list(tokenize("l4:spam4:eggse")) # list ['l', 's', 'spam', 's', 'eggs', 'e'] >>> list(tokenize("d3:cow3:moo4:spam4:eggse")) # dict ['d', 's', 'cow', 's', 'moo', 's', 'spam', 's', 'eggs', 'e']
Note that simpler formats can usually use the finditer method to split things up (or even findall).
The decode_item function uses the type information to convert the data to a suitable Python object. It calls itself to deal with nested container types.
Usage #
Here’s a snippet that lists all files included in a torrent file:
data = open("test.torrent", "rb").read() torrent = decode(data) for file in torrent["info"]["files"]: print "%r - %d bytes" % ("/".join(file["path"]), file["length"])
Running this will produce something like:
'Little Earthquakes/01 Crucify.m4a' - 4845721 bytes 'Little Earthquakes/02 Girl.m4a' - 4012517 bytes 'Little Earthquakes/03 Silent All These Years.m4a' - 4076790 bytes 'Little Earthquakes/04 Precious Things.m4a' - 4328948 bytes 'Little Earthquakes/05 Winter.m4a' - 5538530 bytes 'Little Earthquakes/06 Happy Phantom.m4a' - 3204091 bytes 'Little Earthquakes/07 China.m4a' - 4859246 bytes 'Little Earthquakes/08 Leather.m4a' - 3125716 bytes 'Little Earthquakes/09 Mother.m4a' - 6785591 bytes 'Little Earthquakes/10 Tear In Your Hand.m4a' - 4515482 bytes 'Little Earthquakes/11 Me And A Gun.m4a' - 3649914 bytes 'Little Earthquakes/12 Little Earthquakes.m4a' - 6663794 bytes