Sandbox: SourceForge Tools (In Progress)

Fredrik Lundh | April 2006

Note: The sourceforge page layout changed slightly after the first version of these tools was released. We’re working on a new version, but if you want to use the tools to experiment with older tracker snapshots, you need version 200604. See below.

The ~~sourceforge sandbox~~ (dead link) contains a set of simple tools to download and process sourceforge tracker items.

You need either Python 2.4 and the ElementTree library (cElementTree is recommended), or Python 2.5 (which ships with cElementTree).
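A common way to pick up whichever ElementTree implementation is installed is a chain of import attempts. The sketch below is not taken from the tools themselves; the final fallback to xml.etree.ElementTree is an addition for recent Python versions, where the separate cElementTree module no longer exists.

```python
# Try the available ElementTree implementations in turn, fastest first.
try:
    import xml.etree.cElementTree as ET          # bundled with Python 2.5 and later 2.x/early 3.x
except ImportError:
    try:
        import cElementTree as ET                # standalone C implementation (Python 2.4)
    except ImportError:
        try:
            import elementtree.ElementTree as ET # standalone pure-Python implementation
        except ImportError:
            import xml.etree.ElementTree as ET   # recent Python 3 releases

# Quick check that the selected module works (the element names are made up).
root = ET.fromstring("<item id='1'><summary>example</summary></item>")
print(root.findtext("summary"))
```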

To run the download tools, you also need the tidy utility.

Current Version (200608, Work in Progress)

To download the current version of the tools, use Subversion:

$ svn co http://svn.effbot.python-hosting.com/stuff/sandbox/sourceforge

Previous Version (200604)

This version is compatible with the sourceforge tracker layout used in April 2006.

$ svn co http://svn.effbot.python-hosting.com/tags/sourceforge-200604/

A snapshot of the Python tracker data from April 2006 can be downloaded here:

~~tracker-20060403.zip~~ (dead link) [~10000 items, 80 MB]

Tracker Datasets

Tracker data is represented as a set of files in a tracker directory. For each tracker item, there are at least two files:

    tracker-TTT/item-NNN.xml (index information, created by getindex.py)
    tracker-TTT/item-NNN-page.xml (xhtml pages, created by getpages.py)

where TTT is the tracker identifier, and NNN is the item identifier.

For items that have attached files, there’s also one or more

    tracker-TTT/item-NNN-data-MMM.dat (data files, created by getfiles.py)

files, where MMM is a file identifier (referred to by the page files). Each data file consists of a copy of the HTTP header (which includes content-type and content-disposition headers), followed by an empty line and the actual data.
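The data file layout described above can be read back with a few lines of Python. This is a minimal sketch, not code from the tools: the function name split_data_file is hypothetical, and it assumes that the header block may or may not start with an HTTP status line and may use either CRLF or LF line endings.

```python
import re
from email.parser import HeaderParser

def split_data_file(raw):
    """Split the raw bytes of a data file into (headers, body)."""
    # The header block ends at the first empty line (CRLF or LF).
    match = re.search(b"\r?\n\r?\n", raw)
    if match is None:
        raise ValueError("no empty line found after the header block")
    head = raw[:match.start()].decode("latin-1")
    body = raw[match.end():]
    lines = head.splitlines()
    # Assumption: drop a leading status line such as "HTTP/1.1 200 OK".
    if lines and lines[0].startswith("HTTP/"):
        lines = lines[1:]
    headers = HeaderParser().parsestr("\n".join(lines))
    return headers, body

# In-memory sample standing in for an item-NNN-data-MMM.dat file.
sample = (b"Content-Type: text/plain\r\n"
          b"Content-Disposition: attachment; filename=\"patch.diff\"\r\n"
          b"\r\n"
          b"hello\n\nworld")
headers, body = split_data_file(sample)
print(headers.get_content_type(), headers.get_filename(), len(body))
```

For a real file, read it in binary mode and pass the bytes to the function.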

Note that the datasets contain complete HTML pages. This lets you fix bugs in the extraction tools without having to download everything again (or fetch large existing datasets).

Processing Tracker Datasets

To process tracker datasets, use the extract module to extract relevant information from item-NNN-page.xml files. See the export scripts for examples:

~~csv-export.py~~ (dead link): A simple dataset to CSV exporter.
~~xml-export.py~~ (dead link): A simple dataset to XML exporter. The resulting XML file contains all data from the tracker dataset, including attached files (stored as BASE64-encoded blocks).
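To illustrate the idea of storing attached files as BASE64-encoded blocks in XML, here is a small round-trip sketch. The element name "file" and the attributes "name" and "encoding" are hypothetical; the actual format written by xml-export.py is not documented here and may differ.

```python
import base64
import xml.etree.ElementTree as ET

def make_file_element(name, data):
    """Wrap binary data in an XML element as BASE64 text."""
    elem = ET.Element("file", name=name, encoding="base64")
    elem.text = base64.b64encode(data).decode("ascii")
    return elem

def read_file_element(elem):
    """Recover the original bytes from an element built above."""
    return base64.b64decode(elem.text)

original = b"\x00\x01binary data"
elem = make_file_element("patch.diff", original)
xml_text = ET.tostring(elem)                       # serialize to bytes
restored = read_file_element(ET.fromstring(xml_text))
print(restored == original)
```

BASE64 keeps arbitrary bytes safe inside XML text, at the cost of roughly a third more space.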

More export scripts, bug fixes, and other contributions are welcome.

Downloading and Updating Tracker Datasets

To download tracker datasets, run 'init' to set things up, and use the
getindex/getpages/getfiles scripts to download items.

* init

The 'init' script is used to select which tracker to download.  It asks
for a tracker "group id".  To get the group id for your project, check
the URL for the tracker homepage.  If you press return, the group id
defaults to 5470, which is the group id for the Python tracker.

The 'init' script downloads the tracker homepage, and creates tracker
directories for the individual trackers used by the given project.

    $ python init.py

    enter sourceforge tracker group id [5470]: 1234

    --- create tracker-123456

You only have to run the 'init' script once for each project.

* getindex

The 'getindex' script parses the tracker index, and creates item
files that contain overview information from the index pages.
Usage:

    $ python getindex.py tracker-123456 [offset]

If the offset is omitted, the parser starts at offset 0, and keeps
going until it gets an index page for which all items have already
been downloaded.  If an offset is given, the parser keeps going until
it cannot find any more items.
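The two stopping rules can be sketched as a single loop. This is an illustration only, not the code in getindex.py: the callables fetch_page, have_item and save_item are hypothetical, and the sketch assumes that the offset advances by the number of items on each index page and that already downloaded items are never saved again.

```python
def scan_index(fetch_page, have_item, save_item, offset=None):
    """Walk the index pages and return the number of new items saved."""
    explicit = offset is not None      # was an offset given by the caller?
    position = offset or 0
    saved = 0
    while True:
        items = fetch_page(position)
        if not items:
            break                      # no more items: stop in both modes
        new = [item for item in items if not have_item(item)]
        if not new and not explicit:
            break                      # default mode: page fully downloaded
        for item in new:
            save_item(item)
        saved += len(new)
        position += len(items)
    return saved

# In-memory stand-in for a tracker index: offset -> item ids on that page.
pages = {0: [5, 4], 2: [3, 2], 4: [1]}
done = set([3, 2, 1])                  # items downloaded earlier

def fetch(position):
    return pages.get(position, [])

count = scan_index(fetch, done.__contains__, done.add)
print(count, sorted(done))
```

In the default mode the loop stops at the second page, since both of its items are already present; with an explicit offset it would continue to the end of the index.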

You can use the output from 'getindex' to generate tracker statistics.
To get more information about the items, use the 'getpages' and
'getfiles' scripts.

* getpages

The 'getpages' script looks for item files, and downloads missing page
files.

    $ python getpages.py tracker-123456

To refresh the page files, remove them from the tracker directory, and
run the 'getpages' script again.

    $ rm tracker-123456/*-page.xml
    $ python getpages.py tracker-123456
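To see which items 'getpages' would have to fetch, you can compare item files against page files yourself. The helper below is a hypothetical sketch based only on the file naming scheme described earlier; it assumes that item identifiers are numeric.

```python
import os
import re

# Matches item-NNN.xml but not item-NNN-page.xml.
ITEM_FILE = re.compile(r"^item-(\d+)\.xml$")

def missing_pages(tracker_dir):
    """Return the ids of items that have no page file yet."""
    names = set(os.listdir(tracker_dir))
    missing = []
    for name in sorted(names):
        match = ITEM_FILE.match(name)
        if match and "item-%s-page.xml" % match.group(1) not in names:
            missing.append(match.group(1))
    return missing
```

Calling missing_pages("tracker-123456") right after removing the page files should list every item in the directory.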

* getfiles

The 'getfiles' script, finally, looks for download links in the
page files, and downloads missing data files.

    $ python getfiles.py tracker-123456

* status

The 'status' script can be used to get a download status summary:

    $ python status.py
    tracker-123456
        6682 items
        6682 pages (100%)
        1912 files
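A similar summary can be computed directly from the file names in a tracker directory. This is a rough sketch under the naming scheme described earlier, not the code in status.py; the function name and the patterns are assumptions.

```python
import os
import re

# One pattern per file kind, following the naming scheme of the datasets.
PATTERNS = {
    "items": re.compile(r"^item-\d+\.xml$"),
    "pages": re.compile(r"^item-\d+-page\.xml$"),
    "files": re.compile(r"^item-\d+-data-.+\.dat$"),
}

def summarize(tracker_dir):
    """Count item, page and data files in a tracker directory."""
    counts = dict.fromkeys(PATTERNS, 0)
    for name in os.listdir(tracker_dir):
        for kind, pattern in PATTERNS.items():
            if pattern.match(name):
                counts[kind] += 1
    return counts

def page_percentage(counts):
    """Share of items that already have a page file, in whole percent."""
    if not counts["items"]:
        return 0
    return 100 * counts["pages"] // counts["items"]
```

Files that match none of the patterns are ignored.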