PyZone Archive Files

Fredrik Lundh | November 2006

A PyZone archive file represents a collection of documents, such as a zone, in a single XML file.

Here’s an outline of this format:

<zone name='identifier' last-modified='timestamp'>
  <article name='identifier' category='list of categories'>
    <title> article title </title>
    <body>
       article body text (an XHTML body fragment), minus the title
    </body>
    <comments count='count'> (optional)
      <comment> text </comment> (optional)
      ...
    </comments>
    <terms source='mechanism'> (optional)
      <term> phrase </term>
      ...
    </terms>
    ...
    <source> source url </source> (optional)
  </article>
  ...
</zone>
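
As a rough illustration of the outline above, here is a minimal sketch that walks an archive with the standard library's ElementTree module (Python 3). The sample document, the timestamp format, and the `list_articles` helper are made up for this example. The sketch also assumes that the elements are not namespaced and that the category list is separated by whitespace; the outline does not say either way.

```python
import xml.etree.ElementTree as ET

# hypothetical sample archive, following the outline above
SAMPLE = """\
<zone name='example' last-modified='2006-11-01T12:00:00'>
  <article name='first-article' category='faq general'>
    <title> First article </title>
    <body><p>Some <b>body</b> text.</p></body>
  </article>
  <article name='second-article' category='faq'>
    <title> Second article </title>
    <body><p>More text.</p></body>
  </article>
</zone>
"""

def list_articles(xml_text):
    """Return a (name, categories, title) tuple for each article."""
    zone = ET.fromstring(xml_text)
    articles = []
    for article in zone.findall("article"):
        name = article.get("name")
        # assumption: categories are separated by whitespace
        categories = article.get("category", "").split()
        title = article.findtext("title", "").strip()
        articles.append((name, categories, title))
    return articles

for name, categories, title in list_articles(SAMPLE):
    print(name, categories, title)
```

For a large archive, `ET.iterparse` may be a better fit than loading the whole document at once.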

Elements

The body element contains XHTML body fragments. Link targets in a body can be either usual “http:” targets (for external links), or special “link:” targets, which need to be converted by the renderer. A link target has the following syntax:

    link:target-domain:identifier

where the target-domain is one of

zone: Articles within this zone.
python: The standard Python namespace, with keywords, builtins, and library functions. To identify what a link points to, start by checking the first part against available keywords, builtins, and library modules, in that order. You can then use remaining parts of the link, if any, to drill down further, or just link to the top-level concept. There’s currently no mechanism to distinguish between builtins and modules with the same name (e.g. repr).
svn: The Python source code namespace. This is used to refer to Python source files. The identifier is the full path to the file, relative to the source directory.
c: A C function or macro, either part of C’s standard library, or the Python C library.
pep: Python Enhancement Proposals. The identifier is the PEP number, given as a decimal integer. The number may or may not have leading zeros, so the link translator should normalize it.
rfc: Internet RFC. The identifier is the RFC number, given as a decimal integer.
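
The link syntax and the lookup order for the python domain can be sketched roughly as follows (Python 3). The `parse_link` and `classify_python_name` helpers are hypothetical names for this example, not part of any PyZone tool. The sketch assumes that the parts of a python identifier are separated by dots, and it takes the set of known library modules as an argument, since the format does not say where that list comes from.

```python
import builtins
import keyword

DOMAINS = ("zone", "python", "svn", "c", "pep", "rfc")

def parse_link(target):
    """Split a 'link:' target into a (domain, identifier) tuple.

    Returns None for ordinary targets, such as 'http:' URLs.
    """
    if not target.startswith("link:"):
        return None
    # split at most twice, in case the identifier contains a colon
    parts = target.split(":", 2)
    if len(parts) != 3 or parts[1] not in DOMAINS:
        raise ValueError("malformed link target: %r" % target)
    domain, identifier = parts[1], parts[2]
    if domain in ("pep", "rfc"):
        # normalize the number by dropping any leading zeros
        identifier = str(int(identifier))
    return domain, identifier

def classify_python_name(identifier, modules):
    """Check the first part against keywords, builtins, and modules."""
    first = identifier.split(".")[0]  # assumption: dotted identifiers
    if keyword.iskeyword(first):
        return "keyword"
    if hasattr(builtins, first):
        return "builtin"
    if first in modules:
        return "module"
    return None

print(parse_link("link:pep:0008"))
print(classify_python_name("os.path", {"os"}))
```

Because builtins are checked before modules, a name that exists in both places always resolves to the builtin, which is the limitation noted in the python entry above.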

The comments element may be included for articles with comments; if present, it usually only contains a count attribute. You can use the source element, if present, to link back to the page that holds the comments.

Each article may also have one or more terms elements. These contain terms and keywords, either added manually, or via automatic term extraction mechanisms.
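
Since the comments, terms, and source elements are all optional, a reader has to allow for any of them being absent. Here is one way to do that with ElementTree (Python 3). The sample article, the 'manual' source value, and the `describe_extras` helper are invented for this sketch.

```python
import xml.etree.ElementTree as ET

# hypothetical article; 'manual' is a made-up mechanism name
ARTICLE = """\
<article name='example-article' category='faq'>
  <title> Example </title>
  <body><p>Text.</p></body>
  <comments count='3' />
  <terms source='manual'>
    <term> archive </term>
    <term> zone </term>
  </terms>
  <source> http://example.invalid/example-article </source>
</article>
"""

def describe_extras(article):
    """Return (comment count, source url, terms by mechanism)."""
    count = 0
    comments = article.find("comments")
    if comments is not None:
        # fall back to counting comment elements if count is missing
        count = int(comments.get("count", len(comments.findall("comment"))))
    source = article.findtext("source")
    if source is not None:
        source = source.strip()
    terms = {}
    for group in article.findall("terms"):
        phrases = terms.setdefault(group.get("source"), [])
        for term in group.findall("term"):
            if term.text:
                phrases.append(term.text.strip())
    return count, source, terms

print(describe_extras(ET.fromstring(ARTICLE)))
```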

Examples

The Python FAQ staging area is available as an archive file. You can get a copy from this URL:

http://effbot.org/pyfaq.xml (~400k)

If you decide to use this for anything except testing, please use a conditional fetch to make sure that you only download it if it has actually changed. You can use something like:

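import urllib.request  # Python 3 successor to the urllib2 module

def get_signature(uri):
    # issue a HEAD request, so that only the headers are transferred
    request = urllib.request.Request(uri, method="HEAD")
    with urllib.request.urlopen(request) as http_file:
        headers = http_file.headers
    # a header that the server leaves out is treated as an empty string
    return "/".join((
        headers.get("last-modified", ""),
        headers.get("etag", ""),
        headers.get("content-length", ""),
        ))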

to get the file signature, and do a full fetch if the signature has changed since you last downloaded the file.
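
The compare-and-fetch step could look something like the sketch below (Python 3). The `fetch_if_changed` helper and the ".signature" sidecar file are assumptions made for this example. The signature function is passed in as an argument, so a function like `get_signature` above can be used as is.

```python
import os
import urllib.request

def fetch_if_changed(uri, filename, get_signature,
                     download=urllib.request.urlretrieve):
    """Download uri to filename, unless the signature is unchanged.

    Returns True if a full fetch was made, False otherwise.
    """
    signature_file = filename + ".signature"  # hypothetical sidecar file
    new_signature = get_signature(uri)
    old_signature = None
    if os.path.exists(filename) and os.path.exists(signature_file):
        with open(signature_file) as f:
            old_signature = f.read()
    if new_signature == old_signature:
        return False  # nothing has changed; keep the local copy
    download(uri, filename)
    # remember the signature for the next check
    with open(signature_file, "w") as f:
        f.write(new_signature)
    return True
```

A typical call would be `fetch_if_changed("http://effbot.org/pyfaq.xml", "pyfaq.xml", get_signature)`. The `download` argument is only there to make it easy to swap in another fetch mechanism, or a stub for testing.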