ElementTree: Working with Qualified Names
Updated December 8, 2005 | July 27, 2002 | Fredrik Lundh
The elementtree module supports qualified names (QNames) for element tags and attribute names. A qualified name consists of a (uri, local name) pair.
Qualified names were introduced with the XML Namespaces specification.
Storing Qualified Names in Element Trees
The element tree represents a qualified name pair as a string of the form “{uri}local”.
The following example creates an element where the tag is the qualified name pair (http://spam.effbot.org, egg).
elem = Element("{http://spam.effbot.org}egg")
To check if a name is a qualified name, you can do:
if elem.tag[0] == "{": ...
(You can also use the startswith method, but the method call overhead makes that a lot slower in current Python versions.)
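The same test can be used to take a qualified name apart again. The following is a minimal sketch, assuming the standard-library xml.etree.ElementTree module (the packaged descendant of elementtree); split_qname is a hypothetical helper name used for illustration, not a function provided by the library.

```python
from xml.etree.ElementTree import Element

def split_qname(tag):
    """Split a "{uri}local" string into (uri, local).

    Hypothetical helper; uri is None when the tag is not a qualified name.
    """
    if tag[:1] == "{":
        # everything up to the first "}" is the namespace URI
        uri, local = tag[1:].split("}", 1)
        return uri, local
    return None, tag

elem = Element("{http://spam.effbot.org}egg")
print(split_qname(elem.tag))  # ('http://spam.effbot.org', 'egg')
print(split_qname("egg"))     # (None, 'egg')
```

The slice tag[:1] is used instead of tag[0] so that the helper also copes with an empty string.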
Storing Qualified Names in XML Files
In theory, we could store qualified names right away in XML files. For example, let’s use the {uri}local notation in the file itself:
<{http://spam.effbot.org}egg> some content </{http://spam.effbot.org}egg>
There are two problems with this approach. One is that, according to the XML base specification, { and } cannot be used in element tags and attribute names. Another, more important problem is bloat; even with a short URI like the one used in the example above, we’ll end up adding nearly 50 bytes to each element. Put a couple of thousand elements in a file, and use longer URIs, and you’ll quickly end up with hundreds of kilobytes of extra data.
To get around this, the XML namespace authors came up with a simple encoding scheme. In an XML file, a qualified name is written as a namespace prefix and a local part, separated by a colon: e.g. “prefix:local”.
Special xmlns:prefix attributes are used to provide a mapping from prefixes to URIs. Our example now looks like this:
<spam:egg xmlns:spam='http://spam.effbot.org'> some content </spam:egg>
For a single element, this doesn’t save us that much. But the trick is that once a prefix is defined, it can be used in hundreds of thousands of places. If you really want to minimize the overhead, you can pick one-character prefixes, and get away with four bytes extra per element.
Note, however, that xmlns attributes only affect the element they belong to, and any subelements of that element. But an element can define a prefix even if it doesn’t use it itself, so you can simply put all namespace attributes on the toplevel (document) element, and be done with it.
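To see how the parser turns prefixed names back into the {uri}local form, here is a small sketch, again assuming the standard-library xml.etree.ElementTree module. The basket element is a made-up wrapper, added only so that the prefix is declared on the toplevel element and used by a subelement.

```python
from xml.etree.ElementTree import fromstring

# the prefix is declared once, on the toplevel element
xml_text = (
    "<spam:basket xmlns:spam='http://spam.effbot.org'>"
    "<spam:egg>some content</spam:egg>"
    "</spam:basket>"
)

root = fromstring(xml_text)

# both the toplevel element and its subelement get the full URI
tags = [elem.tag for elem in root.iter()]
print(tags)

# the xmlns:spam declaration is consumed by the parser;
# it does not show up as an ordinary attribute
print(root.attrib)
```

The prefix itself is not stored in the tree; only the URI and the local part survive parsing.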
Qualified Attribute Values
XML languages like WSDL and SOAP use qualified names both as names and as attribute values. The standard parser maps prefixes to URIs for element tags and attribute names, but it cannot do this for attribute values; an attribute value with a colon in it may be a qualified name, or it may be some arbitrary string that just happens to have a colon in it.
And once the element tree has been created, it’s too late to map prefixes to namespace URIs; we need to know the prefix mapping that applied to the element where the attribute appears.
To work around this, the recommended approach is to use the iterparse function, and do the necessary conversions on the fly. In the following example, the namespaces variable will contain a list of (prefix, uri) pairs for all active namespaces.
events = ("end", "start-ns", "end-ns")
namespaces = []
for event, elem in iterparse(source, events=events):
    if event == "start-ns":
        namespaces.append(elem)
    elif event == "end-ns":
        namespaces.pop()
    else:
        ...
Note that the most recent namespace declaration is added to the end of the list; to find the URI for a given prefix, you have to search backwards:
def geturi(prefix, namespaces):
    for p, uri in reversed(namespaces):
        if p == prefix:
            return uri
    return None # not found
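Putting the pieces together, the following sketch shows one way to convert qualified attribute values while parsing. It assumes the standard-library xml.etree.ElementTree module; resolve_qname_attribute, the sample document, and the choice of the type attribute are all hypothetical and only serve as illustration.

```python
import io
from xml.etree.ElementTree import iterparse

def geturi(prefix, namespaces):
    # search backwards, so the most recent declaration wins
    for p, uri in reversed(namespaces):
        if p == prefix:
            return uri
    return None  # not found

def resolve_qname_attribute(source, attribute):
    """Return the given attribute's values in {uri}local form (hypothetical helper)."""
    namespaces = []  # active (prefix, uri) pairs
    resolved = []
    for event, elem in iterparse(source, events=("end", "start-ns", "end-ns")):
        if event == "start-ns":
            namespaces.append(elem)  # elem is a (prefix, uri) pair here
        elif event == "end-ns":
            namespaces.pop()
        else:
            value = elem.get(attribute)
            if value and ":" in value:
                prefix, local = value.split(":", 1)
                uri = geturi(prefix, namespaces)
                if uri is not None:
                    resolved.append("{%s}%s" % (uri, local))
    return resolved

# made-up sample document, for illustration only
document = io.BytesIO(
    b"<doc xmlns:xsd='http://www.w3.org/2001/XMLSchema'>"
    b"<item type='xsd:string'/>"
    b"<note text='time: 12:00'/>"
    b"</doc>"
)

result = resolve_qname_attribute(document, "type")
print(result)  # ['{http://www.w3.org/2001/XMLSchema}string']
```

Only the attribute named by the caller is converted. The text attribute on the note element also contains colons, but since nothing says it holds a qualified name, it is left alone.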