Introducing ElementTree 1.3
Fredrik Lundh | September 2007
Episode I: The Boring Parts #
ElementTree 1.3 is an incremental update to the ElementTree 1.2.6 library. You can get the latest ET 1.3 alpha via Subversion, from here:
Or here, if you prefer a stable link to the current alpha:
http://svn.effbot.org/public/tags/elementtree-1.3a3-20070912/
(To install, use svn co or svn export on one of the above URI:s, and then run the setup.py script.)
There’s also a companion version of cElementTree, tentatively called 1.0.6. It will appear on a site near you in a not too distant future.
In the meantime, here’s an article that covers most of the core enhancements in the first round of alphas. There are a couple of other enhancements coming up as well; they will appear in the first beta release, and deserve their own article (or two). But for now, let’s focus on the ElementTree core.
Element Improvements #
The Element class has undergone a few small enhancements. First, the Element callable used to be a factory function, but is now a real class. This means that you can inherit from it:
class MyElement(Element):
    def tostring(self):
        return ET.tostring(self)
This is mostly of academic interest, though. The ET philosophy is still geared towards the use of helper functions to manipulate trees. ET is a Python library, after all.
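As a minimal sketch of what subclassing allows, the following assumes the ET 1.3 API as it later shipped in Python 3's standard library (xml.etree.ElementTree) rather than the 2007 alpha itself. The class name and the note element are illustrative only.

```python
import xml.etree.ElementTree as ET

class MyElement(ET.Element):
    """Hypothetical subclass that adds a convenience serializer method."""

    def tostring(self):
        # encoding="unicode" asks for a str instead of encoded bytes
        return ET.tostring(self, encoding="unicode")

note = MyElement("note")
note.text = "hello"
serialized = note.tostring()
```

The subclass instance is still an ordinary element, so the module-level helper functions keep working on it.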
The new methods in this class are probably a bit more useful, for most programmers:
extend #
extend appends items from a sequence to the element, just as for list objects.
elem.extend(other.findall("p"))
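A slightly fuller sketch of extend, assuming Python 3's standard-library xml.etree.ElementTree (which carries the 1.3 API); the element names are made up for the example.

```python
import xml.etree.ElementTree as ET

source = ET.XML("<body><p>one</p><div/><p>two</p></body>")
target = ET.Element("section")

# extend() appends every element from the iterable, like list.extend()
target.extend(source.findall("p"))

texts = [p.text for p in target]
```

Note that the elements are not copied: after the call, the same element objects are referenced from both trees.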
iter #
iter is the new name for getiterator; in ET 1.3, it’s implemented as a generator method, but is otherwise identical to the old version:
for e in elem.iter():
    ...
Note that in 1.2, getiterator returns a list. To get the same behaviour in 1.3, use the list function:
elements = list(elem.iter())
assert isinstance(elements, list)
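The traversal order can be sketched as follows, again assuming the standard-library xml.etree.ElementTree in Python 3. The document and tag names are invented for illustration.

```python
import xml.etree.ElementTree as ET

root = ET.XML("<a><b><c/></b><d/></a>")

walker = root.iter()              # lazy: consumed one element at a time
tags = [e.tag for e in walker]    # document order, starting with root itself

# An optional tag argument restricts the walk to matching elements
only_c = [e.tag for e in root.iter("c")]
```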
itertext #
Finally, itertext is a generator method that returns all “inner text” in an element.
for text in elem.itertext():
    print repr(text)

file.writelines(elem.itertext())
The generated sequence includes data from the text attribute for the element itself, and from the text and tail attributes for all subelements.
To get all text as a single string, use the join method:
def gettext(elem):
    return "".join(elem.itertext())
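To make the text/tail rule concrete, here is a small sketch that assumes Python 3's standard-library xml.etree.ElementTree; the sample markup is invented.

```python
import xml.etree.ElementTree as ET

root = ET.XML("<p>Hello <b>big</b> world<br/>!</p>")

# Yields p.text, b.text, b.tail, then br.tail (br has no text, so it is skipped)
parts = list(root.itertext())
joined = "".join(parts)
```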
Parser Improvements #
The various parser functions (parse, XML, fromstring, and the parse method in the ElementTree class) are now all based on the XMLParser class (more on this below).
This parser now raises a ParseError exception if something goes wrong. This exception is a subclass of SyntaxError, with an additional position attribute that contains the row number (starting at one) and the column number for where the error was found.
try:
    elem = ET.XML(text)
except ET.ParseError, v:
    row, column = v.position
    print "error on row", row, "column", column, ":", v
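For readers on Python 3, the same pattern might look like the sketch below. It assumes the standard-library xml.etree.ElementTree; try_parse is a hypothetical helper, not part of the library.

```python
import xml.etree.ElementTree as ET

def try_parse(text):
    """Return (element, None) on success or (None, (row, column)) on failure."""
    try:
        return ET.XML(text), None
    except ET.ParseError as err:
        # position is a (row, column) tuple; rows start at one
        return None, err.position

elem, position = try_parse("<root><child></root>")
```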
All parser functions now take an optional parser keyword argument, which can be used to explicitly pass in a parser instance. You can use this to override the document encoding:
parser = ET.XMLParser(encoding="utf-8")
tree = ET.parse("file.xml", parser=parser)
See Other Changes below for more on this.
Writer Improvements #
The write method in the ElementTree class has undergone a complete overhaul, and now uses a new, more flexible serializer framework. Some highlights:
- The new serializer puts all namespace declarations on the root element. No more duplicate xmlns attributes on sibling elements.
- The serializer can produce XML, HTML, and plain text output.
- When generating XML, you can control whether the XML declaration is written (always, never, or only when needed, as in 1.2). There’s also an experimental feature (at least in the alphas) that lets you specify a default namespace.
- There’s an official API for the standard prefix table; no need to add stuff to an internal dictionary.
- The serializer omits the start and end tag for elements for which the tag attribute is set to None; this can be used to quickly remove an element from a tree, without losing the content.
- The new serializer is a lot faster. On a selection of typical XML files, it’s about twice as fast as the one in 1.2.6 (Python 2.5), in my tests. Your mileage may vary, of course, but it should be faster than before on all trees.
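The tag-set-to-None trick from the list above can be sketched like this, assuming the serializer as it ships in Python 3's standard-library xml.etree.ElementTree. The markup is a made-up example.

```python
import xml.etree.ElementTree as ET

root = ET.XML("<p>some <b>bold</b> text</p>")

# Setting tag to None makes the serializer skip <b>...</b>,
# while still writing the element's text, children, and tail.
root.find("b").tag = None

flattened = ET.tostring(root, encoding="unicode")
```

The element is still present in the tree; only the serialized output drops its tags.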
XML Output #
XML is the default output format, and works pretty much as in 1.2:
tree.write("out.xml")
tree.write("out.xml", method="xml")
You can provide an encoding, if necessary:
tree.write("out.xml", encoding="utf-8")
As before, the serializer uses character references for character data and attribute values that cannot be encoded in the given encoding. However, in 1.3, only the offending character is escaped, rather than the whole text fragment.
You can use the register_namespace function to add “well-known” prefixes to the serializer.
ET.register_namespace("dc", "http://purl.org/dc/elements/1.1/")
This adds the “dc” prefix to a global table. There’s no way to specify prefixes only for a given call to write; that may be fixed before the final release.
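A short sketch of the effect on serialized output, assuming Python 3's standard-library xml.etree.ElementTree. The namespace URI and the "inv" prefix are hypothetical and exist only for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI and prefix, used only for this sketch
EXAMPLE_NS = "http://example.com/ns/inventory"
ET.register_namespace("inv", EXAMPLE_NS)

item = ET.Element("{%s}item" % EXAMPLE_NS)
item.text = "widget"

# The registered prefix is used instead of a generated one such as ns0
xml_text = ET.tostring(item, encoding="unicode")
```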
You can use the xml_declaration option to control whether an XML declaration is written. If True, the declaration is always written. If False, the declaration is never written. If omitted or None, ET 1.3 uses the old ET 1.2 behaviour, which is to write the declaration only when needed:
tree.write("out.xml", xml_declaration=True)
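The three settings can be compared side by side with a sketch like the one below. It assumes Python 3's standard-library xml.etree.ElementTree, where the declaration is emitted by default only for encodings other than UTF-8 and US-ASCII; serialize is a hypothetical helper.

```python
import io
import xml.etree.ElementTree as ET

tree = ET.ElementTree(ET.XML("<root/>"))

def serialize(**options):
    """Write the tree to an in-memory buffer and return the bytes."""
    buffer = io.BytesIO()
    tree.write(buffer, **options)
    return buffer.getvalue()

always = serialize(encoding="utf-8", xml_declaration=True)
never = serialize(encoding="iso-8859-1", xml_declaration=False)
default_utf8 = serialize(encoding="utf-8")          # declaration not needed
default_latin1 = serialize(encoding="iso-8859-1")   # declaration needed
```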
The default_namespace option specifies a namespace that should be used as the default in the file. The serializer will put the necessary xmlns attribute on the root element, and omit the prefix for all elements that belong to this namespace:
tree.write("out.xml", default_namespace="http://www.w3.org/1999/xhtml")
This feature is somewhat experimental in the current alpha. Among other things, all elements in the tree must use a namespace for this to work; the default namespace cannot be “undeclared”.
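Both the prefix-free output and the "all elements must use a namespace" restriction can be seen in a sketch like this one. It assumes the behaviour of Python 3's standard-library xml.etree.ElementTree, which may differ in detail from the alpha described here.

```python
import io
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"

html = ET.Element("{%s}html" % XHTML)
ET.SubElement(html, "{%s}body" % XHTML)

buffer = io.BytesIO()
ET.ElementTree(html).write(buffer, default_namespace=XHTML)
output = buffer.getvalue().decode("ascii")

# An element without a namespace cannot be written this way
ET.SubElement(html, "plain")
try:
    ET.ElementTree(html).write(io.BytesIO(), default_namespace=XHTML)
    rejected = False
except ValueError:
    rejected = True
```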
HTML Output #
The html output method is similar to XML, but omits the end tags for elements like link, input, etc. It also handles script and style elements properly, and uses HTML-specific quoting for attribute values. To get HTML output, pass in html to the method option:
tree.write("out.html", method="html")
The html output method is still a bit experimental, and may be modified somewhat before the final release. Bug reports and other suggestions are welcome.
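The difference from XML output can be illustrated with a small sketch, assuming the html method as implemented in Python 3's standard-library xml.etree.ElementTree. The sample markup is invented.

```python
import xml.etree.ElementTree as ET

root = ET.XML("<div><br/><script>if (a &lt; b) run();</script></div>")

as_xml = ET.tostring(root, encoding="unicode", method="xml")
as_html = ET.tostring(root, encoding="unicode", method="html")

# as_xml keeps "<br />" and escapes the "<" inside the script element;
# as_html writes a bare "<br>" and leaves the script body unescaped.
```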
Text Output #
The text output method simply skips all tags, and just outputs the contents of the text and tail attributes, in the given encoding.
tree.write("out.txt", method="text", encoding="utf-8")
This is similar to itertext, except that the output is encoded, and the tail attribute on the root element is included in the output.
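The tail difference is easy to miss, so here is a sketch comparing the two, assuming Python 3's standard-library xml.etree.ElementTree. It uses tostring with method="text" rather than write, purely to keep the example in memory.

```python
import xml.etree.ElementTree as ET

doc = ET.XML("<doc><p>Hi <b>there</b></p> trailing</doc>")
p = doc.find("p")

from_itertext = "".join(p.itertext())                        # no tail of p
from_serializer = ET.tostring(p, encoding="unicode", method="text")  # with tail
```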
Other changes #
XMLParser #
The XMLParser class replaces the old XMLTreeBuilder class (the old name is still available, of course). The parser now takes an optional encoding argument, which can be used to override the file’s internal encoding:
parser = ET.XMLParser(encoding="utf-8")
parser.feed(data)
elem = parser.close()
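The feed/close cycle is mostly useful when data arrives in pieces. A minimal sketch, assuming Python 3's standard-library xml.etree.ElementTree and an invented three-chunk document:

```python
import xml.etree.ElementTree as ET

chunks = [b"<log><entry>sta", b"rt</entry><entry>", b"stop</entry></log>"]

parser = ET.XMLParser(encoding="utf-8")
for chunk in chunks:       # data may be split at arbitrary byte offsets
    parser.feed(chunk)
root = parser.close()      # returns the root element of the finished tree

entries = [e.text for e in root]
```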
You can also pass a configured parser object to the parse, fromstring, and XML functions:
parser = ET.XMLParser(encoding="utf-8")
tree = ET.parse(source, parser=parser)
The encoding option is most often used to parse XML that might have been transcoded, such as when sent over HTTP. You can also use it to parse XML available as Unicode:
elem = XML(
    text.encode("utf-8"),
    parser=XMLParser(encoding="utf-8")
)
The tostring and fromstring functions have gotten tostringlist and fromstringlist companions; they work in the same way, but work on lists of string fragments instead of strings.
fromstring and fromstringlist both take an optional parser keyword argument. Likewise, tostring and tostringlist both take an optional method argument.
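A round trip through the list-based functions might look like this sketch, which assumes Python 3's standard-library xml.etree.ElementTree. How the output is split into fragments is an implementation detail, so the example only relies on the joined result.

```python
import xml.etree.ElementTree as ET

fragments = ["<root><item>a</item>", "<item>b</item>", "</root>"]

# Fragment boundaries do not need to line up with element boundaries
root = ET.fromstringlist(fragments)

pieces = ET.tostringlist(root, encoding="unicode")
rebuilt = "".join(pieces)
```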
Performance #
The most notable performance improvement is the new serializer; in my tests, it’s usually about twice as fast as the one in 1.2.6.
Parsing speed is similar to before, but ElementTree now allows you to use the parser from cElementTree 1.0.6 to build ordinary trees. To enable this feature, do:
import elementtree.ElementTree as ET
import cElementTree

ET.XMLParser = cElementTree.XMLParser
Note that this requires cElementTree 1.0.6 or later; a bug in the internal tree builder in earlier versions makes it incompatible with ElementTree 1.3.
Deprecated and Removed Features #
getchildren #
getchildren is deprecated, and issues a warning. You can use sequence operations on the element itself instead:
for e in elem.getchildren():
    ...

for e in elem:
    ...
If you need a list object, use list to convert the Element sequence to a list object:
children = list(elem)
assert isinstance(children, list)
getiterator #
The getiterator method has been replaced by iter (which is also a true iterator in 1.3). This applies to both Element and ElementTree. The old version still works as before; it will be deprecated in the next release.
Truth testing #
The Element type now issues a warning when used in a “boolean context”. To get rid of the warning, make the test explicit:
if len(elem):
    ... has at least one child ...

elem = root.find("tag")
if elem is not None:
    ... found ...
Explicit tests work just fine in ET 1.2, of course.
The boolean interpretation will most likely change in future versions, so that all elements evaluate to true, even if they have no children.
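The reason the two tests must be kept apart is that a childless element and a missing element are different things. A small sketch, assuming Python 3's standard-library xml.etree.ElementTree; has_children is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

root = ET.XML("<root><empty/><full><child/></full></root>")

def has_children(elem):
    """Explicit length test instead of relying on truth value."""
    return len(elem) > 0

empty = root.find("empty")      # exists, but has no children
missing = root.find("missing")  # find() returns None when nothing matches

found_empty = empty is not None
found_missing = missing is not None
```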
SimpleXMLTreeBuilder #
The SimpleXMLTreeBuilder module has been removed. This module used xmllib to do the parsing, but since expat has been available since around Python 2.0, keeping the old support is pretty pointless.
Note: Unless you’re using Jython, I’m told. It will be back in the next alpha, at least temporarily, but will issue a DeprecationWarning for other platforms.
SgmlopTreeBuilder #
The SgmlopTreeBuilder module has been removed; for maximum performance, use cElementTree instead. You can also use cElementTree’s XMLParser with ET 1.3; see the performance section above for details.
XMLTreeBuilder #
The XMLTreeBuilder module has been removed. For detailed namespace access, use iterparse and the start-ns and end-ns events.
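A sketch of the iterparse approach, assuming Python 3's standard-library xml.etree.ElementTree and an invented in-memory document:

```python
import io
import xml.etree.ElementTree as ET

document = io.BytesIO(
    b'<root xmlns:dc="http://purl.org/dc/elements/1.1/">'
    b"<dc:title>Example</dc:title></root>"
)

declared = []   # (prefix, uri) pairs, in the order they come into scope
closed = 0      # number of namespace scopes that ended

for event, data in ET.iterparse(document, events=("start-ns", "end-ns")):
    if event == "start-ns":
        declared.append(data)   # data is a (prefix, uri) tuple
    else:
        closed += 1             # data is None for end-ns
```

Adding "start" and "end" to the events tuple would interleave the element events with the namespace events, which is what makes per-element scope tracking possible.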