Element Tree Infosets
Updated January 18, 2003 | Fredrik Lundh
The XML Infoset Model
The XML Information Set (Infoset) is an attempt to define a data model complete enough to represent anything that can be stored in an XML document. The infoset defines a number of “abstract” building blocks, such as elements, attributes and characters.
(It’s probably a good idea to see the infoset data model as the basis for the entire XML universe. The XML 1.0 specification simply describes a way to store infosets as byte streams, and standards like DOM and SAX provide programming interfaces. In other words, everything beyond the infoset proper is just an implementation detail.)
The XML infoset is a tree structure, with a Document Information Item as the “root” node. This node must contain exactly one Element Information Item, which in turn can contain a mix of child elements, character items, comments, and a couple of other item types. Each element can also carry a number of attribute items.
For an overview, see the XML Infoset page.
And yes, you can see the XPath data model as a variation of the Infoset. Each model can be defined in terms of the other.
A Simplified Infoset Model
The simplified model described here was first introduced in Secret Labs’ effDOM module, and was later made available in the xmltoys package. This model uses an infoset tree consisting of individual element nodes. Each node represents not only the start and end tags, but also XML attributes and text sections (as element attributes); there are no separate attribute or character data nodes in the tree.
This design is optimized for data-style XML documents, where each element contains either text sections (character data), or other elements, but not both.
The element type provides a basic sequence interface. To access the subelements, just iterate over the element. This will visit all subelements, in the order given in the source document.
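As a rough illustration of the sequence interface, the sketch below uses the standard library's xml.etree.ElementTree module, on the assumption that it follows the same element model as the one described here. The sample document and its tag names are made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical data-style document, invented for illustration.
doc = ET.fromstring(
    "<person><name>Ada</name><born>1815</born><city>London</city></person>"
)

# Iterating over an element visits its direct subelements in document order.
tags = [child.tag for child in doc]

# The usual sequence operations are available as well.
count = len(doc)    # number of direct subelements
first = doc[0]      # indexing returns a single element
last_two = doc[1:]  # slicing returns a list of elements
```

Note that iteration only covers direct children; to reach deeper levels, iterate over each subelement in turn.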
To access element content, check the text member. If the element contains character data, this member contains a string; otherwise, it is None.
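A small sketch of reading the text member, again assuming the standard library's xml.etree.ElementTree as a stand-in for the element type discussed here. The document and the text_of helper are hypothetical and not part of any library.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: one element with character data, one empty element.
doc = ET.fromstring("<person><name>Ada</name><nick/></person>")

name = doc[0]
nick = doc[1]

def text_of(element, default=""):
    """Return the element's text, or a default if it has no character data."""
    # An element without character data has text set to None, not "".
    return element.text if element.text is not None else default

name_text = text_of(name)
nick_text = text_of(nick)
```

Wrapping the None check in a helper like this is one possible way to avoid repeating it wherever the text is used as a string.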
For more information on the element API, see the Element Trees note.
Mixed Content
Working with “text-style” documents is also easy. The main difference is that you usually have to make sure you include the tail member in the processing.
Consider the following XML fragment:
<ELEM key="value">text<SUBELEM />tail</ELEM>
This fragment results in two element instances. In a Python-like notation, their content looks like this:
element.tag = "ELEM"
element.attrib = {"key": "value"}
element.text = "text"
element.tail = None
element[:] = [<Element SUBELEM>]

subelement.tag = "SUBELEM"
subelement.attrib = {}
subelement.text = None
subelement.tail = "tail"
subelement[:] = []
The tail member of an element contains the text between that element’s end tag and the next tag, whether that is a start tag or an end tag.
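To see how text and tail fit together in practice, here is a minimal sketch that parses the fragment above and collects all character data in document order. It assumes the standard library's xml.etree.ElementTree behaves like the element type described here; the flatten function is a hypothetical helper written for this example.

```python
import xml.etree.ElementTree as ET

element = ET.fromstring('<ELEM key="value">text<SUBELEM />tail</ELEM>')
subelement = element[0]

def flatten(elem):
    """Return all character data inside elem, in document order."""
    # Start with the text that precedes the first subelement, if any.
    parts = [elem.text or ""]
    for child in elem:
        # Everything inside the child comes first ...
        parts.append(flatten(child))
        # ... followed by the text after the child's end tag.
        parts.append(child.tail or "")
    return "".join(parts)

flat = flatten(element)
mixed = flatten(ET.fromstring("<p>a<b>b</b>c<i>d</i>e</p>"))
```

The element's own tail is deliberately left out, since that text belongs to the parent's content rather than to the element itself. In recent Python versions, "".join(element.itertext()) should give the same result.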