XML Infosets
Updated July 18, 2002 | May 1, 2002 | Fredrik Lundh
The XML Infoset Model
The XML Information Set (Infoset) is an attempt to define a data model complete enough to represent anything that can be stored in an XML document. The infoset defines a number of “abstract” building blocks, such as elements, attributes and characters.
(It’s probably a good idea to see the infoset data model as the basis for the entire XML universe. The XML 1.0 specification simply describes a way to store infosets as byte streams, and standards like DOM and SAX provide programming interfaces. In other words, everything beyond the infoset proper is just an implementation detail.)
The XML infoset is a tree structure, with a Document Information Item as the “root” node. This node must contain exactly one Element Information Item, which in turn can contain a mix of child elements, character items, comments, and a couple of other item types.
What follows is a somewhat simplified overview of the infoset, given as a number of Python classes, with brief annotations.
The Document Root Node
The root node contains information about the document itself, as well as a list of top-level nodes. The most important node is the document element, which is available both in the child list and via the document_element attribute. The child list can also contain comments and processing instructions, but never more than one element:
    class DocumentInformationItem:
        def __init__(self):
            self.children = [] # child items, in document order
            self.document_element = None # root element item
            self.base_uri = ""
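The "exactly one element" rule can be enforced when the tree is built. The following is a minimal sketch, not part of the infoset itself: the helper name append_to_document and the cut-down ElementInformationItem stand-in are assumptions made for illustration.

```python
class DocumentInformationItem:
    def __init__(self):
        self.children = []            # child items, in document order
        self.document_element = None  # root element item
        self.base_uri = ""


class ElementInformationItem:
    # Cut-down stand-in for the element class; only what this sketch needs.
    def __init__(self, name):
        self.parent = None
        self.name = name


def append_to_document(document, item):
    """Hypothetical helper: add a top-level item, allowing only one element."""
    if isinstance(item, ElementInformationItem):
        if document.document_element is not None:
            raise ValueError("document already has a document element")
        document.document_element = item
    item.parent = document
    document.children.append(item)


doc = DocumentInformationItem()
root = ElementInformationItem("html")
append_to_document(doc, root)
```

Appending a second element to doc raises ValueError, while comments and processing instructions (any other item with a parent attribute) would be accepted.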
Element Nodes
Each element is represented by a separate information item. The element contains a reference to its parent (either the document root or another element) and an ordered list of child items.
It also contains an unordered collection of attribute items (represented by a dictionary in the code sample below). Each attribute item maps a name to a value.
Element and attribute names consist of three parts: the local name, and optional namespace and prefix strings. A complete name is formed by the (namespace, local name) tuple. The prefix is only used in the XML file, and shouldn’t be used to identify elements and attributes in the infoset.
    class ElementInformationItem:
        def __init__(self):
            self.parent = None
            self.children = [] # child items, in document order
            self.attributes = {} # contains attribute information items
            self.name = ""
            self.namespace = None # namespace uri
            self.prefix = None # namespace prefix

    class AttributeInformationItem:
        def __init__(self):
            self.value = ""
            self.name = ""
            self.namespace = None
            self.prefix = None
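To illustrate why the prefix should be ignored, here is a small sketch that compares names by their (namespace, local name) tuple. The constructor arguments, the expanded_name and set_attribute helpers, the choice to key the attribute dictionary by that tuple, and the namespace URI are all assumptions made for this example.

```python
class AttributeInformationItem:
    def __init__(self, name, value, namespace=None, prefix=None):
        self.name = name            # local name
        self.value = value
        self.namespace = namespace  # namespace URI, or None
        self.prefix = prefix        # only meaningful in the XML file


class ElementInformationItem:
    def __init__(self, name, namespace=None, prefix=None):
        self.parent = None
        self.children = []
        self.attributes = {}        # assumption: keyed by (namespace, local name)
        self.name = name
        self.namespace = namespace
        self.prefix = prefix


def expanded_name(item):
    """Return the (namespace, local name) tuple that identifies an item."""
    return (item.namespace, item.name)


def set_attribute(element, attribute):
    """Store an attribute under its expanded name, ignoring the prefix."""
    element.attributes[expanded_name(attribute)] = attribute


NS = "http://example.com/ns"  # made-up namespace URI

# <a:item xmlns:a="http://example.com/ns" a:id="1"/>
first = ElementInformationItem("item", NS, prefix="a")
set_attribute(first, AttributeInformationItem("id", "1", NS, prefix="a"))

# <b:item xmlns:b="http://example.com/ns" b:id="1"/>
second = ElementInformationItem("item", NS, prefix="b")
set_attribute(second, AttributeInformationItem("id", "1", NS, prefix="b"))

same_name = expanded_name(first) == expanded_name(second)
id_value = second.attributes[(NS, "id")].value
```

The two elements use different prefixes in the file, but they have the same expanded name, and the attribute can be looked up without knowing which prefix was used.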
Character Data
Conceptually, character data is stored as a number of character items. Each item stores a single character, whether it comes from literal text given between start and end tags, a character reference, or the content of a CDATA section.
    class CharacterInformationItem:
        def __init__(self):
            self.parent = None # parent element
            self.code = 0 # iso 10646 character code
In practice, any reasonable implementation would probably store runs of characters as strings, in a single node object:
    class CharacterInformationItem:
        def __init__(self):
            self.parent = None # parent element
            self.characters = u""
Still, the infoset does not care about how the text was stored in the XML file. Even if an element’s child list contains more than one consecutive character information item, you cannot assume that they come from separate syntactic constructs in the XML file.
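This can be checked with any parser that exposes character data. The sketch below uses the standard library’s xml.etree.ElementTree module, which is one possible choice and not something the infoset prescribes; the character_codes helper and the sample documents are made up for the example.

```python
import xml.etree.ElementTree as ET

sources = [
    "<t>abc</t>",                # literal text
    "<t>a&#98;c</t>",            # character reference for "b"
    "<t>a<![CDATA[b]]>c</t>",    # CDATA section in the middle
]


def character_codes(xml_text):
    """Parse a one-element document and return its text as character codes."""
    element = ET.fromstring(xml_text)
    return [ord(ch) for ch in element.text]


codes = [character_codes(source) for source in sources]
all_equal = codes[0] == codes[1] == codes[2]
```

All three documents yield the same sequence of character codes, so nothing in the parsed result reveals which syntax was used in the file.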
Additional Nodes
Finally, the infoset can contain additional nodes representing various constructs that can be used in an XML file, such as comments and processing instructions.
    class CommentInformationItem:
        # <!-- content -->
        def __init__(self):
            self.parent = None # parent element
            self.content = "" # comment contents

    class ProcessingInstructionInformationItem:
        # <?target content?>
        def __init__(self):
            self.parent = None # parent element
            self.target = ""
            self.content = ""
See the Infoset specification for a full list of additional node types.
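Since every item hangs off a parent’s child list, the whole infoset can be visited with a simple recursive walk. The following is a rough sketch using cut-down versions of the classes above; the iter_items and text_content helpers are hypothetical names, and the character item uses the string-based variant.

```python
class ElementInformationItem:
    def __init__(self, name):
        self.name = name
        self.children = []  # child items, in document order


class CharacterInformationItem:
    def __init__(self, characters):
        self.characters = characters


class CommentInformationItem:
    def __init__(self, content):
        self.content = content


def iter_items(item):
    """Yield an item and all its descendants, in document order."""
    yield item
    for child in getattr(item, "children", ()):
        yield from iter_items(child)


def text_content(element):
    """Join all character data below an element, skipping other items."""
    return "".join(
        item.characters
        for item in iter_items(element)
        if isinstance(item, CharacterInformationItem)
    )


# <p>Hello <!-- note --><b>world</b>!</p>
p = ElementInformationItem("p")
b = ElementInformationItem("b")
b.children.append(CharacterInformationItem("world"))
p.children.extend([
    CharacterInformationItem("Hello "),
    CommentInformationItem(" note "),
    b,
    CharacterInformationItem("!"),
])

text = text_content(p)
```

Comments are visited by the walk but contribute nothing to the text, which matches the idea that they are separate information items rather than character data.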
The Python Document Object Model
Coming soon.