Idea: XML Literals For Python

Updated October 10, 2002 | September 24, 2002 | Fredrik Lundh

“…looks interesting in a futuristic kind of way.” — GvR

I propose adding support for XML literals to Python, so that XML fragments can be embedded directly in Python source code. What follows is a brief sketch; it should not be seen as a complete proposal.

Syntax

The XML literal syntax is a subset of the full XML 1.0 syntax.

An XML literal consists of a start tag (<tag>), optional XML content, and a matching end tag (</tag>). You can use an XML literal anywhere an ordinary literal can be used. XML literals may span multiple lines, just like triple-quoted strings. The content may consist of literal text, predefined XML entities, numeric character references, comments, or other tags. Nested tags must be properly balanced.

Examples:

fragment1 = <element>some content</element>

fragment2 = <element>
    <!-- a comment -->
    some content
    <child attribute='value'>
      spam &amp; egg
    </child>
  </element>

or, alternatively, with the text content written as string literals:

fragment1 = <element> "some content" </element>

fragment2 = <element>
    <!-- a comment -->
    "some content"
    <child attribute='value'>
      u"spam & egg"
    </child>
  </element>

Internal Representation

(preliminary)

XML literals are parsed into a tree-like structure consisting of standard Python objects only. Each element is turned into a tuple consisting of the tag name, an attribute dictionary (or None if the element has no attributes), and then the text strings and child tuples that make up its content. Text segments that contain only plain ASCII characters are stored as 8-bit strings; all other segments are stored as Unicode strings. The source code encoding also applies to the XML content. XML comments are ignored.

The above examples correspond to:

fragment1 = ("element", None, "some content")

fragment2 = ("element", None, " some content\n ",
    ("child", {"attribute": "value"}, " spam & egg\n "))
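To make the mapping concrete, here is a minimal sketch that produces the same tuple shape from an XML string at run time. It uses the standard library's xml.etree.ElementTree purely as a stand-in parser (my assumption for illustration; under the proposal the compiler would build the tuples directly), and the helper name to_tuple is hypothetical. In Python 3 all text is Unicode, so the 8-bit/Unicode distinction does not show up here. The default parser drops comments, which matches the rule above.

```python
import xml.etree.ElementTree as ET


def to_tuple(elem):
    """Convert an Element into (tag, attributes-or-None, *content)."""
    content = []
    if elem.text:
        content.append(elem.text)
    for child in elem:
        content.append(to_tuple(child))
        if child.tail:
            # text that follows the child's end tag belongs to the parent
            content.append(child.tail)
    # an element without attributes gets None, as in the examples
    return (elem.tag, dict(elem.attrib) or None) + tuple(content)


fragment1 = to_tuple(ET.fromstring("<element>some content</element>"))
fragment2 = to_tuple(ET.fromstring(
    "<element><!-- a comment -->some content"
    "<child attribute='value'>spam &amp; egg</child></element>"
))
```

The XML here is written without indentation so that the resulting tuples contain no whitespace-only text; how whitespace inside a literal should be treated is not settled by this sketch.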

The tuple structure can be trivially turned into a SAX-like event stream:

def generate(target, node):
    # node is (tag, attributes-or-None, *content)
    target.start(node[0], node[1] or {})
    for item in node[2:]:
        if isinstance(item, tuple):
            generate(target, item)  # child element
        else:
            target.data(item)  # text segment
    target.end(node[0])
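As a quick check of what the event stream looks like, the following sketch feeds a hand-written tuple through generate. The Recorder class is a hypothetical target of my own; any object with start, data, and end methods would do.

```python
def generate(target, node):
    target.start(node[0], node[1] or {})
    for item in node[2:]:
        if isinstance(item, tuple):
            generate(target, item)
        else:
            target.data(item)
    target.end(node[0])


class Recorder:
    """Hypothetical target that just remembers the events it receives."""

    def __init__(self):
        self.events = []

    def start(self, tag, attrib):
        self.events.append(("start", tag, attrib))

    def data(self, text):
        self.events.append(("data", text))

    def end(self, tag):
        self.events.append(("end", tag))


fragment = ("element", None, "some content",
            ("child", {"attribute": "value"}, "spam & egg"))

recorder = Recorder()
generate(recorder, fragment)
for event in recorder.events:
    print(event)
```

The events arrive in document order: start of element, its text, start of child, the child's text, end of child, end of element.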

It is assumed that popular DOM-like implementations will provide convenience functions to wrap tuple structures in DOM trees:

from mydom import xml

fragment = xml(<element>some content</element>)
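One way such a convenience function might look, sketched with the standard library's xml.etree.ElementTree standing in for "mydom" (an assumption on my part; mydom itself is hypothetical). TreeBuilder already exposes start, data, and end methods, so the generate function can drive it directly. Since XML literals do not exist, the tuple is written out by hand.

```python
import xml.etree.ElementTree as ET


def generate(target, node):
    target.start(node[0], node[1] or {})
    for item in node[2:]:
        if isinstance(item, tuple):
            generate(target, item)
        else:
            target.data(item)
    target.end(node[0])


def xml(node):
    """Hypothetical wrapper: build an ElementTree element from a tuple."""
    builder = ET.TreeBuilder()
    generate(builder, node)
    return builder.close()  # returns the root element


element = xml(("element", None, "some content",
               ("child", {"attribute": "value"}, "spam & egg")))
serialized = ET.tostring(element, encoding="unicode")
print(serialized)
```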

It is also assumed that DOM-like implementations will support operations on tuples:

import mydom

dom = mydom.parse(file)
dom.root += <element>text</element>
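A rough sketch of how a DOM-like class could accept tuples through the += operator. The Element class and its from_tuple method are hypothetical names of mine, not part of any existing library, and the tuple is written by hand in place of an XML literal.

```python
class Element:
    """Hypothetical DOM-like element that accepts tuple fragments."""

    def __init__(self, tag, attrib=None):
        self.tag = tag
        self.attrib = dict(attrib or {})
        self.children = []  # text strings and Element objects, in order

    @classmethod
    def from_tuple(cls, node):
        # node is (tag, attributes-or-None, *content)
        elem = cls(node[0], node[1])
        for item in node[2:]:
            if isinstance(item, tuple):
                elem.children.append(cls.from_tuple(item))
            else:
                elem.children.append(item)
        return elem

    def __iadd__(self, node):
        # "root += tuple" appends the fragment as a new child
        self.children.append(Element.from_tuple(node))
        return self


root = Element("root")
root += ("element", None, "text")
```

Because __iadd__ returns self, the same idiom also works on an attribute, as in dom.root += ... above.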

Open Issues

(or “smart questions, stupid answers”. to be updated)

Q. This requires a context-sensitive tokenizer, right?

(most likely. unless we restrict the syntax to what the existing tokenizer can handle. to make that work, users must use string literals for textual content:

fragment1 = <element>"some content"</element>

fragment2 = <element>
    <!-- a comment -->""" some content """<child attribute='value'>""" spam & egg """</child>
  </element>

not very pretty, if you’re asking me.)

Q. Why not pick an existing DOM implementation and stick to it?

(they all suck)

Q. Or at least use a standard node type instead of tuples?

(a tuple subclass, probably)
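A tuple subclass along these lines might look like the sketch below. The class name Node and its property names are assumptions of mine. The point is that such a node still compares equal to a plain tuple and still passes an isinstance(item, tuple) check, so code written for bare tuples keeps working.

```python
class Node(tuple):
    """Hypothetical node type: a tuple with named accessors."""

    __slots__ = ()  # no per-instance dictionary; stays as light as a tuple

    def __new__(cls, tag, attrib=None, *content):
        return super().__new__(cls, (tag, attrib) + content)

    @property
    def tag(self):
        return self[0]

    @property
    def attrib(self):
        return self[1] or {}

    @property
    def content(self):
        return self[2:]


node = Node("element", None, "some content",
            Node("child", {"attribute": "value"}, "spam & egg"))
```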

Q. How to deal with namespaces?

(using prefixes, as usual)

Q. What about DTDs, CDATA sections, processing instructions, etc.?

(you ain’t gonna need it)

Q. What about dynamically generated content? Can we add syntax allowing us to evaluate expressions inside an XML fragment?

(any ideas?)

Q. What about HTML? Can this be used as a template language?

(sure, as long as you don’t mind using xhtml in the source code)

Q. Why bother? XML’s just a fad anyway.

(really?)

Implementation

There is no implementation at the moment. It would be pretty easy to extract bits and pieces from sgmlop and add them to Python’s tokenizer and code generator. Maybe some rainy day…