Using ElementTrees for Pull-Style Parsing
April 25, 2003 | Fredrik Lundh
Note: In recent versions of ElementTree, the iterparse interface provides a more convenient (and faster) way to do this. See The ElementTree iterparse Function for more information and examples.
“Do you need all data DOMmed at once? You may be able to have one DOM tree at a time, dropping and reloading everytime you switch file.”
An alternative is to use an incremental tree builder, and process interesting subtrees as they arrive.
Here’s an example, using the elementtree module:
    from elementtree import ElementTree

    class MyBuilder(ElementTree.TreeBuilder):

        def end(self, tag):
            elem = ElementTree.TreeBuilder.end(self, tag)
            if elem.tag == "SCENE":
                # process(elem) in some way, and write it out, e.g.
                # ElementTree.ElementTree(elem).write(sys.stdout)
                elem.clear() # we're done with it
            return elem # pass the closed element on, like the base class does

    parser = ElementTree.XMLTreeBuilder()
    parser._target = MyBuilder() # plug in a custom builder!

    tree = ElementTree.parse(filename, parser)
The above example overrides the tree builder’s end method, looking for SCENE elements.
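For readers on a current Python, here is a rough sketch of the same idea using the standard library's xml.etree.ElementTree module (the successor of the elementtree package). It assumes that XMLParser accepts the builder through its target argument, which replaces the private _target assignment; the names SceneBuilder, parse_scenes, and handle_scene are made up for illustration and do not appear in the original script.

```python
import io
import xml.etree.ElementTree as ET


class SceneBuilder(ET.TreeBuilder):
    """Tree builder that passes each finished SCENE to a callback, then empties it."""

    def __init__(self, handle_scene):
        super().__init__()
        # handle_scene is a hypothetical callback, not part of the original script
        self.handle_scene = handle_scene

    def end(self, tag):
        elem = super().end(tag)  # the base class returns the element just closed
        if elem.tag == "SCENE":
            self.handle_scene(elem)
            elem.clear()  # drop children, text, and attributes; we're done with it
        return elem


def parse_scenes(source, handle_scene):
    """Parse a filename or binary file object, calling handle_scene per SCENE."""
    parser = ET.XMLParser(target=SceneBuilder(handle_scene))
    return ET.parse(source, parser)


if __name__ == "__main__":
    sample = b"<PLAY><ACT><SCENE><LINE>one</LINE></SCENE></ACT></PLAY>"
    parse_scenes(io.BytesIO(sample), lambda scene: print(scene.find("LINE").text))
```

The callback sees the complete subtree, because end is only called once the closing tag has arrived. The measurements reported below were taken with the original elementtree script, not with this sketch.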
I’ve tested this with a 10 megabyte XML file created by concatenating Jon Bosak’s Hamlet XML file over and over again, and wrapping it all in a single document element.
The resulting file contains 720 scenes (about 15k each, on average).
The above script requires about 4.5 megabytes to run to completion, and about 2 minutes of processing time (on a really slow machine).
If I comment out the elem.clear() call, the script requires about 75 megabytes, and about 15 minutes (13 of which were spent on swapping; I ran the test on a machine with 96 megabytes of RAM and slow disks… ;-)
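The same clear-as-you-go pattern carries over to the iterparse interface mentioned in the note at the top. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module; the function name count_speeches_per_scene and the clear flag are invented for this example, and SPEECH is assumed to be the element that holds each speech inside a SCENE.

```python
import io
import xml.etree.ElementTree as ET


def count_speeches_per_scene(source, clear=True):
    """Return the number of SPEECH children of each SCENE, in document order."""
    counts = []
    # iterparse yields (event, element) pairs; "end" fires when a tag is closed,
    # so the element's whole subtree is available at that point.
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "SCENE":
            counts.append(len(elem.findall("SPEECH")))
            if clear:
                elem.clear()  # discard the finished subtree
    return counts


if __name__ == "__main__":
    sample = b"<PLAY><SCENE><SPEECH/><SPEECH/></SCENE><SCENE><SPEECH/></SCENE></PLAY>"
    print(count_speeches_per_scene(io.BytesIO(sample)))
```

Passing clear=False gives the same counts but keeps every scene in memory, which corresponds to commenting out the elem.clear() call above. Note that a cleared SCENE is emptied, not removed: it stays attached to its parent as an empty element.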