Implementing Pull-Style XML Parsers
May 4, 2000 | Fredrik Lundh
This xml-dev posting inspired Paul Prescod to implement the xml.dom.pulldom module, which was later added to Python’s standard library. A similar technique is used in ElementTree’s iterparse API.
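For readers who have not used the two standard-library APIs mentioned in this note, here is a minimal sketch of each, written for Python 3. The sample document and the helper names `pulldom_events` and `iterparse_events` are made up for illustration. Both helpers pull events from the parser in a plain loop instead of registering callbacks.

```python
import io
from xml.dom import pulldom
import xml.etree.ElementTree as ET

# Illustrative sample document (not from the original posting).
XML = b"<doc><item id='1'>spam</item><item id='2'>eggs</item></doc>"

def pulldom_events(source):
    """Collect (event, node name) pairs for element events from pulldom."""
    events = []
    # pulldom.parse returns an event stream the caller iterates over
    for event, node in pulldom.parse(source):
        if event in (pulldom.START_ELEMENT, pulldom.END_ELEMENT):
            events.append((event, node.nodeName))
    return events

def iterparse_events(source):
    """Collect (event, tag) pairs from ElementTree's iterparse."""
    return [(event, elem.tag)
            for event, elem in ET.iterparse(source, events=("start", "end"))]

if __name__ == "__main__":
    for item in pulldom_events(io.BytesIO(XML)):
        print(item)
    for item in iterparse_events(io.BytesIO(XML)):
        print(item)
```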
Q. I’m not sure how we would actually implement this. The only XML parser we have that supports a pull-style interface is RXP, and I’m not sure if we can convert the other interfaces to pull-style interfaces in a sensible way (at least not on a level as low as SAX) without storing the entire sequence of events.
Assuming that a pull-style parser is what I think it is, here’s how to convert any incremental parser (xmllib, sgmlop, expat, etc.) to a pull-style parser:
    import xmllib  # Python 2 only; removed in Python 3.0

    START, DATA, END = "start", "data", "end"

    class XMLPuller(xmllib.XMLParser):

        def __init__(self, stream):
            xmllib.XMLParser.__init__(self)
            self.__stream = stream
            self.__tokens = []

        def get(self):
            # feed the parser until it has produced at least one token
            while not self.__tokens:
                data = self.__stream.read(10000)
                if not data:
                    self.close()
                    break
                self.feed(data)
            if self.__tokens:
                return self.__tokens.pop(0)
            return None # end of stream

        # each callback queues one token as a single tuple
        def unknown_starttag(self, tag, attr):
            self.__tokens.append((START, tag, attr))

        def handle_data(self, data):
            self.__tokens.append((DATA, data))

        def unknown_endtag(self, tag):
            self.__tokens.append((END, tag))

    puller = XMLPuller(open("myfile.xml"))

    while 1:
        token = puller.get()
        if not token:
            break
        print token
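Since xmllib is no longer available in Python 3, the same idea can be sketched on top of expat, one of the other incremental parsers mentioned above. This is only an illustration under stated assumptions: the class name `ExpatPuller`, its `__iter__` convenience method, and the use of a `deque` for the token queue are not part of the original posting. The stream is assumed to be a binary file-like object.

```python
import io
from collections import deque
from xml.parsers import expat

START, DATA, END = "start", "data", "end"

class ExpatPuller:
    """Hypothetical Python 3 analogue of XMLPuller, built on expat."""

    def __init__(self, stream):
        self._stream = stream    # binary file-like object
        self._tokens = deque()   # tokens produced but not yet consumed
        self._done = False       # set once end of input has been signalled
        self._parser = expat.ParserCreate()
        # expat invokes these handlers while parsing; each queues one token
        self._parser.StartElementHandler = (
            lambda tag, attr: self._tokens.append((START, tag, attr)))
        self._parser.CharacterDataHandler = (
            lambda data: self._tokens.append((DATA, data)))
        self._parser.EndElementHandler = (
            lambda tag: self._tokens.append((END, tag)))

    def get(self):
        # feed the parser until a token is queued or the input runs out
        while not self._tokens and not self._done:
            data = self._stream.read(10000)
            if data:
                self._parser.Parse(data, False)
            else:
                self._parser.Parse(b"", True)  # tell expat the input is complete
                self._done = True
        if self._tokens:
            return self._tokens.popleft()
        return None  # end of stream

    def __iter__(self):
        # call get() repeatedly until it returns None
        return iter(self.get, None)

if __name__ == "__main__":
    for token in ExpatPuller(io.BytesIO(b"<a x='1'>hi<b/></a>")):
        print(token)
```

As in the xmllib version, only the events produced by the most recent chunk of input are buffered, so the full event sequence never has to be stored. The `deque` is a small design choice: taking a token from the left end avoids shifting the remaining items, which `list.pop(0)` has to do.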