The Sgmlop Module Handbook
April 4, 1998 | Fredrik Lundh
Overview
The sgmlop module provides a simple and fast parser/lexer for reading XML documents. By design, the parsers provided by this module are very tolerant. They will parse virtually anything into a stream of start tags, entities, end tags, and text sections. If you need careful well-formedness checking, use expat instead.
The sgmlop module can also be used to parse SGML and HTML documents.
Concepts
FIXME: to be added
Patterns
FIXME: to be added
    import sgmlop

    class xml_handler:
        def finish_starttag(self, tag, attrs):
            ...
        def finish_endtag(self, tag):
            ...
        def handle_data(self, data):
            ...

    parser = sgmlop.XMLParser()
    target = xml_handler()
    parser.register(target)
    # "file" is an already opened file object
    while 1:
        data = file.read(8192)
        if not data:
            break
        parser.feed(data)
    parser.close()
FIXME: cookbook: how to use entity resolvers
FIXME: cookbook: how to count lines
FIXME: cookbook: how to parse unicode strings
FIXME: cookbook: how to parse dtd
FIXME: cookbook: how to parse external entities
Classes
- XMLParser()
-
Create an XML parser.
- SGMLParser()
-
Create an SGML parser.
Parser Methods
- register(target)
-
Register a parser target object. This method looks up a number of target methods in this object, and registers them with the parser.
For a list of target methods used by this method, see the target interface description below.
Note that if you use the standard pattern where a parser class holds a reference to the sgmlop object and registers its own methods as handlers, the two objects refer to each other, and Python may leak resources. To avoid this, you can either remove the sgmlop object from the class instance before you destroy the instance, or unregister all methods (by calling register(None)), or both. Recent versions of sgmlop support proper garbage collection for this situation, but it never hurts to be on the safe side.
- feed(string)
-
Feed a string (or string buffer) to the parser.
- close()
-
Flush the parser buffers, and shut down the parser. This method should always be called after the last call to feed, to make sure all data has been returned.
This method also releases references to registered handler methods. To avoid memory leaks caused by cyclical references, you must call this method when the parsing is finished.
- parse(string)
-
Same as feed followed by close. Don’t mix this method with feed and close; either call this method once for the entire document, or use feed/close to parse your document piece by piece.
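The register/feed/close lifecycle and the leak warning above can be combined in a small owner class. The following is only a sketch: ElementCounter, parse_stream, shutdown, and the parser_factory argument are hypothetical names introduced for illustration, and only register, feed, and close come from the method list above. The factory argument exists so the class can be tried with a stand-in parser when sgmlop is not installed.

```python
class ElementCounter:
    """Hypothetical owner object: holds the parser and is also its target."""

    def __init__(self, parser_factory=None):
        if parser_factory is None:
            # Assumption: sgmlop is installed. It is imported lazily so the
            # class can also be exercised with a stand-in parser.
            import sgmlop
            parser_factory = sgmlop.XMLParser
        self.count = 0
        self._parser = parser_factory()
        # Registering our own methods creates a reference cycle:
        # self -> parser -> bound methods -> self.
        self._parser.register(self)

    # Target interface: only start tags are of interest here.
    def finish_starttag(self, tag, attrs):
        self.count += 1

    def parse_stream(self, stream, chunk_size=8192):
        """Feed the stream piece by piece, then always shut the parser down."""
        try:
            while True:
                data = stream.read(chunk_size)
                if not data:
                    break
                self._parser.feed(data)
        finally:
            self.shutdown()
        return self.count

    def shutdown(self):
        """Flush the parser and drop our reference to it."""
        parser, self._parser = self._parser, None
        if parser is not None:
            parser.close()  # flushes buffers and releases the handlers
```

Because shutdown runs in a finally clause, the parser is closed and detached from the instance even if a handler raises while parsing.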
Target Interface
The target object can implement one or more of the following methods. A typical target object should implement at least finish_starttag, finish_endtag, and handle_data.
- finish_starttag(tag, attrib)
-
Handle a start tag. The XML parser passes the attributes as a dictionary, the SGML parser as a list of (key, value) tuples.
- finish_endtag(tag)
-
Handle an end tag.
- handle_proc(target, content)
-
Handle a processing instruction. If omitted, processing instructions are ignored.
- handle_special(content)
-
Handle a special element, including the special elements that make up an internal DTD. If omitted, special elements are ignored.
- handle_charref(ref)
-
Handle a decimal or hexadecimal character reference. You usually don’t have to define this method; if it’s not defined, the parser will convert the reference to a character string, and pass it to the handle_data method.
- handle_entityref(ref)
-
Handle a named entity reference in character data. If present, this method is also called for standard entities (gt, amp, etc), and for malformed character entities.
If not defined, the parser resolves internal entities by itself, and uses the resolve_entityref method for other entities. The resulting string is then passed to the handle_data method instead.
If an entity cannot be resolved, it is ignored, unless running in strict mode.
- resolve_entityref(ref) => string or None
-
Resolve a named entity reference. This is used for entities in attribute values, and also for character data, if handle_entityref is not defined.
If successful, this method should return a character string. If the entity should be ignored, return None. Otherwise, the method should raise a suitable exception.
- handle_data(text)
-
Handle character data.
- handle_cdata(text)
-
Handle a CDATA section. If not defined, the character contents are passed to the handle_data method instead.
- handle_comment(text)
-
Handle an XML comment. If not defined, comments are ignored.
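To show how the target methods fit together, here is a minimal sketch of a target that records start tags, collects character data, and resolves a few named entities. TextCollector, EXTRA_ENTITIES, as_dict, and collect_text are names made up for this example; the handler names and the two attribute representations are taken from the descriptions above. Whether a given sgmlop build expects byte strings or text strings is not covered here, so treat the argument to collect_text as an assumption.

```python
# Illustrative entity table; a real application would supply its own.
EXTRA_ENTITIES = {"nbsp": "\u00a0", "copy": "\u00a9"}


def as_dict(attrs):
    """Accept either attribute representation and return a dictionary."""
    if isinstance(attrs, dict):
        return attrs        # XML parser: already a dictionary
    return dict(attrs)      # SGML parser: list of (key, value) tuples


class TextCollector:
    """Hypothetical target implementing part of the target interface."""

    def __init__(self):
        self.tags = []
        self.text = []

    def finish_starttag(self, tag, attrs):
        self.tags.append((tag, as_dict(attrs)))

    def finish_endtag(self, tag):
        pass

    def handle_data(self, text):
        self.text.append(text)

    def resolve_entityref(self, ref):
        # Return a string for known entities; raise KeyError otherwise.
        return EXTRA_ENTITIES[ref]

    # handle_cdata and handle_comment are left out on purpose: CDATA then
    # arrives through handle_data, and comments are ignored.

    def get_text(self):
        return "".join(self.text)


def collect_text(document):
    """Parse a whole document in one call and return its text."""
    import sgmlop  # assumption: sgmlop is installed
    target = TextCollector()
    parser = sgmlop.XMLParser()
    parser.register(target)
    parser.parse(document)  # same as feed followed by close
    return target.get_text()
```

Since handle_entityref is not defined, entities in character data go through resolve_entityref, and the resolved strings reach handle_data like any other text.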