The Sgmlop Parser/Tokenizer
Updated May 28, 2003 | Fredrik Lundh
The sgmlop module is a fast replacement for the regular expression-based parsers used in the sgmllib/htmllib and xmllib modules. A single module supports both SGML and XML.
The sgmlop parser is tolerant, and happily accepts XML-like data that are not well-formed. If you need strictness, use another parser. sgmlop is an excellent choice for applications that read human-authored content and want to be fairly tolerant, and also for applications that read machine-generated XML in situations where it’s safe to trade standards compliance for speed.
The current release is about 6 times faster than the original re-based implementation provided with Python 1.5, when using an xmllib/sgmllib-style interface. When using sgmlop directly, it can be more than 30 times faster.
If you want fast and fully compliant parsing, you can use the XMLParser component from the cElementTree module.
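To make the difference between tolerant and strict parsing concrete, here is a minimal sketch of what a strict parser does with data that are not well-formed. It assumes a modern Python, where the cElementTree implementation has been folded into the standard xml.etree.ElementTree module. The EventCollector class and the parse_strict helper are hypothetical names made up for this illustration; they are not part of sgmlop or ElementTree.

```python
import xml.etree.ElementTree as ET


class EventCollector:
    """Hypothetical parser target: records start, end, and data events."""

    def __init__(self):
        self.events = []

    def start(self, tag, attrib):
        self.events.append(("start", tag, dict(attrib)))

    def end(self, tag):
        self.events.append(("end", tag))

    def data(self, text):
        # Skip whitespace-only text so the event list stays short.
        if text.strip():
            self.events.append(("data", text))

    def close(self):
        # XMLParser.close() returns whatever the target's close() returns.
        return self.events


def parse_strict(document):
    """Return the event list, or None if the document is not well-formed."""
    parser = ET.XMLParser(target=EventCollector())
    try:
        parser.feed(document)
        return parser.close()
    except ET.ParseError:
        return None


well_formed = "<doc><item id='1'>spam</item></doc>"
malformed = "<doc><item>spam</doc>"  # the item element is never closed

good_events = parse_strict(well_formed)
bad_events = parse_strict(malformed)
```

Here parse_strict returns a list of events for the first document and None for the second, because the strict parser raises ParseError as soon as it sees the mismatched end tag. A tolerant parser such as sgmlop is the better fit when input like the second document should be accepted rather than rejected.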
Downloads
The latest version is available from the effbot.org downloads page.
Documentation
- The Sgmlop Handbook
- Sgmlop Patterns (in progress)