The TidyHTMLTreeBuilder parser can read (almost) arbitrary HTML files, and turn them into well-formed element trees. This parser uses a library version of Dave Raggett’s HTML Tidy utility to fix any problems with the HTML before converting it to XHTML (the XML version of HTML).
Note: If you don’t want to (or cannot) install binary Python extensions, you can use the TidyTools module in the standard ElementTree distribution. That module uses the command-line version of Tidy, which is available for many different platforms.
This tree builder requires the _elementtidy extension, which is based on the tidylib library. Note that this extension is not included in the current elementtree releases, but you can download a separate elementtidy package from the codespit.com downloads site.
Usage
Loading HTML Files
To load an HTML file into an XHTML tree, import the TidyHTMLTreeBuilder module and call its parse function:
    from elementtidy import TidyHTMLTreeBuilder
    tree = TidyHTMLTreeBuilder.parse("myfile.htm")
Note: In the experimental alpha releases, the tree builder is installed in the elementtidy package. If you’re using a version shipped with the ElementTree library, import the module from the elementtree package instead.
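The two import locations mentioned in the note can be handled with a fallback import. The following is a minimal sketch, not part of either distribution: the helper name load_tidy_builder is hypothetical, and it simply returns None when neither package is installed.

```python
def load_tidy_builder():
    """Return the TidyHTMLTreeBuilder module, or None if it is not installed.

    Hypothetical helper: tries the separate elementtidy package first,
    then falls back to a copy shipped inside the elementtree package.
    """
    try:
        from elementtidy import TidyHTMLTreeBuilder
    except ImportError:
        try:
            from elementtree import TidyHTMLTreeBuilder
        except ImportError:
            return None
    return TidyHTMLTreeBuilder


builder = load_tidy_builder()
if builder is not None:
    # Same call as in the example above.
    tree = builder.parse("myfile.htm")
```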
Converting XHTML to HTML
The ElementTree interfaces convert the HTML to the XML version of HTML, called XHTML. In this format, all HTML tags live in the {http://www.w3.org/1999/xhtml} namespace. The following code snippet shows how to ‘normalize’ the tree by stripping that namespace from every tag, which turns it back into standard HTML:
    XHTML = "{http://www.w3.org/1999/xhtml}"
    for elem in tree.getiterator():
        if elem.tag.startswith(XHTML):
            elem.tag = elem.tag[len(XHTML):]
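To try the namespace stripping without installing elementtidy, the same loop can be run against a small XHTML string. This sketch assumes Python 3 and the standard library's xml.etree.ElementTree rather than the elementtree package described here, so it uses iter() in place of getiterator(). The function name strip_xhtml_namespace and the sample document are illustrative only.

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"


def strip_xhtml_namespace(root):
    """Remove the XHTML namespace from every element tag, in place."""
    for elem in root.iter():
        # Comments and processing instructions have non-string tags.
        if isinstance(elem.tag, str) and elem.tag.startswith(XHTML):
            elem.tag = elem.tag[len(XHTML):]
    return root


sample = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<body><p>Hi</p></body></html>"
)
root = strip_xhtml_namespace(ET.fromstring(sample))
tags = [elem.tag for elem in root.iter()]
```

After the call, tags holds the plain names html, body, and p.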
Saving HTML Files
To save a plain HTML file, just write out the tree:
    tree.write("outfile.htm")
This works well, as long as the file doesn’t contain any embedded SCRIPT or STYLE tags.
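The restriction on SCRIPT and STYLE most likely comes from XML serialization, which escapes characters such as < and & in element text. The sketch below illustrates the effect. It assumes Python 3 and the standard library's xml.etree.ElementTree, whose method="html" option is not described in this article and may not exist in the older elementtree releases; the sample script text is made up.

```python
import xml.etree.ElementTree as ET

# Build a tiny tree with plain (non-namespaced) HTML tags.
html = ET.Element("html")
head = ET.SubElement(html, "head")
script = ET.SubElement(head, "script")
script.text = "if (a < b && c) { run(); }"

# XML serialization escapes the script body, which breaks the script.
as_xml = ET.tostring(html, encoding="unicode", method="xml")

# HTML serialization writes script and style content unescaped.
as_html = ET.tostring(html, encoding="unicode", method="html")
```

With a version that supports it, passing method="html" to tree.write should have the same effect when saving to a file.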
If you want, you can add a DTD reference to the beginning of the file (here, DTD is a string that holds the document type declaration):
    file = open("outfile.htm", "w")
    file.write(DTD + "\n")
    tree.write(file)
    file.close()
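The snippet above leaves DTD undefined. As a hedged, self-contained sketch, the version below fills it in with the HTML 4.01 Strict declaration, which is only one possible choice. It assumes Python 3 and the standard library's xml.etree.ElementTree, where writing to a text-mode file requires encoding="unicode". The function name write_html_with_dtd is hypothetical.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Illustrative choice of document type; pick the one that fits your markup.
DTD = (
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
    '"http://www.w3.org/TR/html4/strict.dtd">'
)


def write_html_with_dtd(tree, path, dtd=DTD):
    """Write the DTD line followed by the serialized tree."""
    with open(path, "w", encoding="utf-8") as out:
        out.write(dtd + "\n")
        # encoding="unicode" makes write() emit str, matching the text-mode file.
        tree.write(out, encoding="unicode", method="html")


html = ET.Element("html")
body = ET.SubElement(html, "body")
ET.SubElement(body, "p").text = "Hello"

path = os.path.join(tempfile.mkdtemp(), "outfile.htm")
write_html_with_dtd(ET.ElementTree(html), path)

with open(path, encoding="utf-8") as saved:
    saved_text = saved.read()
```

Using a with block also closes the file if the write fails partway through.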
Saving XHTML Files
If you save an XHTML file (where each tag lives in the XHTML namespace), the write method will add a namespace declaration to the html element, and place every tag in an explicit namespace. Some browsers can’t handle this, and may fail to render your document properly.
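One possible way to avoid the explicit prefixes is to make XHTML the default namespace before serializing. The sketch below assumes Python 3 and the standard library's xml.etree.ElementTree, whose register_namespace function is not covered in this article. Note that the registration changes module-wide state, so it affects every later serialization in the same process.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# A tree whose tags all live in the XHTML namespace.
html = ET.Element("{%s}html" % XHTML_NS)
body = ET.SubElement(html, "{%s}body" % XHTML_NS)
body.text = "hello"

# Default behaviour: every tag is written with an explicit prefix.
prefixed = ET.tostring(html, encoding="unicode")

# Map the XHTML namespace to the empty prefix (module-wide setting).
ET.register_namespace("", XHTML_NS)

# Now the namespace is declared once as the default, and tags are unprefixed.
plain = ET.tostring(html, encoding="unicode")
```

The second result carries a single xmlns declaration on the html element and plain tag names, a form that browsers are more likely to accept.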