The ElementSoup Module

The ElementSoup module is a (slightly experimental) wrapper for Leonard Richardson’s robust BeautifulSoup HTML parser. It turns the BeautifulSoup data structure into an element tree. The resulting combo is similar to ElementTidy, but a lot less picky, and therefore a lot more practical. Which is good.

Code (latest versions):

ElementSoup.py
BeautifulSoup.py

Just grab the files and put them in your project directory, or at least somewhere on your Python path.

Usage:

You can use the parse function to quickly parse a file:

import ElementSoup

html = ElementSoup.parse("document.html")
# html is an Element instance

for header in html.findall(".//h1"):
    print repr(header.text)

In this example, the html and header objects are instances of the Element class from the ElementTree module. For more on these classes, see The ElementTree Module.
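To get a feel for the Element interface without installing anything, here is a minimal sketch that uses the standard library's xml.etree.ElementTree on a well-formed snippet. The snippet and the variable names are made up for illustration; the assumption is that an Element returned by ElementSoup.parse responds to findall and text in the same way.

```python
import xml.etree.ElementTree as ET

# A well-formed stand-in for the Element that ElementSoup.parse returns.
html = ET.fromstring(
    "<html><body><h1>First</h1><p>Text</p><h1>Second</h1></body></html>"
)

# findall returns a list of Element objects matching the path;
# each element carries its leading text in the text attribute.
headers = [header.text for header in html.findall(".//h1")]
```

The ".//h1" path matches h1 elements at any depth below the root, which is why both headers are found even though they sit inside body.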

You can also pass in a file-like object. For example, you can parse remote HTML pages by passing in a urllib HTTP stream:

import ElementSoup
import urllib, urlparse

root = "http://www.python.org"

html = ElementSoup.parse(urllib.urlopen(root))

for anchor in html.findall(".//a"):
    href = urlparse.urljoin(root, anchor.get("href"))
    if not href.startswith(root):
        print href # external link

By default, ElementSoup picks the “best” ElementTree implementation it can find. If you want to control which implementation to use, set the ET module attribute before you call the parse function:

import ElementSoup
import cElementTree

ElementSoup.ET = cElementTree

tree = ElementSoup.parse("document.html")
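Picking the “best” implementation can be pictured as a chain of import attempts. This is only a sketch of the general pattern: the pick_elementtree helper and the order of the candidates are assumptions for illustration, not necessarily what ElementSoup does internally.

```python
import importlib


def pick_elementtree():
    """Return the first ElementTree-style module that can be imported."""
    # Assumed preference order: C implementations first, pure Python last.
    candidates = (
        "cElementTree",
        "xml.etree.cElementTree",
        "elementtree.ElementTree",
        "xml.etree.ElementTree",
    )
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue  # not available here, try the next candidate
    raise ImportError("no ElementTree implementation found")


ET = pick_elementtree()
element = ET.Element("html")
```

Whichever module ends up being picked, it exposes the same Element factory, so the rest of the code does not need to know which one is in use.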

Since an element tree can have only one root element, ElementSoup adds a top-level html element if the document doesn’t already have one. For example,

<h1>Title</h1><p>Paragraph.

is turned into

<html><h1>Title</h1><p>Paragraph.</p></html>
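The wrapping step can be sketched with plain ElementTree calls. The ensure_html_root helper below is hypothetical and only illustrates the idea; it is not ElementSoup's own code, and it starts from elements that have already been parsed.

```python
import xml.etree.ElementTree as ET


def ensure_html_root(elements):
    """Return a single html root element for a list of top-level elements."""
    if len(elements) == 1 and elements[0].tag == "html":
        return elements[0]  # the document already has a suitable root
    root = ET.Element("html")
    root.extend(elements)  # move the top-level elements under the new root
    return root


# Build the two top-level elements from the example above.
h1 = ET.Element("h1")
h1.text = "Title"
p = ET.Element("p")
p.text = "Paragraph."

wrapped = ensure_html_root([h1, p])
serialized = ET.tostring(wrapped).decode("ascii")
```

Passing the wrapped element back in returns it unchanged, since it already is a single html root.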