Parsing RSS Files with ElementTree
Fredrik Lundh | July 2003 | Originally posted to online.effbot.org
Part 1: RSS 0.9x and RSS 2.0
Here’s a nifty little Element wrapper class that lets you use Python’s standard attribute access syntax to fetch character data from subelements:
    class ElementWrapper:
        def __init__(self, element):
            self._element = element
        def __getattr__(self, tag):
            if tag.startswith("__"):
                raise AttributeError(tag)
            return self._element.findtext(tag)
(note that the wrapper returns None for missing attributes/subelements, unless the attribute name starts with two underscores)
For example, if feed is an element containing an RSS 2.0 tree, the following code prints the title and link values for all items:
    for item in feed.findall("channel/item"):
        item = ElementWrapper(item)
        print repr(item.title), item.link
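To try the wrapper without a network connection, the following sketch parses a small inline feed. It assumes Python 3, where ElementTree ships in the standard library as xml.etree.ElementTree (the article's own code targets Python 2 and the separate elementtree package). The sample feed is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up RSS 2.0 sample; not taken from a real feed.
SAMPLE = """\
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>First</title><link>http://example.com/1</link></item>
    <item><title>Second</title></item>
  </channel>
</rss>"""


class ElementWrapper:
    def __init__(self, element):
        self._element = element

    def __getattr__(self, tag):
        # Refuse double-underscore lookups so protocol probes fail cleanly.
        if tag.startswith("__"):
            raise AttributeError(tag)
        return self._element.findtext(tag)


feed = ET.fromstring(SAMPLE)
pairs = []
for item in feed.findall("channel/item"):
    item = ElementWrapper(item)
    pairs.append((item.title, item.link))
    print(repr(item.title), item.link)
```

The second item has no link subelement, so its link attribute comes back as None rather than raising an exception.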
Here’s a subclass that wraps an entire RSS tree. This class lets you iterate over the items, and use attribute access to fetch channel-level elements:
    class RSSWrapper(ElementWrapper):
        def __init__(self, feed):
            channel = feed.find("channel")
            ElementWrapper.__init__(self, channel)
            self._items = channel.findall("item")
        def __getitem__(self, index):
            return ElementWrapper(self._items[index])
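As a quick check of how the two levels fit together, here is a Python 3 sketch (an assumption; the article's code is Python 2) that runs the same two classes against a made-up inline feed. Channel-level attributes come from the RSSWrapper itself, and indexing returns wrapped items.

```python
import xml.etree.ElementTree as ET

# Made-up RSS 2.0 sample for illustration only.
SAMPLE = """\
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>First</title><link>http://example.com/1</link></item>
    <item><title>Second</title><link>http://example.com/2</link></item>
  </channel>
</rss>"""


class ElementWrapper:
    def __init__(self, element):
        self._element = element

    def __getattr__(self, tag):
        if tag.startswith("__"):
            raise AttributeError(tag)
        return self._element.findtext(tag)


class RSSWrapper(ElementWrapper):
    def __init__(self, feed):
        channel = feed.find("channel")
        ElementWrapper.__init__(self, channel)
        self._items = channel.findall("item")

    def __getitem__(self, index):
        # An IndexError from the list ends sequence-style iteration.
        return ElementWrapper(self._items[index])


feed = RSSWrapper(ET.fromstring(SAMPLE))
channel_title = feed.title          # channel-level element
first_link = feed[0].link           # item-level element
titles = [item.title for item in feed]
```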
Finally, here’s a short script that uses the wrappers, plus Python’s urllib module, to fetch and parse an RSS feed:
    from elementtree import ElementTree
    from urllib import urlopen

    # note: this URL is a dead link today
    URL = "http://online.effbot.org/rss.xml"

    tree = ElementTree.parse(urlopen(URL))
    feed = RSSWrapper(tree.getroot())

    print "FEED", repr(feed.title)
    for item in feed:
        print "ITEM", repr(item.title), item.link
Errata
Tony Mcdonald noticed that the RSS test program from the first element trick post didn’t work on his Python 2.3 install. The problem is that the for item in feed line fails with a rather confusing error:
    Traceback (most recent call last):
      File "getfeed.py", line 50, in ?
        for item in feed:
    TypeError: 'NoneType' object is not callable
Adding a print statement to the __getattr__ method reveals why: the for-in statement looks for an __iter__ attribute to see if the sequence can provide an iterator object. The wrapper looks for an __iter__ element in the RSS tree, notices that there is no such element, and returns None. The for-in statement, in turn, thinks it has found an iterator factory, and attempts to call the None object to create an iterator.
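The lookup itself can be reproduced explicitly. The sketch below assumes Python 3 and uses a hypothetical NaiveWrapper class that lacks the double-underscore guard. Note that Python 3 looks up special methods such as __iter__ on the type instead of going through __getattr__, so the for-in failure described above is specific to the classic classes of Python 2; the explicit getattr call here only illustrates what the interpreter found back then.

```python
import xml.etree.ElementTree as ET


class NaiveWrapper:
    """Hypothetical wrapper without the double-underscore guard."""

    def __init__(self, element):
        self._element = element

    def __getattr__(self, tag):
        return self._element.findtext(tag)


wrapper = NaiveWrapper(ET.fromstring("<channel><title>Example</title></channel>"))

# The lookup "succeeds" and returns None instead of raising AttributeError,
# so the attribute appears to exist.
probe = getattr(wrapper, "__iter__")
has_iter = hasattr(wrapper, "__iter__")

# Calling the result fails with the TypeError shown in the traceback.
try:
    probe()
except TypeError as exc:
    message = str(exc)
```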
Fixing this is straightforward; you can either change the wrapper to raise AttributeError exceptions for all suspicious attributes…
    class ElementWrapper:
        def __init__(self, element, ns=None):
            self._element = element
            self._ns = ns or ""
        def __getattr__(self, tag):
            if tag.startswith("__"):
                raise AttributeError(tag)
            return self._element.findtext(self._ns + tag)
…or you can add an iterator factory to the RSSWrapper class, which is the class acting like a sequence:
    class RSSWrapper(ElementWrapper):
        ...
        def __iter__(self):
            return iter([self[i] for i in range(len(self._items))])
It’s tempting to define the iterator factory to simply return iter(self._items) or even iter(self), but neither variant works. The former won’t wrap each item in an ElementWrapper instance, and the latter results in runaway recursion, since iter() checks if the sequence happens to have an __iter__ attribute…
Part 2: RSS 1.0
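Another option, not used in the article, is a generator that wraps each element as it is requested instead of building a list up front. A minimal sketch, assuming Python 3 and a made-up sample feed:

```python
import xml.etree.ElementTree as ET

# Made-up RSS 2.0 sample for illustration only.
SAMPLE = """\
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>First</title></item>
    <item><title>Second</title></item>
  </channel>
</rss>"""


class ElementWrapper:
    def __init__(self, element):
        self._element = element

    def __getattr__(self, tag):
        if tag.startswith("__"):
            raise AttributeError(tag)
        return self._element.findtext(tag)


class RSSWrapper(ElementWrapper):
    def __init__(self, feed):
        channel = feed.find("channel")
        ElementWrapper.__init__(self, channel)
        self._items = channel.findall("item")

    def __getitem__(self, index):
        return ElementWrapper(self._items[index])

    def __iter__(self):
        # Iterate over the raw element list (which never re-enters this
        # method) and wrap each element on the way out.
        for element in self._items:
            yield ElementWrapper(element)


feed = RSSWrapper(ET.fromstring(SAMPLE))
wrapped = list(feed)
titles = [item.title for item in wrapped]
```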
The RSSWrapper class from part 1 supports RSS 0.9x and 2.0 feeds. There’s also a third RSS format, RSS 1.0, which is based on RDF, and stores the channel and item information in a slightly different fashion.
In RSS 1.0, the toplevel element is {http://www.w3.org/1999/02/22-rdf-syntax-ns#}RDF instead of rss, and all RSS 1.0 elements live in the {http://purl.org/rss/1.0/} namespace (for example, {http://purl.org/rss/1.0/}channel, {http://purl.org/rss/1.0/}title, etc).
And unlike the other RSS variants, the item elements are siblings to the channel element, not children.
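The layout is easy to see on a small document. The following Python 3 sketch (an assumption; the article itself uses Python 2) parses a made-up, heavily simplified RSS 1.0 fragment and shows both the namespace-qualified tag names and the fact that the items sit next to the channel instead of inside it.

```python
import xml.etree.ElementTree as ET

NS_RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"
NS_RSS = "{http://purl.org/rss/1.0/}"

# Made-up, simplified sample; a real RSS 1.0 feed carries more
# elements and attributes than this.
SAMPLE = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/">
  <channel><title>Example feed</title></channel>
  <item><title>First</title></item>
  <item><title>Second</title></item>
</rdf:RDF>"""

root = ET.fromstring(SAMPLE)
root_tag = root.tag
child_tags = [child.tag for child in root]

channel = root.find(NS_RSS + "channel")
items_under_channel = channel.findall(NS_RSS + "item")  # nothing here
items_under_root = root.findall(NS_RSS + "item")        # siblings of channel
```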
To deal with namespaces, I’ve added a ns parameter to the ElementWrapper class. If provided, the parameter is used as a namespace prefix when accessing attributes:
    class ElementWrapper:
        def __init__(self, element, ns=None):
            self._element = element
            self._ns = ns or ""
        def __getattr__(self, tag):
            if tag.startswith("__"):
                raise AttributeError(tag)
            return self._element.findtext(self._ns + tag)
In addition, the RSSWrapper class has to check the toplevel element, and set up the namespace and items list according to the actual RSS version:
    # RSS 1.0 namespaces
    NS_RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"
    NS_RSS = "{http://purl.org/rss/1.0/}"

    class RSSWrapper(ElementWrapper):
        def __init__(self, feed):
            ns = None
            if feed.tag == NS_RDF + "RDF":
                # RSS 1.0
                ns = NS_RSS # all RSS elements live in this namespace
                channel = feed.find(NS_RSS + "channel")
                items = feed.findall(NS_RSS + "item")
            else:
                # RSS 0.9x or 2.0
                channel = feed.find("channel")
                items = channel.findall("item")
            ElementWrapper.__init__(self, channel, ns)
            self._items = items
        def __iter__(self):
            return iter([self[i] for i in range(len(self))])
        def __len__(self):
            return len(self._items)
        def __getitem__(self, index):
            return ElementWrapper(self._items[index], self._ns)
Updated 2003-07-30: added __iter__ and __len__ hooks to make this work a bit better under Python 2.3.
It might be a good idea to add some more format checks; as it stands, the wrapper will treat any RDF file as a (most likely empty) RSS feed.
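One possible check, sketched here in Python 3 with a hypothetical helper named is_rss10, is to require that the RDF document actually contains an RSS 1.0 channel element before wrapping it. Both sample documents are made up.

```python
import xml.etree.ElementTree as ET

NS_RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"
NS_RSS = "{http://purl.org/rss/1.0/}"


def is_rss10(feed):
    """Hypothetical helper: True only for an RDF root with an RSS 1.0 channel."""
    if feed.tag != NS_RDF + "RDF":
        return False
    return feed.find(NS_RSS + "channel") is not None


# An RDF document that carries an RSS 1.0 channel.
rss10 = ET.fromstring(
    '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" '
    'xmlns="http://purl.org/rss/1.0/"><channel/></rdf:RDF>'
)

# An RDF document with no RSS content at all.
plain_rdf = ET.fromstring(
    '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"/>'
)

rss10_ok = is_rss10(rss10)
plain_ok = is_rss10(plain_rdf)
```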
Checking for different types in the constructor works well as long as we only need to support two similar formats, but it’s not a very extensible design. Here’s a straightforward refactoring, using separate wrapper classes for the two formats, and a factory function that loads the RSS file and wraps it up in the right class:
    from elementtree import ElementTree
    from urllib import urlopen

    NS_RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"
    NS_RSS = "{http://purl.org/rss/1.0/}"

    class ElementWrapper:
        def __init__(self, element, ns=None):
            self._element = element
            self._ns = ns or ""
        def __getattr__(self, tag):
            return self._element.findtext(self._ns + tag)

    class RSSWrapper(ElementWrapper):
        def __init__(self, channel, items, ns=None):
            self._items = items
            ElementWrapper.__init__(self, channel, ns)
        def __iter__(self):
            return iter([self[i] for i in range(len(self))])
        def __len__(self):
            return len(self._items)
        def __getitem__(self, index):
            return ElementWrapper(self._items[index], self._ns)

    class RSS1Wrapper(RSSWrapper):
        def __init__(self, feed):
            RSSWrapper.__init__(
                self, feed.find(NS_RSS + "channel"),
                feed.findall(NS_RSS + "item"),
                NS_RSS
                )

    class RSS2Wrapper(RSSWrapper):
        def __init__(self, feed):
            channel = feed.find("channel")
            RSSWrapper.__init__(
                self, channel, channel.findall("item")
                )

    def getfeed(path):
        tree = ElementTree.parse(urlopen(path))
        feed = tree.getroot()
        if feed.tag == NS_RDF + "RDF":
            return RSS1Wrapper(feed)
        if feed.tag == "rss":
            return RSS2Wrapper(feed)
        raise IOError("unknown feed format")

    # try it out
    feed = getfeed("http://online.effbot.org/rss.xml")

    print "FEED", repr(feed.title)
    for item in feed:
        print "ITEM", repr(item.title), item.link
Part 3: RSS 0.9
I keep forgetting that there’s a fourth RSS format out there: the original RSS 0.90 format, created for the my.netscape.com portal back in the early dark ages. Very few sites use this format today, but you can, for example, subscribe to an RSS 0.90 feed from slashdot.
Like 1.0, the 0.90 format is based on RDF, but it uses a different namespace for the RSS elements. To distinguish between the formats, you can check the namespace of the channel element:
    NS_RSS_09 = "{http://my.netscape.com/rdf/simple/0.9/}"
    NS_RSS_10 = "{http://purl.org/rss/1.0/}"

    def getfeed(path):
        tree = ElementTree.parse(urlopen(path))
        feed = tree.getroot()
        if feed.tag == NS_RDF + "RDF":
            # check the namespace of the first channel tag
            for elem in feed:
                if elem.tag.endswith("channel"):
                    if elem.tag.startswith(NS_RSS_09):
                        return RSS0Wrapper(feed)
                    if elem.tag.startswith(NS_RSS_10):
                        return RSS1Wrapper(feed)
        elif feed.tag == "rss":
            return RSS2Wrapper(feed)
        raise IOError("unknown feed format")
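The detection logic can be exercised on its own, without the wrapper classes or a network fetch. The Python 3 sketch below uses a hypothetical detect_format function that returns a text label instead of a wrapper instance, and a hypothetical make_rdf helper that builds minimal test documents; the labels and sample documents are made up.

```python
import xml.etree.ElementTree as ET

NS_RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"
NS_RSS_09 = "{http://my.netscape.com/rdf/simple/0.9/}"
NS_RSS_10 = "{http://purl.org/rss/1.0/}"


def detect_format(feed):
    """Hypothetical helper: return a label for the feed format, or None."""
    if feed.tag == NS_RDF + "RDF":
        # Check the namespace of the first channel tag.
        for elem in feed:
            if elem.tag.endswith("channel"):
                if elem.tag.startswith(NS_RSS_09):
                    return "rss-0.90"
                if elem.tag.startswith(NS_RSS_10):
                    return "rss-1.0"
    elif feed.tag == "rss":
        return "rss-0.9x/2.0"
    return None


def make_rdf(ns_uri):
    """Build a minimal RDF document whose channel lives in ns_uri."""
    return ET.fromstring(
        '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" '
        'xmlns="%s"><channel/></rdf:RDF>' % ns_uri
    )


# Strip the surrounding braces to get the bare namespace URIs.
label_09 = detect_format(make_rdf(NS_RSS_09[1:-1]))
label_10 = detect_format(make_rdf(NS_RSS_10[1:-1]))
label_other = detect_format(make_rdf("http://example.com/other/"))
label_rss = detect_format(ET.fromstring("<rss><channel/></rss>"))
```

An RDF document whose channel element lives in an unrelated namespace falls through to None here, which corresponds to the IOError raised by getfeed.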
To implement the RSS0Wrapper class, you can simply make a copy of RSS1Wrapper and change the namespace.
Or you can factor out the initialization code into a common base class, like this:
    class RDFWrapper(RSSWrapper):
        def __init__(self, feed, ns):
            RSSWrapper.__init__(
                self, feed.find(ns + "channel"),
                feed.findall(ns + "item"),
                ns
                )

    class RSS0Wrapper(RDFWrapper):
        def __init__(self, feed):
            RDFWrapper.__init__(self, feed, NS_RSS_09)

    class RSS1Wrapper(RDFWrapper):
        def __init__(self, feed):
            RDFWrapper.__init__(self, feed, NS_RSS_10)
You can also get rid of the individual wrappers, and create RDFWrapper instances in the factory function:
    def getfeed(path):
        ...
                    if elem.tag.startswith(NS_RSS_09):
                        return RDFWrapper(feed, NS_RSS_09)
                    if elem.tag.startswith(NS_RSS_10):
                        return RDFWrapper(feed, NS_RSS_10)
The drawback with this approach is that you cannot check the instance type to see what kind of feed you have (this might be seen as an advantage, by some), and there’s no place to put version-specific code if/when you need to extend the wrapper interface.
Part 4: Pie/Echo/Atom
Note: This was written when Atom was still being developed; the code here probably won’t work with the final Atom version.
Talking about formats, there’s actually a fifth RSS format being developed as we speak. It’s not called RSS, though: it has been known as Pie and Echo, and is currently called Atom. (And chances are that they’ve changed both the name and the format since I started writing this article ;-)
Anyway, at the moment, Pie/Echo/Atom (PEA) feeds are similar to RSS feeds; they contain information about the source, and a list of items, each consisting of a title, a link, a description, and auxiliary data such as publication dates and identifiers. However, most elements have been renamed and/or redesigned. For example, the PEA toplevel element is named feed, items are stored in entry elements, and PEA uses XML attributes instead of character data in lots of places.
As in the RDF-based RSS formats, PEA elements live in a namespace (usually {http://purl.org/echo/}, but other namespaces may be used, at least in experimental feeds).
To check for a PEA feed, you can look for a feed toplevel element:
    def getfeed(path):
        ...
        if feed.tag.endswith("feed"):
            return PEAWrapper(feed)
To be able to deal with experimental PEA feeds, the wrapper will have to extract the namespace from the toplevel element. Also note that the items list contains entry elements:
    class PEAWrapper(RSSWrapper):
        def __init__(self, feed):
            ns = feed.tag[:feed.tag.index("}")+1]
            RSSWrapper.__init__(
                self, feed, feed.findall(ns + "entry"), ns
                )
If you add this code to the test program, and run it on a PEA feed (e.g. the one at joelonsoftware), you’ll find that the program prints the expected titles, but that the links are all empty (or set to None).
The reason for this is that the link element uses an attribute to hold the actual URL. To extract the link, you need to get your hands on the actual element, and use get to fetch the attribute value.
An easy way to do this is to use a custom ElementWrapper subclass to represent PEA entries, and treat the link element separately. Here’s a first version:
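The difference between the two lookups can be seen on a single entry. This Python 3 sketch uses a made-up entry in the {http://purl.org/echo/} namespace; the element layout is an assumption for illustration and does not follow any particular draft of the format.

```python
import xml.etree.ElementTree as ET

NS = "{http://purl.org/echo/}"

# Made-up entry: the URL lives in an attribute, not in character data.
entry = ET.fromstring(
    '<entry xmlns="http://purl.org/echo/">'
    "<title>First</title>"
    '<link rel="alternate" href="http://example.com/1"/>'
    "</entry>"
)

# findtext finds the element, but it has no text content.
text_value = entry.findtext(NS + "link")

# find returns the element itself, so the attribute can be read with get.
link = entry.find(NS + "link")
href = link.get("href") if link is not None else None

# A subelement that does not exist at all yields None from findtext.
missing = entry.findtext(NS + "summary")
```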
    class PEAEntryWrapper(ElementWrapper):
        def __getattr__(self, tag):
            if tag == "link":
                # return the href attribute for the first link element
                elem = self._element.find(self._ns + tag)
                if elem is None:
                    return None
                return elem.get("href")
            return ElementWrapper.__getattr__(self, tag)

    class PEAWrapper(RSSWrapper):
        def __init__(self, feed):
            # extract namespace from toplevel element
            ns = feed.tag[:feed.tag.index("}")+1]
            RSSWrapper.__init__(
                self, feed, feed.findall(ns + "entry"), ns
                )
        def __getitem__(self, index):
            return PEAEntryWrapper(self._items[index], self._ns)
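One detail worth noting in PEAWrapper: the namespace extraction relies on str.index, which raises ValueError if the toplevel tag contains no "}" at all, as it would for a feed without a namespace. A slightly more forgiving variant could look like the Python 3 sketch below, where namespace_of is a hypothetical helper and both sample documents are made up.

```python
import xml.etree.ElementTree as ET


def namespace_of(tag):
    """Hypothetical helper: return the "{uri}" prefix of a tag, or "" if none."""
    if tag.startswith("{"):
        return tag[:tag.index("}") + 1]
    return ""


namespaced = ET.fromstring('<feed xmlns="http://purl.org/echo/"><entry/></feed>')
plain = ET.fromstring("<feed><entry/></feed>")

ns = namespace_of(namespaced.tag)
entries = namespaced.findall(ns + "entry")

plain_ns = namespace_of(plain.tag)
plain_entries = plain.findall(plain_ns + "entry")
```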
I’ll get back to PEA in a later element tricks article.