The EffNews Project: Building an RSS Newsreader
September 2002
“The tiny diamond-tipped pen shivered and twitched like one insane, and it seemed to Pugg that any minute now he would learn the most fabulous, unheard-of things, things that would open up to him the Ultimate Mystery of Being, so he greedily read everything that flew out from under the diamond nib… the sizes of bedroom slippers available on the continent of Cob, with pompons and without… And the average width of the fontanel in indigenous steppe infants… and the inaugural catcalls of the Duke of Zilch, and six ways to cook cream of wheat… and the names of all the citizens of Foofaraw Junction beginning with the letter M, and the results of a poll of opinions on the taste of beer mixed with mushroom syrup…” — The Cyberiad, by Stanislaw Lem
Introducing the EffNews Project
This effbot.org project aims to build a simple RSS-based newsreader (aka “aggregator”) with a graphical user interface front-end, similar to applications like Headline Viewer and NetNewsWire (dead link). The reader will be based on standard Python cross-platform tools, which means that it will run on Windows, Unix (including Linux), and hopefully also Macintosh.
The RSS file format is an XML-based file format that provides a “site summary”, that is, a brief summary of information published on a web site. It’s usually used to provide a machine readable version of the contents on a news site or a weblog.
- Part 1. Fetching RSS Files
- Part 2. Fetching and Parsing RSS Files
- Part 3. Displaying RSS Files
- Part 4. Parsing More RSS Files
- Part 5. Odds and Ends (In Progress)
- Part 6. Using the ElementTree Module to Parse RSS Files (In Progress)
“And it grew dark before his hundred eyes, and he cried out in a mighty voice that he’d had enough, but Information had so swathed and swaddled him in its three hundred thousand tangled paper miles that he couldn’t move and had to read on about how Kipling would have written the beginning to his second Jungle Book if he had had indigestion just then, and what thoughts come to unmarried whales… and why we don’t capitalize paris in the plaster of paris.”
EffNews Part 1: Fetching RSS Files

RSS Files
The RSS file format is an XML-based file format that provides a “site summary”, that is, a brief summary of information published on a site. It’s usually used to provide a machine readable version of the contents on a news site or a weblog.
Depending on who you talk to, RSS means “Rich Site Summary”, “RDF Site Summary”, or “Really Simple Syndication (dead link)” (or perhaps “Really Small Something”). It was originally created by Netscape for use on their my.netscape.com site, and was later developed into two similar but slightly differing versions, RSS 0.9x/2.0 (dead link) and RSS 1.0.
An RSS 0.9x file might look something like this:
<?xml version="1.0"?>
<rss version="0.91">
  <channel>
    <title>the eff-bot online</title>
    <link>http://online.effbot.org</link>
    <description>Fredrik Lundh's clipbook.</description>
    <language>en-us</language>
    ...
    <item>
      <title>spam, spam, spam</title>
      <link>http://online.effbot.org#85292735</link>
      <description>for the first seven months of 2002, the spam
      filters watching fredrik@pythonware.com has</description>
    </item>
    ...
  </channel>
</rss>
The content consists of some descriptive information (the site’s title, a link to an HTML rendering of the content, etc) and a number of item elements, each of which contains an item title, a link, and a (usually brief) description.
We’ll look into RSS parsing and other RSS formats in later articles. For now, we’re more interested in getting our hands on some RSS files to parse…
Using HTTP to Download Files
Like all other resources on the web, an RSS file is identified by a uniform resource identifier (URI). A typical RSS URI might look something like:
http://online.effbot.org/rss.xml (dead link)
To fetch this RSS file, the aggregator connects to the computer named online.effbot.org and issues an HTTP request, asking the server to return the document identified as /rss.xml.
Here’s a minimal HTTP request message that does exactly this:
GET /rss.xml HTTP/1.0
Host: online.effbot.org
The message should be followed by an empty line.
If everything goes well, the HTTP server responds with a status line, followed by a number of header lines, an empty line, and the RSS file itself:
HTTP/1.1 200 OK
Last-Modified: Tue, 03 Sep 2002 11:04:09 GMT
ETag: "1e49dc-dfa-3d749729"
Content-Length: 3578
Content-Type: text/xml
Connection: close

...RSS data...
Sending an HTTP request
Python makes it easy to issue HTTP requests. Here’s an example that uses the socket module, which is a low-level interface for network communication:
HOST = "online.effbot.org"
PATH = "/rss.xml"

import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect((HOST, 80))

sock.send("GET %s HTTP/1.0\r\nHost: %s\r\n\r\n" % (PATH, HOST))

while 1:
    text = sock.recv(2048)
    if not text:
        break
    print "read", len(text), "bytes"

sock.close()
The socket.socket call creates a socket for the INET (internet) network, and of the STREAM (reliable byte stream) type. This is more commonly known as a TCP connection.
The connect method is used to connect to a remote computer. The method takes a tuple containing two values: the computer name and the port number to use on that computer. In this example, we’re using port 80, which is the standard port for HTTP.
The send method is used to send the HTTP request to the server. Note that lines are separated by both a carriage return (\r) and a newline (\n), and that there’s an extra empty line at the end of the request.
The recv method, finally, is used to read data from the socket. Like the standard read method, it returns an empty string when there’s no more data to read.
Using an HTTP support library
In addition to the low-level socket module, Python’s standard library comes with modules that support common network protocols, including HTTP. The most obvious choice, httplib, is an intermediate-level library that provides only a thin layer on top of the socket library.
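For comparison, here’s a minimal sketch (not from the original article) of the same request issued through httplib’s old-style HTTP interface, reusing the HOST and PATH variables from the socket example above:

import httplib

# issue the request (port 80 is the default)
h = httplib.HTTP(HOST)
h.putrequest("GET", PATH)
h.putheader("Host", HOST)
h.endheaders()

# read the response status and headers, then the body
errcode, errmsg, headers = h.getreply()
text = h.getfile().read()

print "read", len(text), "bytes"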
The urllib module provides a higher-level interface. It takes an URL, generates a full HTTP request, parses the response header, and returns a file-like object that can be used to read the rest of the response right off the server:
import urllib

file = urllib.urlopen("http://" + HOST + PATH)

text = file.read()

print "read", len(text), "bytes"
Asynchronous HTTP
A problem with both the low-level socket library and urllib is that you can only read data from one site at a time. If you use sockets, the connect and recv calls may block, waiting for the server to respond. If you use urllib, both the urlopen and the read methods may block for the same reason.
If the task here was to create some kind of batch RSS aggregator, the easiest solution would probably be to ignore this problem, and read one site at a time. Who cares if it takes one second or ten minutes to check all channels; it would take much longer to visit all the sites by hand anyway.
However, in an interactive application, it’s rather bad style to block for an unknown amount of time. The application must be able to download things in the background, without locking up the user interface.
There are a number of ways to address this (including background processes and threads), but in this project, we’ll use something called asynchronous sockets, as provided by Python’s asyncore module.
The asyncore module provides “reactive” sockets, meaning that instead of creating socket objects, and calling methods on them to do things, your code is called by the socket framework when something can be done. This approach is known as event-driven programming.
The asyncore module contains a basic dispatcher class that represents a reactive socket. There’s also an extension to that class called dispatcher_with_send, which adds buffered output.
For the HTTP client, all you have to do is to subclass the dispatcher_with_send class, and implement the following methods:
- handle_connect is called when a connection is successfully established.
- handle_expt is called when a connection fails (Windows only; on most other platforms, connection failures are indicated by errors when writing to, or reading from, the socket).
- handle_read is called when there is data waiting to be read from the socket. The callback should call the recv method to get the data.
- handle_close is called when the socket is closed or reset.
Here’s a first version:
import asyncore
import string, socket

class async_http(asyncore.dispatcher_with_send):
    # asynchronous http client

    def __init__(self, host, path):
        asyncore.dispatcher_with_send.__init__(self)
        self.host = host
        self.path = path
        self.header = None
        self.data = ""
        self.create_socket(socket.AF_INET, socket.SOCK_STREAM)
        self.connect((host, 80))

    def handle_connect(self):
        # connection succeeded; send request
        self.send(
            "GET %s HTTP/1.0\r\nHost: %s\r\n\r\n" % (self.path, self.host)
            )

    def handle_expt(self):
        # connection failed
        self.close()

    def handle_read(self):
        # deal with incoming data
        data = self.recv(2048)
        if not self.header:
            # check if we have a full header
            self.data = self.data + data
            try:
                i = string.index(self.data, "\r\n\r\n")
            except ValueError:
                return # no empty line; continue
            self.header = self.data[:i+2]
            print self.host, "HEADER"
            print
            print self.header
            data = self.data[i+4:]
            self.data = ""
        if data:
            print self.host, "DATA", len(data)

    def handle_close(self):
        self.close()
The constructor creates a socket, and issues a connection request. Unlike ordinary sockets, the asynchronous connect method returns immediately; the framework calls the handle_connect method once it’s finished. When this method is called, our class immediately issues an HTTP request for the given RSS file. The framework makes sure that the request is sent as soon as the network is ready.
When the remote computer gets the request, it returns a response message. As data arrives, the handle_read method is called over and over again, until there’s no more data to read. Our handle_read method starts by looking for the header section (or rather, the empty line that identifies the end of the header). After that, it simply prints DATA messages to standard output.
Let’s try this one out on a real site:
$ python
>>> from minimal_http_client import async_http
>>> async_http("online.effbot.org", "/rss.xml")
<async_http at 880294>
>>> import asyncore
>>> asyncore.loop()
online.effbot.org HEADER

HTTP/1.1 200 OK
Server: Apache/1.3.22 (Unix)
Last-Modified: Tue, 03 Sep 2002 11:04:09 GMT
ETag: "1e49dc-dfa-3d749729"
Content-Length: 3578
Content-Type: text/xml
Connection: close

online.effbot.org DATA 1139
online.effbot.org DATA 2048
online.effbot.org DATA 391
To issue a request, just create an instance of the async_http class. The instance registers itself with the asyncore framework, and all you have to do to run it is to call the asyncore.loop function.
The real advantage here is that you can issue multiple requests at once…
>>> async_http("www.scripting.com", "/rss.xml")
<async_http at 8da7a4>
>>> async_http("online.effbot.org", "/rss.xml")
<async_http at 8daf34>
>>> async_http("www.bbc.co.uk",
...     "/syndication/feeds/news/ukfs_news/front_page/rss091.xml")
<async_http at 8db364>
>>> asyncore.loop()
…and have the framework process all requests in parallel:
online.effbot.org HEADER
...
online.effbot.org DATA 1139
online.effbot.org DATA 2048
online.effbot.org DATA 391
www.scripting.com HEADER
...
www.scripting.com DATA 1189
www.scripting.com DATA 1460
www.bbc.co.uk HEADER
...
www.bbc.co.uk DATA 1766
www.bbc.co.uk DATA 712
www.scripting.com DATA 1460
www.scripting.com DATA 1460
www.scripting.com DATA 1158
(Actual headers omitted.)
The actual output may vary depending on your network connection, the servers, and the phase of the moon.
To get a bit more variation, put the above statements in a script and run the script a couple of times.
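For example, assuming the client class has been saved in a module named minimal_http_client (as in the interactive session above), such a script might look like this:

import asyncore
from minimal_http_client import async_http

# issue all requests up front...
async_http("online.effbot.org", "/rss.xml")
async_http("www.scripting.com", "/rss.xml")
async_http("www.bbc.co.uk",
    "/syndication/feeds/news/ukfs_news/front_page/rss091.xml")

# ...and let the framework process them in parallel
asyncore.loop()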
Storing the RSS Data
The code we’ve used this far simply prints information to the screen. Before moving on to parsing and display issues, let’s add some code to store the RSS data on disk.
The following version adds support for a consumer object, which is called when we’ve read the header, when data is arriving, and when there is no more data. A consumer should implement the following methods:
- http_header(client) is called when we’ve read the HTTP header. It’s called with a reference to the client object, and can use attributes like status and header to inspect the response header.
- http_failed(client) is similar to http_header, but is called if the framework fails to connect to the remote computer.
- feed(data) is called when a number of bytes have been read from the remote computer, after the header has been read.
- close() is called when there is no more data.
In addition to consumer support, the following code uses the mimetools module to parse the header into a dictionary-like structure, adds counters for incoming and outgoing data (plus a timestamp for when the request was issued), and uses a factory function that knows how to pull a URL apart.
import asyncore
import socket, time
import StringIO
import mimetools, urlparse

class async_http(asyncore.dispatcher_with_send):
    # asynchronous http client

    def __init__(self, host, port, path, consumer):
        asyncore.dispatcher_with_send.__init__(self)
        self.host = host
        self.port = port
        self.path = path
        self.consumer = consumer
        self.status = None
        self.header = None
        self.bytes_in = 0
        self.bytes_out = 0
        self.timestamp = time.time() # when the request was issued (used for timeouts later)
        self.data = ""
        self.create_socket(socket.AF_INET, socket.SOCK_STREAM)
        self.connect((host, port))

    def handle_connect(self):
        # connection succeeded
        text = "GET %s HTTP/1.0\r\nHost: %s\r\n\r\n" % (self.path, self.host)
        self.send(text)
        self.bytes_out = self.bytes_out + len(text)

    def handle_expt(self):
        # connection failed; notify consumer
        self.close()
        self.consumer.http_failed(self)

    def handle_read(self):
        data = self.recv(2048)
        self.bytes_in = self.bytes_in + len(data)
        if not self.header:
            # check if we've seen a full header
            self.data = self.data + data
            header = self.data.split("\r\n\r\n", 1)
            if len(header) <= 1:
                return
            header, data = header
            # parse header
            fp = StringIO.StringIO(header)
            self.status = fp.readline().split(" ", 2)
            self.header = mimetools.Message(fp)
            self.data = ""
            self.consumer.http_header(self)
            if not self.connected:
                return # channel was closed by consumer
        if data:
            self.consumer.feed(data)

    def handle_close(self):
        self.consumer.close()
        self.close()

def do_request(uri, consumer):
    # turn the uri into a valid request
    scheme, host, path, params, query, fragment = urlparse.urlparse(uri)
    assert scheme == "http", "only supports HTTP requests"
    try:
        host, port = host.split(":", 1)
        port = int(port)
    except (TypeError, ValueError):
        port = 80 # default port
    if not path:
        path = "/"
    if params:
        path = path + ";" + params
    if query:
        path = path + "?" + query
    return async_http(host, port, path, consumer)
Here’s a small test program that uses the enhanced client and a “dummy” consumer class:
import http_client, asyncore

class dummy_consumer:
    def http_header(self, client):
        self.host = client.host
        print self.host, repr(client.status)
    def http_failed(self, client):
        print self.host, "failed"
    def feed(self, data):
        print self.host, len(data)
    def close(self):
        print self.host, "CLOSE"

URLS = (
    "http://online.effbot.org/rss.xml",
    "http://www.scripting.com/rss.xml",
    "http://www.bbc.co.uk/syndication/feeds" +
        "/news/ukfs_news/front_page/rss091.xml",
    "http://www.example.com/rss.xml",
    )

for url in URLS:
    http_client.do_request(url, dummy_consumer())

asyncore.loop()
Here’s some sample output from this test program. Note the 404 error code from the example.com site.
online.effbot.org ['HTTP/1.1', '200', 'OK\r\n']
online.effbot.org 1139
online.effbot.org 1460
online.effbot.org 979
online.effbot.org CLOSE
www.bbc.co.uk ['HTTP/1.1', '200', 'OK\r\n']
www.bbc.co.uk 1766
www.bbc.co.uk 711
www.scripting.com ['HTTP/1.1', '200', 'OK\r\n']
www.scripting.com 1189
www.bbc.co.uk CLOSE
www.scripting.com 1460
www.example.com ['HTTP/1.1', '404', 'Not Found\r\n']
www.example.com 269
www.example.com CLOSE
www.scripting.com 1460
www.scripting.com 1460
www.scripting.com 1158
www.scripting.com CLOSE
To store things on disk, replace the dummy with a version that writes data to a file:
class file_consumer:

    def http_header(self, client):
        self.host = client.host
        self.file = None

    def http_failed(self, client):
        pass

    def feed(self, data):
        if self.file is None:
            self.file = open(self.host + ".rss", "w")
        self.file.write(data)

    def close(self):
        if self.file is not None:
            print self.host + ".rss ok"
            self.file.close()
            self.file = None
If you modify the test program to use this consumer instead of the dummy version, it’ll print something like this:
online.effbot.org.rss ok
www.example.com.rss ok
www.bbc.co.uk.rss ok
www.scripting.com.rss ok
Three of the four files contain current RSS data. The fourth (from example.com) contains an HTML error message. To avoid storing error messages, it’s probably a good idea to let the consumer check the status field as well as the Content-Type header field. You can do this in the http_header method:
class file_consumer:

    def http_header(self, client):
        if (client.status[1] != "200" or
            client.header["content-type"] != "text/xml"):
            print client.host, "failed"
            client.close() # bail out
            client.connected = 0
            return
        self.host = client.host
        self.file = None

    ...
Note that the consumer can simply call the client’s close method to shut down the connection. The client contains code that checks that it’s still connected after the http_header call, and avoids calling other consumer methods if it’s not.
Update 2002-09-08: not all versions of asyncore clear the connected attribute when the socket is closed. For example, the version shipped with Python 1.5.2 does, but the version shipped with 2.1 doesn’t. To be on the safe side, you have to clear the flag yourself in the consumer.
That’s all for today. In the next article, we’ll look at how to parse at least some variant of the RSS format into a more useful data format.
While waiting, feel free to play with the code we’ve produced this far. Also, don’t forget to take a look at the RSS data files we just downloaded. Mark Nottingham’s RSS tutorial contains links to more information on various RSS formats.
EffNews Part 2: Fetching and Parsing RSS Data

Intermission: Did Anyone Spot The Error Message?
As some of you may have noticed, if you add the last code snippet from the previous article to the test program, a couple of strange-looking lines of text appear among the ok/failed messages:
online.effbot.org done
www.bbc.co.uk done
www.example.com failed
error: uncaptured python exception, closing channel <async_http connected
at 8eb07c> (exceptions.AttributeError:file_consumer instance has no
attribute 'file' [C:\py21\lib\asyncore.py|poll|95]
[C:\py21\lib\asyncore.py|handle_read_event|383]
[http_client.py|handle_read|77] [my-test-program.py|feed|15])
www.scripting.com done
(Directory names and line numbers may vary.)
The error: uncaptured python exception message is generated by asyncore’s default error handler when a callback raises a Python exception. This message is actually a compact rendition of a standard Python traceback, printed on a single line. Here’s the deciphered version:
www.bbc.co.uk done
www.example.com
Traceback (most recent call last):
  File C:\py21\lib\asyncore.py, line 95, in poll:
  File C:\py21\lib\asyncore.py, line 383, in handle_read_event:
  File http_client.py, line 77, in handle_read:
  File my-test-program.py, line 15, in feed:
AttributeError: file_consumer instance has no attribute 'file'
online.effbot.org done
www.scripting.com done
So what’s causing this error?
Note that the AttributeError occurs in the feed method, which appears to be called despite the fact that the consumer did close the socket in the http_header method.
The http_client code is supposed to deal with this, by checking the connected flag attribute after calling the consumer’s http_header method. That flag was cleared by the close method in earlier versions of asyncore, but that behaviour changed somewhere on the way from Python 1.5.2 to Python 2.1.
(And the reason I didn’t notice was sloppy testing: my test script contained enough debugging print statements to make me miss the error message. Sorry for that.)
Closing the Channel From the Consumer, Revisited
The obvious workaround is of course to explicitly clear the attribute in the consumer’s http_header method:
class file_consumer:

    def http_header(self, client):
        if (client.status[1] != "200" or
            client.header["content-type"] != "text/xml"):
            print client.host, "failed"
            client.close() # bail out
            client.connected = 0
            return
        self.host = client.host
        self.file = None

    ...
However, the connected flag is undocumented, and may (in theory) disappear in future versions of asyncore.
To make your code more future-proof, it’s better to use a return value or an exception to indicate that the channel should be closed.
The following example uses a custom CloseConnection exception for this purpose:
class file_consumer:

    def http_header(self, client):
        if (client.status[1] != "200" or
            client.header["content-type"] != "text/xml"):
            print client.host, "failed"
            raise http_client.CloseConnection
        self.host = client.host
        self.file = None
Here are the necessary additions to the http_client module:
class CloseConnection(Exception):
    pass

...

        try:
            self.consumer.http_header(self)
        except CloseConnection:
            self.close()
            return
Overriding Asyncore’s Error Handling
The error message is printed by a method called handle_error. To change the look of the error message, you can override this in your dispatcher subclass. For example, here’s a version that prints a traditional traceback:
import traceback

class my_channel(asyncore.dispatcher_with_send):

    ...

    def handle_error(self):
        traceback.print_exc()
        self.close()

    ...
With the above lines added to the async_http class, you’ll get the following message instead:
www.bbc.co.uk done
www.example.com failed
Traceback (most recent call last):
  File "C:\py21\lib\asyncore.py", line 95, in poll
    obj.handle_read_event()
  File "C:\py21\lib\asyncore.py", line 383, in handle_read_event
    self.handle_read()
  File "http_client.py", line 77, in handle_read
    self.consumer.feed(data)
  File "my-test-program.py", line 15, in feed
    if self.file is None:
AttributeError: file_consumer instance has no attribute 'file'
online.effbot.org done
www.scripting.com done
Parsing RSS Files
As shown in the first article, an RSS file contains summary information about a (portion of a) site, including a list of current news items.
For both the channel itself and the items, the RSS file can contain a title, a link to an HTML page, and a description field:
<rss version="0.91">
  <channel>
    <title>the eff-bot online</title>
    <link>http://online.effbot.org</link>
    <description>Fredrik Lundh's clipbook.</description>
    <language>en-us</language>
    ...
    <item>
      <title>spam, spam, spam</title>
      <link>http://online.effbot.org#85292735</link>
      <description>for the first seven months of 2002, the spam
      filters watching fredrik@pythonware.com has</description>
    </item>
    ...
  </channel>
</rss>
Note that the item elements are stored as child elements to the channel element. Both the channel element and the individual item elements may contain additional subelements, including the language element present in this example. We’ll look at some additional elements in a later article; for now, we’re only interested in the three basic elements.
XML Parsers
To parse an XML-based format like RSS, you need an XML parser. Python provides several ways to parse XML data, including the standard xmllib module which is a simple event-driven XML parser, the pyexpat parser and other components provided in the standard xml package, the PyXML extension library, and many others.
For the first version of the RSS parser, we’ll use the xmllib parser. You can plug in another parser if you need more features or better performance (and as you’ll see, chances are that you’ll need more, or at least different, features; more on this in a later article).
The xmllib parser works pretty much like the asyncore dispatcher; the module provides a parser base class that processes incoming data, and calls methods for different “XML events”. To handle the events, you should subclass the parser class, and implement methods for the events you need to deal with.
For the RSS parser, you need to implement the following methods:
- start_TAG is called when the start tag (<TAG …>) for an element called TAG is found. The handler is called with a single argument, which is a dictionary containing the element attributes, if any.
- end_TAG is called when the end tag (</TAG>) for an element called TAG is found.
- handle_data is called for text between the elements (so-called character data). This handler is called with a single argument, a string containing the text. This method may be called more than once for any given character data segment.
For example, when parsing this XML fragment…
"<title>Odds & Ends</title>\n"
…the xmllib parser will call the following methods:
self.start_title({})
self.handle_data("Odds ")
self.handle_data("&")
self.handle_data(" Ends")
self.end_title()
self.handle_data("\n")
Note that standard XML character entities like &amp; are decoded by the parser, and are passed to the handle_data method as ordinary character data.
If start or end handlers are missing for elements that appear in the XML document, the corresponding start or end tags are silently ignored by the parser (but character data inside the element is still passed to handle_data).
Here’s a minimal test program that implements a character data handler, and start and end tag handlers for the three RSS elements we’re interested in:
import xmllib

class rss_parser(xmllib.XMLParser):

    data = ""

    def start_title(self, attr):
        self.data = ""
    def end_title(self):
        print "TITLE", repr(self.data)

    def start_link(self, attr):
        self.data = ""
    def end_link(self):
        print "LINK", repr(self.data)

    def start_description(self, attr):
        self.data = ""
    def end_description(self):
        print "DESCRIPTION", repr(self.data)

    def handle_data(self, data):
        self.data = self.data + data

import sys

file = open(sys.argv[1])

parser = rss_parser()
parser.feed(file.read())
parser.close()
Note that the start methods set the data member to an empty string, the handle_data method adds text to that string, and the end handlers print out the string.
Also note that you pass in the raw RSS data to the parser’s feed method, and call the close method when you’re done.
Here’s some sample output from this script (using the BBC newsfeed we downloaded earlier):
$ python rss-test.py www.bbc.co.uk.rss
TITLE 'BBC News | Front Page'
LINK 'http://news.bbc.co.uk/go/rss/-/1/hi/default.stm'
DESCRIPTION 'Updated every minute of every day'
TITLE 'BBC News Online'
LINK 'http://news.bbc.co.uk'
TITLE 'Blair and Bush talk tough on Iraq\r\n'
LINK 'http://news.bbc.co.uk/go/rss/-/1/hi/world/middle_east/2243684.stm'
DESCRIPTION 'British PM Tony Blair says he has a "shared strategy" ...
TITLE "Al-Qaeda 'plotted nuclear attacks'"
LINK 'http://news.bbc.co.uk/go/rss/-/1/hi/world/middle_east/2244146.stm'
DESCRIPTION 'Two alleged masterminds of the 11 September attacks ...
TITLE "Rix: 'Scum' will profit from Tube"
LINK 'http://news.bbc.co.uk/go/rss/-/1/hi/uk_politics/2244076.stm'
DESCRIPTION 'Train drivers\' union leader Mick Rix says profits ...
TITLE 'Ex-arms inspector defends Baghdad'
LINK 'http://news.bbc.co.uk/go/rss/-/1/hi/world/middle_east/2243627.stm'
DESCRIPTION 'Scott Ritter\xb8 once head of UN inspectors in Iraq\xb8 ...
TITLE 'Police warning as flash floods hit city'
LINK 'http://news.bbc.co.uk/go/rss/-/1/hi/scotland/2244003.stm'
DESCRIPTION 'People are advised not to travel to Inverness after ...
The first title/link/description combination contains information about the site, the others contain information about individual items.
(Note that there are extra title and link values in the first section. If you look in the source RSS file, you’ll notice that they come from an extra image element, which we can safely ignore for the moment.)
To get a usable RSS parser, all you have to do is to add some logic that checks where in the file we are, and adds element values to the right data structure.
In the following example, the element handlers update a common current dictionary attribute, which is set to point to either the channel information dictionary, or a dictionary for each item (stored in the items list). This version also does some very basic syntax checking.
import xmllib

class ParseError(Exception):
    pass

class rss_parser(xmllib.XMLParser):

    def __init__(self):
        xmllib.XMLParser.__init__(self)
        self.rss_version = None
        self.channel = None
        self.current = None
        self.data_tag = None
        self.data = None
        self.items = []

    # stuff to deal with text elements

    def _start_data(self, tag):
        if self.current is None:
            raise ParseError("%s tag not in channel or item element" % tag)
        self.data_tag = tag
        self.data = ""

    def handle_data(self, data):
        if self.data is not None:
            self.data = self.data + data

    # cdata sections are handled as any other character data
    handle_cdata = handle_data

    def _end_data(self):
        if self.data_tag:
            self.current[self.data_tag] = self.data or ""

    # main rss structure

    def start_rss(self, attr):
        self.rss_version = attr.get("version")

    def start_channel(self, attr):
        if self.rss_version is None:
            raise ParseError("not a valid RSS 0.9x file")
        self.current = {}
        self.channel = self.current

    def start_item(self, attr):
        if self.rss_version is None:
            raise ParseError("not a valid RSS 0.9x file")
        self.current = {}
        self.items.append(self.current)

    # content elements

    def start_title(self, attr):
        self._start_data("title")
    end_title = _end_data

    def start_link(self, attr):
        self._start_data("link")
    end_link = _end_data

    def start_description(self, attr):
        self._start_data("description")
    end_description = _end_data
The _start_data and _end_data methods are used to switch on and off character data processing in handle_data.
Here’s a test script, which prints each item to standard output (via the end_item method).
import rss_parser, string, sys

class my_rss_parser(rss_parser.rss_parser):

    def end_item(self):
        item = self.items[-1]
        print string.strip(item.get("title") or "")
        print item.get("link")
        print item.get("description")
        print

for filename in sys.argv[1:]:
    file = open(filename)
    try:
        parser = my_rss_parser()
        parser.feed(file.read())
        parser.close()
    except:
        print "=== cannot parse %s:" % filename
        print "===", sys.exc_type, sys.exc_value
Incremental parsing
The above example reads the entire XML document from disk, and passes it to the parser in one go. The xmllib library also supports incremental parsing, allowing you to pass in XML fragments as you receive them. Just keep calling the feed method, and make sure to call close when you’re done. The parser framework will take care of the rest.
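For example, an incremental parsing loop might look something like this sketch, where get_next_block is a hypothetical function that returns the next chunk of RSS data, and an empty string at the end:

parser = rss_parser()

while 1:
    data = get_next_block() # hypothetical data source
    if not data:
        break
    parser.feed(data) # parse what we have so far

parser.close() # no more data; finish up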
This feature is of course a perfect match for the http_client class we developed in the first article; by plugging in a parser instance as the consumer, you can parse RSS items as they arrive over the network.
The following script provides an http_rss_parser class that adds the required http_header and http_failed methods to the parser, and uses an end_item handler to print incoming items:
import rss_parser, string
import http_client

class http_rss_parser(rss_parser.rss_parser):

    def http_header(self, client):
        if (client.status[1] != "200" or
            client.header["content-type"] != "text/xml"):
            raise http_client.CloseConnection
        self.host = client.host

    def http_failed(self, client):
        pass

    def end_item(self):
        item = self.items[-1]
        print " ", string.strip(item.get("title") or ""),
        print "[%s]" % self.host
        print " ", string.strip(item.get("link") or "")
        print
        print item.get("description")
        print
Here’s a driver script that reads a list of URLs from a text file named channels.txt, and fires up one asynchronous client for each channel.
import asyncore, http_client

file = open("channels.txt")

for url in file.readlines():
    url = url.strip()
    if url:
        http_client.do_request(url, http_rss_parser())

asyncore.loop()
The output is a list of titles, links, and descriptions. Here’s an excerpt:
  Blair defiant over Iraq [www.bbc.co.uk]
  http://news.bbc.co.uk/go/rss/-/1/hi/uk_politics/2247366.stm

Prime Minister Tony Blair confronts his trade union critics ...

  arrgh! [online.effbot.org]
  http://online.effbot.org#85432883

"Kom kom nu hit min vän, för glädjen blir större när man delar ...

  Buffet killers [www.kottke.org]
  http://www.kottke.org/02/09/020910buffet_kille.html

We're in Las Vegas and it's buffet time. It's always buffet ...
Note: When I write this, the www.scripting.com channel has just switched to something that appears to be an experimental version of Dave Winer’s RSS 2.0, which moves all RSS tags into a default namespace. The xmllib parser always takes the namespace into account, so it won’t find a single thing in that channel. Hopefully, this will be fixed in the not too distant future.
That’s all for today.
In the next article, we’ll look at what happens if you add dozens or hundreds of channels to the channels.txt file, and discuss how to deal with that. We’ll also build a simple RSS viewer using the Tkinter library.
In the meantime, if you’re running Unix, and are using a modern mail client that highlights URLs embedded in text mails, you can mail yourself the output from this program and let your mail reader do the rest:
$ python getchannels.py | mail -s effnews yourself
EffNews Part 3: Displaying RSS Data

Storing Channel Lists
In the previous article, we ended up creating a simple utility that downloads a number of channels, parses their content, and writes titles, links and descriptions to the screen as plain text. The list of channels to read is stored in a text file, channels.txt.
Other RSS tools use a variety of file formats to store channel lists. One popular format is OPML (Outline Processor Markup Language) (dead link), which is a simple XML-based format. An OPML file contains a head element that stores information about the OPML file itself, and a body element that holds a number of outline elements.
Each outline element can have any number of attributes. Common attributes include type (how to interpret other attributes) and text (what to display for this node in an outline viewer). Outline elements can be nested.
When storing RSS channels, the type attribute is set to rss, and channel information is stored in the title and xmlUrl attributes. Here’s an example:
<opml version="1.0">
  <body>
    <outline type="rss" title="bbc news"
      xmlUrl="http://www.bbc.co.uk/syndication/feeds/news/ukfs_news/front_page/rss091.xml" />
    <outline type="rss" title="effbot.org"
      xmlUrl="http://online.effbot.org/rss.xml" />
    <outline type="rss" title="scripting news"
      xmlUrl="http://www.scripting.com/rss.xml" />
    <outline type="rss" title="mark pilgrim"
      xmlUrl="http://diveintomark.org/xml/rss2.xml" />
    <outline type="rss" title="jason kottke"
      xmlUrl="http://www.kottke.org/index.xml" />
    <outline type="rss" title="example"
      xmlUrl="http://www.example.com/rss.xml" />
  </body>
</opml>
Parsing OPML #
You can use the xmllib library to extract channel information from OPML files. The following parser class looks for outline tags, and collects titles and channel URLs from the attributes. (Note that the parser looks for both xmlUrl and xmlurl attributes; both names are used in the documentation and samples I’ve seen.)
import xmllib

class ParseError(Exception):
    pass

class opml_parser(xmllib.XMLParser):

    def __init__(self):
        xmllib.XMLParser.__init__(self)
        self.channels = []

    def start_opml(self, attr):
        if attr.get("version", "1.0") != "1.0":
            raise ParseError("unknown OPML version")

    def start_outline(self, attr):
        channel = attr.get("xmlUrl") or attr.get("xmlurl")
        if channel:
            self.add_channel(attr.get("title"), channel)

    def add_channel(self, title, channel):
        # can be overridden
        self.channels.append((title, channel))

def load(file):
    file = open(file)
    parser = opml_parser()
    parser.feed(file.read())
    parser.close()
    return parser.channels
The load function feeds the content of an OPML file through the parser, and returns a list of (title, channel URL) pairs.
Here’s a simple script that uses the http_rss_parser class from the second article to fetch and render all channels listed in the channels.opml file:
import asyncore, http_client, opml_parser

channels = opml_parser.load("channels.opml")

for title, uri in channels:
    http_client.do_request(uri, http_rss_parser())

asyncore.loop()
Managing Downloads
You can find RSS channel collections in various places on the web, such as NewsIsFree and Syndic8 (dead link). These sites have links to thousands of RSS channels from a wide variety of sources.
Most real people probably use a dozen feeds or so, but someone like the pirate Pugg (“For I am not your usual uncouth pirate, but refined and with a Ph.D., and therefore extremely high-strung”) would most likely want to subscribe to every feed under the sun. What would happen if he tried?
If you pass an OPML file containing a thousand feeds to the previous script, it will happily issue a thousand socket requests. Exactly what happens depends on your operating system, but it’s likely that it will run out of resources at some point (if you decide to try this out on your favourite platform, let me know what happens).
To avoid this problem, you can add requests to a queue, and make sure you never create more sockets than your computer can handle (leaving some room for other applications is also a nice thing to do).
Limiting the number of simultaneous connections
Here’s a simple manager class that never creates more than a given number of sockets:
import asyncore
import http_client

class http_manager:

    max_connections = 4

    def __init__(self):
        self._queue = []

    def request(self, uri, consumer):
        self._queue.append((uri, consumer))

    def poll(self, timeout=0.1):
        # activate up to max_connections channels
        while self._queue and len(asyncore.socket_map) < self.max_connections:
            http_client.do_request(*self._queue.pop(0))
        # keep the network running
        asyncore.poll(timeout)
        # return non-zero if we should keep polling
        return len(self._queue) or len(asyncore.socket_map)
In this class, the request method adds URLs and consumer instances to an internal queue. The poll method makes sure that at most max_connections asyncore objects are active (asyncore keeps references to active sockets in the socket_map variable).
To use the manager, all you have to do is to create an instance of the http_manager class, call the request method for each channel you want to fetch, and keep calling the poll method over and over again to keep the network traffic going:
manager = http_manager.http_manager()

manager.request(url, consumer)

while manager.poll(1):
    pass
Limiting the size of an RSS file
You can also use the manager for other purposes. For example, to prevent denial-of-service attacks from malicious (or confused) RSS providers, you can use the http client’s byte counters, and simply kill the socket if it has processed more than a given number of bytes:
max_size = 1000000 # bytes

for channel in asyncore.socket_map.values():
    if channel.bytes_in > max_size:
        channel.close()
Timeouts
Another useful feature is a time limit; instead of checking the byte counter, you can check the client’s timestamp attribute (set when the request is issued), and compare it to the current time:
import time

max_time = 30 # seconds

now = time.time()

for channel in asyncore.socket_map.values():
    if now - channel.timestamp > max_time:
        channel.close()
And of course, nothing stops you from checking both the size and the elapsed time in the same loop:
now = time.time()

for channel in asyncore.socket_map.values():
    if channel.bytes_in > max_size:
        channel.close()
    if now - channel.timestamp > max_time:
        channel.close()
Building a Simple User Interface
Okay, enough infrastructure. It’s time to start working on something that ordinary humans might be willing to use: a nice, welcoming, easy-to-use graphical front-end.
Introducing Tkinter
The Tkinter library (dead link) provides a number of portable building blocks for graphical user interfaces. Code written for Tkinter runs, usually without any changes, on systems based on Windows, Unix (and Linux), as well as on the Macintosh.
The most important building blocks provided by Tkinter are the standard widgets. The term widget is used both for a piece of code that may control a region of the screen (a widget class) and a specific region controlled by that code (a widget instance). Tkinter provides about a dozen standard widgets, such as labels, input fields, and list boxes, and it’s also relatively easy to create new custom widgets.
In Tkinter, each widget is represented by a Python class. When you create an instance of that class, the Tkinter layer will create a corresponding widget and display it on the screen.
Each Tkinter widget must have a parent widget, which “owns” the widget. When the parent is moved, the child widget also moves. When the parent is destroyed, the child widget is destroyed as well.
Here’s an example:
from Tkinter import *

root = Tk()
root.title("example")

widget = Label(root, text="this is an example")
widget.pack()

mainloop()
This script creates a root window by calling the Tk widget constructor. It then calls the title method to set the window title, and uses the Label widget constructor to add a text label to the window. Note that the parent widget is passed in as the first argument, and that keyword arguments are used to specify the text.
The script then calls the pack method. This is a special method that tells Tkinter to display the label widget inside its parent (the root window, in this case), and to make the parent large enough to hold the label.
Finally, the script calls the mainloop function. This function starts an event loop that looks for events from the window system. This includes events like key presses, mouse actions, and drawing requests, which are passed on to the widget implementation.
For more information on Tkinter, see An Introduction to Tkinter and the other documentation (dead link) available from python.org.
Prototyping the EffNews application window
For the first prototype, let’s use a standard two-panel interface, with a list of channels to the left, and the contents of the selected channel in a larger panel to the right.
The Tkinter library provides a standard Listbox widget that can be used for the channel list. This widget displays a number of text strings, and lets you select one item from the list (or many, depending on how the widget is configured).
To render the contents, it would be nice if we could render the title on a line in a distinct font, followed by the description in a more neutral style. Something like this:
High hopes for new Wembley
FA chief Adam Crozier says the new Wembley will be the best stadium in the world.
Archer moved from open prison
Lord Archer is being moved from his open prison after breaking its rules by attending a lunch party during a home visit.
For this purpose, you can use the Text widget. This widget allows you to display text in various styles, and it takes care of things like word wrapping and scrolling. (The Text widget can also be used as a full-fledged text editor, but that’s outside the scope for this series. At least right now.)
Before you start creating widgets, the newsreader script will need to do some preparations. The first part imports Tkinter and a few other modules, creates a download manager instance, and parses an OPML file to get the list of channels to load:
from Tkinter import *

import sys
import http_manager, opml_parser

manager = http_manager.http_manager()

if len(sys.argv) > 1:
    channels = opml_parser.load(sys.argv[1])
else:
    channels = opml_parser.load("channels.opml")
Note that you can pass in the name of an OPML file on the command line (sys.argv[0] is the name of the program, sys.argv[1] the first argument). If you leave out the file name, the script loads the channels.opml file.
The next step is to create the root window. At the top of the window, add a Frame widget that will act like a toolbar. The frame is an empty widget, which may have a background colour and a border, but no content of its own. Frames are mostly used to organize other widgets, like the buttons on the toolbar.
root = Tk()
root.title("effnews")
toolbar = Frame(root)
toolbar.pack(side=TOP, fill=X)
The toolbar is packed towards the top of the parent widget (the root window). The fill option tells the packer to make the widget as wide as its parent (instead of X, you can use Y to make it as high as the parent, and BOTH to fill in both directions).
For now, the only thing we’ll have in the toolbar is a reload button. When you click this button, the schedule_reloading function adds all channels to the manager queue.
def schedule_reloading():
    for title, channel in channels:
        manager.request(channel, http_rss_parser(channel))

b = Button(toolbar, text="reload", command=schedule_reloading)
b.pack(side=LEFT)
Here, the button is packed against the left side of the parent widget (the toolbar, not the root window). The command option is used to call a Python function when the button is pressed.
The http_rss_parser class used here is a variant of the consumer class with the same name that you’ve used earlier. It should parse RSS data, and store the incoming items somewhere. We’ll get to the code for this class in a moment.
Next, we’ll add a Tkinter Listbox widget, and fill it with channel titles. The listbox is packed against the left side of the parent widget, under the toolbar (which was packed before the listbox).
channel_listbox = Listbox(root, background="white")
channel_listbox.pack(side=LEFT, fill=Y)

for title, channel in channels:
    # load listbox
    channel_listbox.insert(END, title)

def select_channel(event):
    selection = channel_listbox.curselection()
    if selection:
        selection = int(selection[0])
        title, channel = channels[selection]
        update_content(channel)

channel_listbox.bind("<Double-Button-1>", select_channel)
The select_channel function is used to display the contents of a channel in the Text widget. The curselection method returns the indexes of all selected items. The indexes work like Python list indexes, but they are returned as strings. If the list is not empty (that is, if at least one item is selected), the index is converted to an integer, and used to get the channel URL from the channels list. The update_content function displays that channel in the text widget; we’ll get back to this function later in this article.
The bind call, finally, sets things up so that the select_channel function is called when the user double-clicks on an item in the listbox.
To complete the user interface, create a text widget for the channel contents. The widget is packed against the top of the remaining space in the parent widget (it ends up under the toolbar, and to the right of the listbox). The fill option is used to make it fill the entire space, and the expand option tells Tkinter that if the user resizes the application window, the text widget gets any extra space.
content_pane = Text(root, wrap=WORD)
content_pane.pack(side=TOP, fill=BOTH, expand=1)

content_pane.tag_config("head", font="helvetica 12 bold", foreground="blue")
content_pane.tag_config("body", font="helvetica 10")

mainloop()
The tag_config methods are used to define styles to use in the text widget. Here, we define two styles: text using the head style is drawn in a 12-point bold Helvetica font, and coloured blue; text using the body style is drawn in a smaller Helvetica font, using the default colour.
That’s it.
Almost. You also need to implement the http_rss_parser parser and the update_content function.
Let’s start with the parser.
Storing the channel items
You can reuse the http_rss_parser classes from the previous article pretty much right away. All you have to do is to put the channel items somewhere, so they can be found by the update_content function.
The following example adds a channel identifier (the URL) as an object attribute, and uses it to store the collected items in a global dictionary when it reaches the end of the file. If the identifier matches the current_channel variable, it also calls the update_content function.
items = {}

class http_rss_parser(rss_parser.rss_parser):

    def __init__(self, channel):
        rss_parser.rss_parser.__init__(self)
        self._channel = channel

    def http_header(self, client):
        if (client.status[1] != "200" or
            client.header["content-type"] != "text/xml"):
            raise http_client.CloseConnection

    def http_failed(self, client):
        pass

    def end_rss(self):
        items[self._channel] = self.items
        if self._channel == current_channel:
            update_content(self._channel) # update display
Displaying channel items
The next piece of the puzzle is the update_content function. This function takes a channel identifier (the URL), and displays the items in the text window.
current_channel = None

def update_content(channel):
    global current_channel
    current_channel = channel
    # clear the text widget
    content_pane.delete(1.0, END)
    if not items.has_key(channel):
        content_pane.insert(END, "channel not loaded")
        return
    # add news items to the text widget
    for item in items[channel]:
        title = item.get("title")
        if title:
            content_pane.insert(END, title.strip() + "\n", "head")
        description = item.get("description")
        if description:
            content_pane.insert(END, description.strip() + "\n", "body")
        content_pane.insert(END, "\n")
The global current_channel variable keeps track of what’s currently displayed in the text widget. It is used by the parser class to update the widget, if the channel is being displayed.
Data may be missing from the items dictionary, either because the parser hasn’t finished yet, or because the channel could not be read or parsed. In this case, the function displays the text channel not loaded and returns. Otherwise, it loops over the items, and adds the titles and descriptions to the text widget. The third argument to insert is the style name.
Keeping the network traffic going
If you put the pieces together, you’ll find that the program is almost working. It creates the widgets and displays them, loads the channels into the listbox, and schedules a number of http requests. But that’s all that happens; the requests never finish.
To fix this, you need to call the poll method of the download manager at regular intervals. The Tkinter library contains a convenient timer mechanism that you can use for this purpose; the after method is used to register a callback that will be called after a given period of time (given in milliseconds).
The following code sets things up so that the network will be polled about 10 times a second. It also schedules all channels for loading when the application is started, and selects the first item in the listbox before entering the Tkinter mainloop.
import traceback

# schedule all channels for loading
schedule_reloading()

def poll_network(root):
    try:
        manager.poll(0.1)
    except:
        traceback.print_exc()
    root.after(100, poll_network, root)

# start polling the network
poll_network(root)

# display the first channel, if there is one
if channels:
    channel_listbox.select_set(0)
    update_content(channels[0][1])

# start the user interface
mainloop()
Putting it all together
For your convenience, here’s the final script:
from Tkinter import *

import http_client, http_manager
import opml_parser
import rss_parser

import sys, traceback

#
# item database

items = {}

#
# parse channels, and store item lists in the global items dictionary

class http_rss_parser(rss_parser.rss_parser):

    def __init__(self, channel):
        rss_parser.rss_parser.__init__(self)
        self._channel = channel

    def http_header(self, client):
        if (client.status[1] != "200" or
            client.header["content-type"] != "text/xml"):
            raise http_client.CloseConnection

    def http_failed(self, client):
        pass

    def end_rss(self):
        items[self._channel] = self.items
        if self._channel == current_channel:
            update_content(self._channel) # update display

#
# globals

manager = http_manager.http_manager()

if len(sys.argv) > 1:
    channels = opml_parser.load(sys.argv[1])
else:
    channels = opml_parser.load("channels.opml")

#
# create the user interface

root = Tk()
root.title("effnews")

#
# toolbar

toolbar = Frame(root)
toolbar.pack(side=TOP, fill=X)

def schedule_reloading():
    for title, channel in channels:
        manager.request(channel, http_rss_parser(channel))

b = Button(toolbar, text="reload", command=schedule_reloading)
b.pack(side=LEFT)

#
# channels

channel_listbox = Listbox(root, background="white")
channel_listbox.pack(side=LEFT, fill=Y)

def select_channel(event):
    selection = channel_listbox.curselection()
    if selection:
        selection = int(selection[0])
        title, channel = channels[selection]
        update_content(channel)

channel_listbox.bind("<Double-Button-1>", select_channel)

for title, channel in channels:
    channel_listbox.insert(END, title)

#
# content panel

content_pane = Text(root, wrap=WORD)
content_pane.pack(side=TOP, fill=BOTH, expand=1)

content_pane.tag_config("head", font="helvetica 12 bold", foreground="blue")
content_pane.tag_config("body", font="helvetica 10")

current_channel = None

def update_content(channel):
    global current_channel
    current_channel = channel
    # clear the text widget
    content_pane.delete(1.0, END)
    if not items.has_key(channel):
        content_pane.insert(END, "channel not loaded")
        return
    # add news items to the text widget
    for item in items[channel]:
        title = item.get("title")
        if title:
            content_pane.insert(END, title.strip() + "\n", "head")
        description = item.get("description")
        if description:
            content_pane.insert(END, description.strip() + "\n", "body")
        content_pane.insert(END, "\n")

#
# get going

schedule_reloading()

def poll_network(root):
    try:
        manager.poll(0.1)
    except:
        traceback.print_exc()
    root.after(100, poll_network, root)

poll_network(root)

if channels:
    channel_listbox.select_set(0)
    update_content(channels[0][1]) # display first channel

mainloop()
If you run this script on the sample channels.opml file from the beginning of this article, you’ll get a two-panel window, with the channel list to the left and the content pane to the right.
The first channel is selected, and if everything goes well, the channel contents will appear in the window after a second or so. To display any other channel, double-click on the channel title in the listbox.
If the text won’t fit in the text widget, you can scroll the text by pressing the mouse pointer inside the widget and dragging up or down. (We’ll add scrollbars in the next article.)
To refresh the contents, click the reload button. All channels will be loaded from the servers, and the items listing in the text widget will be updated.
About the sample channels
The sample channels.opml file contains six channels. Only three of them are properly rendered by the current prototype.
The bbc news, effbot.org, and kottke channels all use the RSS 0.9x file format. However, as you may notice, the bbc news channel is the only one that works flawlessly.
The effbot.org channel is generated by the Blogger Pro tool, which has a tendency to mess up on non-US character encodings. Since some articles are written in Swedish, using ISO Latin-1 characters, you may find that the XML parser chokes on the contents. Blogger is also known to generate bad output if the source uses XML character entities. To deal with broken feeds like this, you need a more robust RSS parser.
The kottke channel is in better shape (possibly because he’s not using odd European characters), but you may find that the description contains strange line endings and strange little boxes. The line endings are probably copied verbatim from the site’s source code; web browsers usually don’t care about line endings. And the boxes are carriage return characters that are also copied as-is from the source code. Getting rid of the line feeds and the bogus whitespace characters should be straightforward.
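For example, a small helper along the following lines (a sketch, not part of the original code) collapses all whitespace runs, including stray carriage returns and line feeds, into single spaces:

import string

def normalize_whitespace(text):
    # split on runs of whitespace (including \r and \n), and
    # join the pieces back together with single spaces
    return string.join(string.split(text))

In the update_content function, you could then insert normalize_whitespace(description) instead of description.strip().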
The pilgrim feed uses the new RSS 2.0 format. RSS 2.0 is an extension to the 0.9x format that’s supposed to be fully backwards compatible, and the feed renders just fine in the current prototype.
The scripting news feed also uses the RSS 2.0 format, but it places all tags in an undocumented default namespace (http://backend.userland.com/rss2). As a result, the current prototype parser won’t find a single thing in that feed. (And as expected, all attempts to find out if this is a problem with the feed or with the documentation have failed. But that’s another story.)
The example channel, finally, contains a bogus URL, and results in a channel not loaded message. This is of course exactly what’s supposed to happen.
In the next article, we’ll continue working on the prototype, trying to turn it into a more useful and more robust application. We’ll look at ways to deal with possibly broken channels, such as the effbot.org and scripting news feeds.
EffNews Part 4: Parsing More RSS Files
Parsing RSS 2.0 Files With Undocumented Namespaces
“In general, an implementation must be conservative in its sending behavior, and liberal in its receiving behavior. That is, it must be careful to send well-formed datagrams, but must accept any datagram that it can interpret (e.g., not object to technical errors where the meaning is still clear).” — RFC 791, Internet Protocol, September 1981, Jon Postel (ed).
The RSS 2.0 standard (where RSS stands for “Really Simple Syndication”) is an attempt to upgrade the RSS 0.9x version of the format. It adds a number of new fields, designed with interactive RSS aggregators in mind, and also adds support for RSS extensions through custom namespaces.
As I’ve mentioned earlier, RSS 2.0 files come in several flavours. Some providers strictly adhere to the RSS 2.0 specification (dead link) and produce feeds where all the core elements (rss, item, title, etc.) live outside any XML namespace. This is a good thing, since it allows us to parse them with the same parser as we used for 0.9x; even if the version attribute on the rss element says 2.0, the parser will see an undecorated item tag, and will call the start_item handler.
Unfortunately, some tools generate RSS 2.0 feeds with all elements moved into a namespace. What’s worse, the RSS 2.0 specification doesn’t mention this namespace at all, and provides very few clues as to how to deal with the presence or absence of namespaces on the core RSS elements.
For example, the scripting news feed contains the following declarations at the top:
<rss version="2.0"
     xmlns="http://backend.userland.com/rss2"
     xmlns:blogChannel="http://backend.userland.com/blogChannelModule">
  <channel>
    <title>Scripting News</title>
    <link>http://www.scripting.com/</link>
    ...
The xmlns="http://backend.userland.com/rss2" attribute provides a default namespace. This means that unless you specify otherwise, the rss element and all its children will be assumed to belong to the http://backend.userland.com/rss2 namespace. Since our code looks for undecorated element names, the parser won’t find a thing.
(For more about how namespaces really work, I recommend James Clark’s XML Namespaces tutorial.)
Ignoring all namespaces
The xmllib library provides a fallback mechanism that can be used to deal with unknown elements. When the parser finds an element for which there is no start handler, it calls the unknown_starttag method with the element tag and a dictionary containing the attributes. Likewise, when an element ends and there’s no end handler, the parser calls the unknown_endtag method.
To see this in action, you can add stub versions of these methods to the rss_parser class, and run it on an RSS 2.0 feed:
class rss_parser(xmllib.XMLParser):

    ...

    def unknown_starttag(self, tag, attrib):
        print "START", repr(tag)
        if attrib:
            print attrib

    def unknown_endtag(self, tag):
        print "END", repr(tag)

    ...
Running this on the scripting news feed results in something like:
START 'http://backend.userland.com/rss2 rss'
{'http://backend.userland.com/rss2 version': '2.0'}
START 'http://backend.userland.com/rss2 channel'
START 'http://backend.userland.com/rss2 title'
END 'http://backend.userland.com/rss2 title'
START 'http://backend.userland.com/rss2 link'
END 'http://backend.userland.com/rss2 link'
START 'http://backend.userland.com/rss2 description'
END 'http://backend.userland.com/rss2 description'
...
START 'http://backend.userland.com/rss2 item'
START 'http://backend.userland.com/rss2 description'
END 'http://backend.userland.com/rss2 description'
START 'http://backend.userland.com/rss2 pubDate'
END 'http://backend.userland.com/rss2 pubDate'
START 'http://backend.userland.com/rss2 guid'
END 'http://backend.userland.com/rss2 guid'
END 'http://backend.userland.com/rss2 item'
...
As you can see, the xmllib parser combines the namespace string with the tag name into a single string, using a single space to separate the two parts.
One easy way to deal with the RSS 2.0 confusion is to ignore all namespaces. In the unknown handlers, you can simply split the tag name into two parts, and use the last part (known as the local part) to select the right method. Something like this could work:
def unknown_starttag(self, tag, attrib):
    try:
        namespace, tag = tag.split()
    except ValueError:
        pass # ignore this tag
    else:
        if tag == "rss":
            self.start_rss(attrib)
        elif tag == "channel":
            self.start_channel(attrib)
        ... etc

def unknown_endtag(self, tag):
    try:
        namespace, tag = tag.split()
    except ValueError:
        pass # ignore this tag
    else:
        if tag == "rss":
            self.end_rss()
        elif tag == "channel":
            self.end_channel()
        ... etc
To simplify the code, you can reuse portions of xmllib‘s existing tag dispatcher. To get the standard handler for a tag name, all you have to do is to look it up in the elements dictionary. This dictionary maps tag names to (start handler, end handler) tuples. By adding the following methods to the parser class, you get a parser that ignores the namespace for all elements:
def unknown_starttag(self, tag, attrib):
    start, end = self._gethandlers(tag)
    if start:
        start(attrib)

def unknown_endtag(self, tag):
    start, end = self._gethandlers(tag)
    if end:
        end()

def _gethandlers(self, tag):
    try:
        namespace, tag = tag.split()
    except ValueError:
        pass # ignore this tag
    else:
        methods = self.elements.get(tag)
        if methods:
            return methods
    return None, None
This is almost enough to read the scripting news feed, but if you try it out, you’ll find that the parser raises a ParseError exception (not a valid RSS 0.9x file). A little more digging reveals that this exception is raised by the start_channel method, if the rss_version attribute is not set:
def start_rss(self, attr):
    self.rss_version = attr.get("version")

def start_channel(self, attr):
    if self.rss_version is None:
        raise ParseError("not a valid RSS 0.9x file")
    self.current = {}
    self.channel = self.current
If you look at the output from the stub version, you’ll notice that the attribute dictionary contains something called “http://backend.userland.com/rss2 version” instead of the version attribute we expected.
This is actually a bug in some versions of xmllib; it applies the default namespace not only to unqualified element names, but also to unqualified attribute names. When dealing with more complex formats, this bug can really get in our way, but we’re ignoring namespaces anyway in this case, so we can simply look for any attribute that has the right local part:
def start_rss(self, attr):
    self.rss_version = attr.get("version")
    if self.rss_version is None:
        # no undecorated version attribute. as a work-around,
        # just look at the local part
        for k, v in attr.items():
            if k.endswith(" version"):
                self.rss_version = v
                break
With these changes in place, we can use effnews.py to read the scripting news feed.
Almost, that is.
Compared to the other feeds, it doesn’t look quite right. Instead of a list of nice title/description items in the content pane, we get something far less friendly:
Reuters: <a href="http://www.cnn.com/2002/WORLD/meast/09/28/turkey.uranium.reut/">
Turkey seizes weapons-grade uranium</a>.

<a href="http://doc.weblogs.com/discuss/msgReader$2489?mode=day">
Phil Wolff</a>: "What would you be willing to do as a journalist
to improve your chances of getting your story listed on Google's
front page for a prime time hour?"
There are no titles, and it looks as if the feed generator is putting HTML source code in the description, instead of the plain text description other feeds are using.
Obviously, you need to add some way to filter out the HTML elements from the description field, and possibly some way to generate a title line based on other information in the feed. This is a nice topic for a later article…
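In the meantime, here’s a rough sketch of one possible approach (my own illustration, not the article’s eventual solution): strip anything that looks like a tag with a regular expression, and fabricate a title from the first few words of the cleaned-up description:

import re

def strip_html(text):
    # remove anything that looks like an HTML/XML tag; this is a
    # crude approximation, but good enough for simple feed markup
    return re.sub(r"<[^>]*>", "", text)

def make_title(description, maxwords=8):
    # fabricate a title line from the first few words of the
    # (tag-stripped) description
    words = strip_html(description).split()
    title = " ".join(words[:maxwords])
    if len(words) > maxwords:
        title = title + "..."
    return title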
Ignoring only the backend.userland.com namespace
A problem with the current namespace workaround is that we don’t really care what namespace an element is using; every item element is assumed to be an RSS item, every title element is assumed to be an RSS title, and so on. But the RSS 2.0 specification explicitly allows RSS providers to use custom namespaces to add extra information, and nothing prevents them from reusing local names already in use by the RSS 2.0 specification.
Ignoring all namespace information might work for the moment, but it’s clearly not a future-proof solution.
Luckily, all you have to do to solve this is to add a single line to the _gethandlers method:
def _gethandlers(self, tag):
    try:
        namespace, tag = tag.split()
    except ValueError:
        pass # ignore this tag
    else:
        if namespace == "http://backend.userland.com/rss2":
            methods = self.elements.get(tag)
            if methods:
                return methods
    return None, None
With this test in place, the parser will treat RSS 0.9x elements, RSS 2.0 elements without a namespace, and RSS 2.0 elements in the http://backend.userland.com/rss2 namespace as the same thing. All other elements will be ignored.
Allowing arbitrary namespaces for the core elements
The RSS 2.0 specification/sample mismatch could in fact be interpreted to mean that RSS 2.0 allows producers to use an arbitrary namespace for the RSS 2.0 elements. If I want to use http://effbot.org/schema/rss2, who can stop me?
To deal with this case, you can look at the namespace for the toplevel rss element, and allow other elements to have that namespace. Something like this might work:
rss_namespace = None

def _gethandlers(self, tag):
    try:
        namespace, tag = tag.split()
        if tag == "rss" and not self.rss_namespace:
            self.rss_namespace = namespace
    except ValueError:
        pass # ignore this tag
    else:
        if namespace == self.rss_namespace:
            methods = self.elements.get(tag)
            if methods:
                return methods
    return None, None
To quote a leading XML expert, requiring people to implement things like this would be “silly indeed”, so let’s hope that the RSS 2.0 crowd sorts this one out some day, before feed providers start doing really silly things…
Parsing RSS 1.0 Files
While we’re at it, let’s look at the third version of the RSS format. In RSS 1.0, the RSS stands for “RDF Site Summary”, where RDF stands for “Resource Description Framework”. RDF is a building block in something called the Semantic Web, which is a research project that’s likely to impact your future life in pretty much the same way as AI research has done over the last 30-40 years. But I digress.
An RSS 1.0 file is cleverly designed to look like a valid RDF file to RDF tools, and like an RSS 0.91 file to (some) RSS tools. In practice, as a feed provider, this means that people can read your feed in dozens of different RSS viewers, and use it to draw mostly meaningless graphs consisting of circles and arrows. But I digress.
Here’s an excerpt from Mark Pilgrim’s RSS 1.0 feed (which contains the same data as his 2.0 feed that we used earlier):
<rdf:RDF
  xmlns="http://purl.org/rss/1.0/"
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  ...>

  <channel rdf:about="http://diveintomark.org/">
    <title>dive into mark</title>
    <link>http://diveintomark.org/</link>
    ...
  </channel>

  <item rdf:about="http://diveintomark.org/archives/2002/09/27.html#advanced_css_lists">
    <title>Advanced CSS lists</title>
    <description>Mark Newhouse: CSS Design: Taming Lists. ... </description>
    <link>http://diveintomark.org/archives/2002/09/27.html#advanced_css_lists</link>
    <dc:subject>CSS</dc:subject>
    <dc:date>2002-09-27T23:22:56-05:00</dc:date>
  </item>
  ...

</rdf:RDF>
This feed also uses a default namespace for all core elements. However, this is a documented namespace; all RSS 1.0 files are supposed to use this namespace. If it’s not there, it’s not an RSS 1.0 file.
Update 2002-11-16: Mark just told me that he’s no longer generating RSS 1.0 feeds, so the above link will take you to an RSS 2.0 feed.
Checking for multiple namespaces isn’t that much harder than checking for a single namespace. Here’s one way to do it:
# namespaces used for standard elements by different RSS formats
RSS_NAMESPACES = (
    "http://purl.org/rss/1.0/",         # RSS 1.0
    "http://backend.userland.com/rss2", # RSS 2.0 (sometimes)
)

class rss_parser(xmllib.XMLParser):

    ...

    def _gethandlers(self, tag):
        try:
            namespace, tag = tag.split()
        except ValueError:
            pass # ignore this tag
        else:
            if namespace in RSS_NAMESPACES:
                methods = self.elements.get(tag)
                if methods:
                    return methods
        return None, None
However, if you run this on an RSS 1.0 feed, you’ll get the same ParseError exception (not a valid RSS 0.9x file) as you got when tinkering with the 2.0 feeds, and for the same reason: the rss_version attribute is never set in the start_rss method.
If you look carefully at the RSS 1.0 sample above, you’ll notice that there simply is no rss tag in the RSS 1.0 format. The root element is called RDF and lives in a www.w3.org namespace, so the start_rss handler will never be called.
There are several ways to fix this; the most obvious way is to look for the RDF start tag in the unknown_starttag handler, and set the rss_version attribute to something suitable. The downside is that if someone passes in an RDF file that doesn’t contain RSS 1.0 data, he’ll end up with an empty channel.
Another problem is that the effnews.py main application uses an end_rss handler to find out when we’re done parsing, so we have to change the parser interface as well.
And is it really a good idea to use the same code base for two radically different formats? Strictly speaking, RSS 1.0 files are RDF files, not plain XML files. Maybe we should use an RDF library to parse them, and extract the RSS information from the RDF data model? (This would also allow us to deal with feeds stored in alternative RDF representations.) But I digress.
To minimise the work, let’s settle for a compromise: we’ll keep the existing parser, and tweak it to generate the same events for an RSS 1.0 feed as it would generate for a corresponding RSS 0.9x or 2.0 feed. It turns out that this is really simple: just pretend that the RDF tag is really an rss tag without a version number, and check for some characteristic RSS 1.0 feature later on. The following example does the RDF-to-rss mapping in the _gethandlers method, and looks for the rdf:about attribute in the start_channel handler, if the version attribute wasn’t set by start_rss:
def _gethandlers(self, tag):
    # check if the tag lives in a known RSS namespace
    if tag == "http://www.w3.org/1999/02/22-rdf-syntax-ns# RDF":
        # this appears to be an RDF file. to simplify processing,
        # map this element to an "rss" element
        return self.elements.get("rss")
    try:
        namespace, tag = tag.split()
    except ValueError:
        pass # ignore
    else:
        if namespace in RSS_NAMESPACES:
            methods = self.elements.get(tag)
            if methods:
                return methods
    return None, None

...

def start_channel(self, attr):
    if self.rss_version is None:
        # no version attribute; it might still be an RSS 1.0 file.
        # check if this element has an rdf:about attribute
        if attr.get("http://www.w3.org/1999/02/22-rdf-syntax-ns# about"):
            self.rss_version = "1.0"
        else:
            raise ParseError("cannot read this RSS file")
    self.current = {}
    self.channel = self.current
Parsing RSS 0.9 Files
(Added September 30, 2002)
There’s actually one more RSS version out in the wild: the original RSS 0.9 format that Netscape created for their my.netscape.com portal. The portal still exists, but it hasn’t supported RSS feeds in a long time, and the RSS 0.9 specification is no longer available on the net. But some providers are still using this format.
Like 1.0, the RSS 0.9 format is based on RDF, but it uses a much simpler XML structure. Here’s an example:
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns="http://my.netscape.com/rdf/simple/0.9/">

  <channel>
    <title>Slashdot</title>
    <link>http://slashdot.org/</link>
    <description>News for nerds, stuff that matters</description>
  </channel>
  ...
  <item>
    <title>Undelete In Linux</title>
    <link>http://slashdot.org/article.pl?sid=02/09/30/1233220</link>
  </item>
  ...
Just like RSS 1.0, this format uses a toplevel RDF tag, and all the other tags live in a namespace. But the rest of the file looks just like your usual 0.91 feed, with titles, links, and (optional) descriptions.
(The RDF connection was removed by Netscape in a later revision, RSS 0.91.)
To add support for this format, you need to add the RSS 0.9 namespace to the RSS_NAMESPACES list. You also need to set the rss_version variable somewhere; there’s no version attribute on the root element, and the channel element doesn’t contain an rdf:about attribute. The simplest solution is to look for the Netscape namespace in the _gethandlers method:
# namespaces used for standard elements by different RSS formats
RSS_NAMESPACES = (
    "http://my.netscape.com/rdf/simple/0.9/", # RSS 0.9
    "http://purl.org/rss/1.0/",               # RSS 1.0
    "http://backend.userland.com/rss2",       # RSS 2.0 (sometimes)
)

class rss_parser(xmllib.XMLParser):

    ...

    def _gethandlers(self, tag):
        if tag == "http://my.netscape.com/rdf/simple/0.9/ channel":
            # this appears to be a my.netscape.com 0.9 file
            self.rss_version = "0.9"
        try:
            namespace, tag = tag.split()
        except ValueError:
            pass # ignore this tag
        else:
            if namespace in RSS_NAMESPACES:
                methods = self.elements.get(tag)
                if methods:
                    return methods
        return None, None
Putting It All Together
For your convenience, here’s the complete updated parser. Just drop it in over the one from the second article, and you’ll be able to read most 0.9, 1.0 and 2.0 feeds:
import xmllib

# namespaces used for standard elements by different RSS formats
RSS_NAMESPACES = (
    "http://my.netscape.com/rdf/simple/0.9/", # RSS 0.9
    "http://purl.org/rss/1.0/",               # RSS 1.0
    "http://backend.userland.com/rss2",       # RSS 2.0 (sometimes)
)

class ParseError(Exception):
    pass

class rss_parser(xmllib.XMLParser):

    def __init__(self):
        xmllib.XMLParser.__init__(self)
        self.rss_version = None
        self.channel = None
        self.current = None
        self.data_tag = None
        self.data = None
        self.items = []

    def _gethandlers(self, tag):
        # check if the tag lives in a known RSS namespace
        if tag == "http://www.w3.org/1999/02/22-rdf-syntax-ns# RDF":
            # this appears to be an RDF file. to simplify processing,
            # map this element to an "rss" element
            return self.elements.get("rss")
        if tag == "http://my.netscape.com/rdf/simple/0.9/ channel":
            # this appears to be a my.netscape.com 0.9 file
            self.rss_version = "0.9"
        try:
            namespace, tag = tag.split()
        except ValueError:
            pass # ignore this element
        else:
            if namespace in RSS_NAMESPACES:
                methods = self.elements.get(tag)
                if methods:
                    return methods
        return None, None

    def unknown_starttag(self, tag, attrib):
        start, end = self._gethandlers(tag)
        if start:
            start(attrib)

    def unknown_endtag(self, tag):
        start, end = self._gethandlers(tag)
        if end:
            end()

    # stuff to deal with text elements

    def _start_data(self, tag):
        if self.current is None:
            raise ParseError("%s tag not in channel or item element" % tag)
        self.data_tag = tag
        self.data = ""

    def handle_data(self, data):
        if self.data is not None:
            self.data = self.data + data

    handle_cdata = handle_data

    def _end_data(self):
        if self.data_tag:
            self.current[self.data_tag] = self.data or ""

    # main rss structure

    def start_rss(self, attr):
        self.rss_version = attr.get("version")
        if self.rss_version is None:
            # no undecorated version attribute. as a work-around,
            # just look at the local names
            for k, v in attr.items():
                if k.endswith(" version"):
                    self.rss_version = v
                    break

    def start_channel(self, attr):
        if self.rss_version is None:
            # no version attribute; it might still be an RSS 1.0 file.
            # check if this element has an rdf:about attribute
            if attr.get("http://www.w3.org/1999/02/22-rdf-syntax-ns# about"):
                self.rss_version = "1.0"
            else:
                raise ParseError("cannot read this RSS file")
        self.current = {}
        self.channel = self.current

    def start_item(self, attr):
        if self.rss_version is None:
            raise ParseError("cannot read this RSS file")
        self.current = {}
        self.items.append(self.current)

    # content elements

    def start_title(self, attr):
        self._start_data("title")
    end_title = _end_data

    def start_link(self, attr):
        self._start_data("link")
    end_link = _end_data

    def start_description(self, attr):
        self._start_data("description")
    end_description = _end_data
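To try the combined parser out on a saved feed, a little test driver like this should do (a sketch of mine, not from the article; the feed.xml file name is an assumption):

# hypothetical test driver for the combined parser
parser = rss_parser()
parser.feed(open("feed.xml").read())
parser.close()

print "rss version:", parser.rss_version
print "channel:", parser.channel.get("title")
for item in parser.items:
    print "-", item.get("title"), "=>", item.get("link")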
In Progress: EffNews Part 5: Odds and Ends #
This section is not complete.
Improving the RSS Support #
Supporting Non-XML Character Entities #
Many RSS feeds embed non-XML character entities in the description and title fields. This is allowed by the original 0.9 and 0.91 standards, but it’s unclear whether later standards really support this. Not that the standards matter here; feeds of all kinds use the entities, so we have to deal with them anyway.
The xmllib parser uses an entitydefs dictionary to translate entities to character strings. If an entity is not defined by this dictionary, the parser calls the unknown_entityref method. The following addition to our rss_parser class adds all standard HTML entities to the entitydefs dictionary when it’s first called, and replaces all other entities with an empty string:
class rss_parser(xmllib.XMLParser):

    ...

    htmlentitydefs = None

    def unknown_entityref(self, entity):
        if not self.htmlentitydefs:
            # lazy loading of entitydefs table
            import htmlentitydefs
            # make sure we don't overwrite entities already present in
            # the entitydefs dictionary (doing so will confuse xmllib)
            entitydefs = htmlentitydefs.entitydefs.copy()
            entitydefs.update(self.entitydefs)
            self.entitydefs = self.htmlentitydefs = entitydefs
        self.handle_data(self.entitydefs.get(entity, ""))

    ...
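For example (my own quick check, not from the article), a title containing an HTML entity now comes through as ordinary character data:

# hypothetical quick check of the entity support
p = rss_parser()
p.feed('<rss version="0.91"><channel>'
       '<title>caf&eacute; society</title></channel></rss>')
p.close()

print p.channel["title"] # prints the title with a Latin-1 e-acute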
Handling Non-ASCII Character Sets #
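One possible approach, sketched under my own assumptions (the helper and its fallback are not from the article): decode the raw feed text using the charset parameter from the HTTP Content-Type header, falling back to ISO Latin-1 when no usable charset is given.

# hypothetical helper: decode a raw feed using the charset parameter
# from the HTTP content-type header
def decode_feed(data, content_type):
    charset = "iso-8859-1" # a common default on old web servers
    for part in content_type.split(";")[1:]:
        pos = part.find("=")
        if pos >= 0:
            key, value = part[:pos], part[pos+1:]
            if key.strip().lower() == "charset":
                charset = value.strip().strip('"')
    try:
        return unicode(data, charset)
    except (LookupError, UnicodeError):
        # unknown or misdeclared charset; fall back to latin-1
        return unicode(data, "iso-8859-1", "replace")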
Handling Windows CP1252 Gremlins #
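As a stand-in sketch (mine, not the article’s): many Windows tools emit CP1252 punctuation in the 0x80-0x9F range, which ISO Latin-1 reserves for control characters; a small translation table can map the usual suspects to plain ASCII before display.

# hypothetical gremlin cleanup: map common CP1252 punctuation codes,
# which are control characters in ISO Latin-1, to plain ASCII
CP1252_GREMLINS = {
    "\x85": "...", # horizontal ellipsis
    "\x91": "'",   # left single quote
    "\x92": "'",   # right single quote
    "\x93": '"',   # left double quote
    "\x94": '"',   # right double quote
    "\x96": "-",   # en dash
    "\x97": "--",  # em dash
}

def fix_gremlins(text):
    for gremlin, replacement in CP1252_GREMLINS.items():
        text = text.replace(gremlin, replacement)
    return text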
Improving the HTTP Support #
Dealing With Different Content Types #
Using a list of feeds from Syndic8.com (dead link), I’ve tried the current RSS parser (including the entity support) on just over 2000 RSS feeds. The result isn’t very encouraging:
2010 feeds checked

137 feeds (6.8%) successfully read:

    rss 0.9:    17 feeds
    rss 0.91:   84 feeds
    rss 0.91fn:  2 feeds
    rss 0.92:   20 feeds
    rss 1.0:    10 feeds
    rss 2.0:     4 feeds
As it turns out, the problem isn’t so much the parser as the protocol layer; the current code only accepts responses if they’re using the text/xml content type. Here’s a breakdown of the feeds that returned a valid HTTP response. The following list shows the HTTP status code (200=OK) and the specified content type:
200 'text/plain; charset=utf-8': 1 feed
301 'text/html; charset=iso-8859-1': 1 feed
200 'text/html;charset=iso-8859-1': 1 feed
200 'text/xml; charset=utf-8': 1 feed
403 'text/html; charset=iso-8859-1': 1 feed
200 'text/XML': 1 feed
302 'text/html; charset=ISO-8859-1': 1 feed
200 'application/x-cdf': 1 feed
200 'application/unknown': 1 feed
200 'httpd/unix-directory': 2 feeds
200 'text/rdf': 2 feeds
200 'application/rss+xml': 2 feeds
200 'text/xml; charset=ISO-8859-1': 2 feeds
404 'text/html; charset=iso-8859-1': 3 feeds
200 'application/sgml': 3 feeds
302 'text/html; charset=iso-8859-1': 4 feeds
200 'text/html; charset=iso-8859-1': 4 feeds
200 'application/x-netcdf': 5 feeds
200 'text/plain; charset=ISO-8859-1': 7 feeds
200 'text/plain; charset=iso-8859-1': 8 feeds
200 'application/octet-stream': 10 feeds
200 'application/xml': 18 feeds
200 'text/html': 42 feeds
200 'text/xml': 191 feeds
200 'text/plain': 1660 feeds
Most feeds are returned as text/plain, and many use little-known (or unregistered) content types. The charset parameter is also somewhat common.
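An alternative to trusting the declared type at all (a sketch of mine, not what the article does next) is to sniff the payload, and accept any response whose body starts out looking like XML:

# hypothetical content sniffer: accept a response if the body looks
# like XML, whatever content type the server declared
def looks_like_xml(data):
    data = data.lstrip()
    return (data.startswith("<?xml") or
            data.startswith("<rss") or
            data.startswith("<rdf:RDF"))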
If we remove the check for content type from the http_rss_parser class, we get the following result:
class http_rss_parser(rss_parser.rss_parser):

    ...

    def http_header(self, client):
        if client.status[1] != "200":
            raise http_client.CloseConnection
1746 feeds (86.9%) successfully read:

    rss unknown:    1 feed
    rss 0.9:       55 feeds
    rss 0.91:    1623 feeds
    rss 0.91fn:     2 feeds
    rss 0.92:      22 feeds
    rss 1.0:       39 feeds
    rss 2.0:        4 feeds

That still leaves 264 feeds that cannot be read by the current parser. To figure out what (if anything) is wrong, we need to be able to extract more status information from the parser.
Handling Redirection #
class http_rss_parser(rss_parser.rss_parser):

    ...

    def http_header(self, client):
        if client.status[1].startswith("3"):
            ... redirect ...
            location = client.header["location"]
Handling Other Status Codes #
class http_rss_parser(rss_parser.rss_parser):

    ...

    def http_header(self, client):
        status = client.status[1]
        status_category = status[:1]
        if status_category == "3":
            ... redirect ...
            location = client.header["location"]
        elif status_category == "2":
            ... accept ...
        else:
            ...
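Here’s a sketch of how the redirect branch might be completed (my assumption, reusing the manager.request interface and the _channel attribute from the main application): re-issue the request for the new location, and drop the old connection.

# hypothetical completion of the redirect skeleton; lives in
# http_rss_parser, and assumes the global http_manager instance
# is reachable as 'manager', as in the main application
def http_header(self, client):
    status = client.status[1]
    status_category = status[:1]
    if status_category == "3":
        # redirected: re-request the new location with a fresh parser
        location = client.header.get("location")
        if location:
            manager.request(location, http_rss_parser(self._channel))
        raise http_client.CloseConnection
    elif status_category != "2":
        # anything but success: give up on this channel
        raise http_client.CloseConnection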
Using Conditional Fetch #
Fetching Compressed Data #
In Progress: EffNews Part 6: Using the ElementTree Module to Parse RSS Files #
This section is not complete.
This article will be based on a number of online.effbot.org postings, including:
- Parsing RSS 0.9x and 2.0 Files (dead link)
- Parsing RSS 1.0 Files (dead link)
- Parsing RSS 0.9 Files (dead link)
EffNews Addenda, Frequently Asked Questions, and Other Assorted Notes #
FAQ: Where’s the Code Archive?
October 7, 2002 | Fredrik Lundh
You can get a snapshot of the effnews #4 code base from the effbot.org downloads page.
For later additions, feel free to copy and paste from the articles (to select an entire script, triple-clicking on the first line of the script works fine in Internet Explorer).