Grabbing del.icio.us posts with Python
Fredrik Lundh | September 24, 2006 | Originally posted to online.effbot.org
The del.icio.us link management site offers a convenient JSON interface that returns a user’s most recent posts as a single JSON object. While JSON is designed for use in JavaScript environments, it turns out that the JSON produced by del.icio.us is really easy to use from Python.
For example, fetching the URL http://del.icio.us/feeds/json/effbot?raw gives you the 15 most recent additions to my del.icio.us feed as a single JSON object. With some extra linefeeds added for clarity, the object might look something like this:
    [{"u":"http://faassen.n--tree.net/blog/view/weblog/2006/02/24/0",
      "d":"Martijn Faassen: lxml and (c)ElementTree",
      "t":["python","xml","elementtree","effbot:link","date:20060224"]},
     {"u":"http://article.gmane.org/gmane.comp.python.tutor/24986",
      "d":"Danny Yoo: elementtree mini-tutorial",
      "t":["python","xml","elementtree","effbot:link","date:20050524"]},
     ...
This looks a lot like Python, of course. In fact, it’s perfectly compatible with Python’s syntax for dictionaries, lists, and ordinary strings. To convert it into a Python object, you can simply pass it to eval:
    >>> import urllib, pprint
    >>> url = "http://del.icio.us/feeds/json/effbot?raw"
    >>> pprint.pprint(eval(urllib.urlopen(url).read()))
    [{'d': 'Martijn Faassen: lxml and (c)ElementTree',
      't': ['python', 'xml', 'elementtree', 'effbot:link', 'date:20060224'],
      'u': 'http://faassen.n--tree.net/blog/view/weblog/2006/02/24/0'},
     {'d': 'Danny Yoo: elementtree mini-tutorial',
      't': ['python', 'xml', 'elementtree', 'effbot:link', 'date:20050524'],
      'u': 'http://article.gmane.org/gmane.comp.python.tutor/24986'},
     ...
    ]
Not bad. A complete del.icio.us post grabber in what’s basically one line of Python.
Well, almost complete, at least. The JSON object uses UTF-8 encoding for non-ASCII text, so to be on the safe side, you should decode the strings before using them. To deal with this, and make the data a little easier to use, you can use a wrapper to represent the individual posts:
    import urllib

    def utf8(s):
        # decode a UTF-8 encoded byte string into a unicode string
        return unicode(s, "utf-8")

    class Post(object):
        # wraps one feed item: u=link, d=title, n=description, t=tags
        def __init__(self, item):
            self.link = utf8(item["u"])
            self.title = utf8(item["d"])
            self.description = utf8(item.get("n", ""))
            self.tags = map(utf8, item["t"])

    def getposts(user):
        url = "http://del.icio.us/feeds/json/%s/?raw" % user
        return map(Post, eval(urllib.urlopen(url).read()))

    for post in getposts("effbot"):
        print post.link, post.tags
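Since eval executes whatever expression the server sends back, it is only safe for feeds you trust. As a minimal sketch of an alternative, assuming a modern Python 3 interpreter (the code in this article targets Python 2), the feed text can be parsed with json.loads from the standard json module instead. The parse_posts name and the shortened sample string are illustrative and not part of the del.icio.us interface. Because json.loads already returns Unicode strings, the explicit decoding step is not needed here.

```python
import json

class Post:
    """Wrapper for one feed item; attribute names follow the article's Post class."""
    def __init__(self, item):
        self.link = item["u"]
        self.title = item["d"]
        self.description = item.get("n", "")
        self.tags = list(item["t"])

def parse_posts(text):
    # hypothetical helper: parse the feed text as data, without evaluating it as code
    return [Post(item) for item in json.loads(text)]

# sample data in the shape shown earlier (shortened for illustration)
sample = '''[{"u":"http://article.gmane.org/gmane.comp.python.tutor/24986",
 "d":"Danny Yoo: elementtree mini-tutorial",
 "t":["python","xml","elementtree","effbot:link","date:20050524"]}]'''

posts = parse_posts(sample)
print(posts[0].link, posts[0].tags)
```

The rest of the article works the same way with either parser, since both produce lists of dictionaries.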
The del.icio.us JSON interface provides two additional features: you can fetch up to 100 posts in each request, and you can filter on individual tags or tag combinations. Here’s an enhanced version of the getposts function that takes optional tag and count arguments:
    def getposts(user, tag="", count=15):
        # a tuple of tags is joined with "+" to form a tag combination
        if isinstance(tag, tuple):
            tag = "+".join(tag)
        url = "http://del.icio.us/feeds/json/%s/%s?raw&count=%d" % (
            user, tag, count
            )
        return map(Post, eval(urllib.urlopen(url).read()))
The tag argument can be either a single string or a tuple of strings. For example, to get all my pil-related links, you can use:
    >>> for post in getposts("effbot", "pil"):
    ...     print post.link
    ...
    http://louhi.kempele.fi/~skyostil/uv/fretsonfire/
    http://effbot.python-hosting.com/milestone/pil-1.1.6-beta
    ...
To get an “official” elementtree bibliography, you can specify both effbot:link (which I’m using for bibliographic entries) and elementtree:
    >>> for post in getposts("effbot", ("effbot:link", "elementtree"), 100):
    ...     print post.link
    ...
    http://faassen.n--tree.net/blog/view/weblog/2006/02/24/0
    http://article.gmane.org/gmane.comp.python.tutor/24986
    ...
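To see which URL a given combination of arguments produces without touching the network, the formatting step can be pulled out into its own function. This is only a sketch, written for Python 3. The feed_url name is hypothetical, and clamping count to the range 1 to 100 is an assumption based on the 100-post limit mentioned above, not documented del.icio.us behaviour.

```python
def feed_url(user, tag="", count=15):
    # hypothetical helper mirroring the URL formatting done inside getposts
    if isinstance(tag, tuple):
        tag = "+".join(tag)  # tag combinations are joined with "+"
    count = max(1, min(count, 100))  # assumption: keep count within 1..100
    return "http://del.icio.us/feeds/json/%s/%s?raw&count=%d" % (user, tag, count)

print(feed_url("effbot", "pil"))
print(feed_url("effbot", ("effbot:link", "elementtree"), 100))
```

Keeping the URL construction separate from the fetch also makes it easy to check the tag handling in isolation.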
As can be seen in the raw dumps above, bibliography links also include date: tags. The following snippet sorts the list by publication date:
    posts = getposts("effbot", ("effbot:link", "elementtree"), 100)

    def getdate(post):
        # use the YYYYMMDD part of the first "date:" tag as the sort key
        for tag in post.tags:
            if tag.startswith("date:"):
                return tag[5:]
        return None

    posts.sort(key=getdate)

    for post in posts:
        print post.link
Running this gives us:
    http://www.xml.com/pub/a/2003/02/12/py-xml.html
    http://www-128.ibm.com/developerworks/library/x-matters28/
    http://www.idealliance.org/papers/dx_xml03/papers/06-02-03/06-02-03.html
    http://www.xml.com/pub/a/2004/06/30/py-xml.html
    ...
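Note that getdate returns None for posts without a date: tag. Python 2 sorts None before any string, but Python 3 refuses to compare None with a string. One way around this, sketched here for Python 3 with plain tag lists standing in for Post objects, is to fall back to an empty string so that undated entries sort first. The tag lists below are made up for illustration.

```python
def getdate(tags):
    # return the YYYYMMDD part of the first "date:" tag, or "" if there is none
    for tag in tags:
        if tag.startswith("date:"):
            return tag[5:]
    return ""

# illustrative tag lists, not real feed data
tagsets = [
    ["python", "date:20060224"],
    ["xml"],
    ["elementtree", "date:20050524"],
]
tagsets.sort(key=getdate)
print(tagsets)
```

Because the dates use a fixed-width YYYYMMDD form, comparing them as strings gives chronological order.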
Generating HTML instead is straightforward; running:
    import cgi

    print "<ul>"
    for post in posts:
        print "<li><a href='%s'>%s</a>" % (post.link, cgi.escape(post.title))
    print "</ul>"
gives us:
- Uche Ogbuji: Simple XML Processing With elementtree
- David Mertz: Process XML in Python with ElementTree
- Uche Ogbuji: Python Paradigms for XML (dead link)
- Uche Ogbuji: XML Namespaces Support in Python Tools, Part Three
- Uche Ogbuji: Practical SAX Notes
- Joseph Reagle: XML ElementTree Data Model
- Danny Yoo: elementtree mini-tutorial
- Martijn Faassen: lxml and (c)ElementTree
- Andrew Dalke: PyProtocols for output generation
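On current Python 3 releases cgi.escape is no longer available (it was removed in Python 3.8), and html.escape from the standard library fills the same role. The following sketch assumes Python 3 and uses (link, title) pairs as stand-ins for Post objects. It also escapes the link, which the code above does not, since the href value is delimited by single quotes. The render_list name and the second sample entry are made up for illustration.

```python
import html

def render_list(items):
    # items: iterable of (link, title) pairs; returns the whole list as one HTML string
    lines = ["<ul>"]
    for link, title in items:
        # html.escape also escapes ' and " by default, which matters inside href='...'
        lines.append(
            "<li><a href='%s'>%s</a></li>" % (html.escape(link), html.escape(title))
        )
    lines.append("</ul>")
    return "\n".join(lines)

print(render_list([
    ("http://article.gmane.org/gmane.comp.python.tutor/24986",
     "Danny Yoo: elementtree mini-tutorial"),
    ("http://example.com/?a=1&b=2", "Tags <b> & entities"),
]))
```

Returning a string rather than printing directly makes the output easy to write to a file or embed in a larger page.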
To simplify even more, you can move the HTML anchor code into the Post class; by adding a __str__ method, you can simply print the post object to get a link:
    class Post(object):
        def __init__(self, item):
            self.link = utf8(item["u"])
            self.title = utf8(item["d"])
            self.description = utf8(item.get("n", ""))
            self.tags = map(utf8, item["t"])
        def __str__(self):
            return "<a href='%s'>%s</a>" % (self.link, cgi.escape(self.title))

    ...

    print "<ul>"
    for post in posts:
        print "<li>", post
    print "</ul>"