Grabbing del.icio.us posts with Python

Fredrik Lundh | September 24, 2006 | Originally posted to online.effbot.org

The del.icio.us link management site offers a convenient JSON interface for fetching a user's most recent posts. While JSON is designed for use in JavaScript environments, it turns out that the JSON produced by del.icio.us is really easy to use from Python.

For example, fetching the URL http://del.icio.us/feeds/json/effbot?raw gives you the 15 most recent additions to my del.icio.us feed as a single JSON object. With some extra line breaks added for clarity, the object might look something like this:

[{"u":"http://faassen.n--tree.net/blog/view/weblog/2006/02/24/0",
  "d":"Martijn Faassen: lxml and (c)ElementTree",
  "t":["python","xml","elementtree","effbot:link","date:20060224"]},
 {"u":"http://article.gmane.org/gmane.comp.python.tutor/24986",
  "d":"Danny Yoo: elementtree mini-tutorial",
  "t":["python","xml","elementtree","effbot:link","date:20050524"]},
  ...

This looks a lot like Python, of course. In fact, it’s perfectly compatible with Python’s syntax for dictionaries, lists, and ordinary strings. To convert it into a Python object, you can simply pass it to eval (which runs whatever expression it’s given, so only do this with a source you trust):

>>> import urllib, pprint
>>> url = "http://del.icio.us/feeds/json/effbot?raw"
>>> pprint.pprint(eval(urllib.urlopen(url).read()))
[{'d': 'Martijn Faassen: lxml and (c)ElementTree',
  't': ['python', 'xml', 'elementtree', 'effbot:link', 'date:20060224'],
  'u': 'http://faassen.n--tree.net/blog/view/weblog/2006/02/24/0'},
 {'d': 'Danny Yoo: elementtree mini-tutorial',
  't': ['python', 'xml', 'elementtree', 'effbot:link', 'date:20050524'],
  'u': 'http://article.gmane.org/gmane.comp.python.tutor/24986'},
  ...
]

Not bad. A complete del.icio.us post grabber in what’s basically one line of Python.

Well, almost complete, at least. The JSON object uses UTF-8 encoding for non-ASCII text, so to be on the safe side, you should decode the strings before using them. To deal with this, and make the data a little easier to use, you can use a wrapper to represent the individual posts:

import urllib

def utf8(s):
    return unicode(s, "utf-8")

class Post(object):
    def __init__(self, item):
        self.link = utf8(item["u"])
        self.title = utf8(item["d"])
        self.description = utf8(item.get("n", ""))
        self.tags = map(utf8, item["t"])

def getposts(user):
    url = "http://del.icio.us/feeds/json/%s/?raw" % user
    return map(Post, eval(urllib.urlopen(url).read()))

for post in getposts("effbot"):
    print post.link, post.tags

The del.icio.us JSON interface provides two additional features: you can fetch up to 100 posts in each request, and you can filter on individual tags or tag combinations. Here’s an enhanced version of the getposts function that takes optional tag and count arguments:

def getposts(user, tag="", count=15):
    if isinstance(tag, tuple):
        tag = "+".join(tag)
    url = "http://del.icio.us/feeds/json/%s/%s?raw&count=%d" % (
        user, tag, count
        )
    return map(Post, eval(urllib.urlopen(url).read()))

The tag argument can be either a single string or a tuple of strings. For example, to get my most recent pil-related links, you can use:

>>> for post in getposts("effbot", "pil"):
...     print post.link
...
http://louhi.kempele.fi/~skyostil/uv/fretsonfire/
http://effbot.python-hosting.com/milestone/pil-1.1.6-beta
...

To get an “official” elementtree bibliography, you can specify both effbot:link (which I’m using for bibliographic entries) and elementtree:

>>> for post in getposts("effbot", ("effbot:link", "elementtree"), 100):
...     print post.link
...
http://faassen.n--tree.net/blog/view/weblog/2006/02/24/0
http://article.gmane.org/gmane.comp.python.tutor/24986
...
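
To see what the calls above actually request, it can help to separate the URL construction from the fetching. The sketch below, written for Python 3, mirrors the URL layout used by getposts. The name build_feed_url is hypothetical, and the range check on count is an added guard based only on the 100-post limit mentioned earlier:

```python
def build_feed_url(user, tag="", count=15):
    # hypothetical helper; same URL layout as the getposts function above
    if isinstance(tag, tuple):
        # tag combinations are joined with "+"
        tag = "+".join(tag)
    if not 1 <= count <= 100:
        # guard added for this sketch: the feed is described as returning
        # at most 100 posts per request
        raise ValueError("count must be between 1 and 100")
    return "http://del.icio.us/feeds/json/%s/%s?raw&count=%d" % (
        user, tag, count
        )

print(build_feed_url("effbot"))
print(build_feed_url("effbot", "pil"))
print(build_feed_url("effbot", ("effbot:link", "elementtree"), 100))
```

Keeping this step separate also makes the string-or-tuple handling easy to check without touching the network.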

As can be seen in the raw dumps above, bibliography links also include date: tags. The following snippet sorts the list by publication date:

posts = getposts("effbot", ("effbot:link", "elementtree"), 100)

def getdate(post):
    for tag in post.tags:
        if tag.startswith("date:"):
            return tag[5:]
    return None

posts.sort(key=getdate)

for post in posts:
    print post.link

Running this gives us:

http://www.xml.com/pub/a/2003/02/12/py-xml.html
http://www-128.ibm.com/developerworks/library/x-matters28/
http://www.idealliance.org/papers/dx_xml03/papers/06-02-03/06-02-03.html
http://www.xml.com/pub/a/2004/06/30/py-xml.html
...
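
The sort works because the date: tags use a YYYYMMDD layout, so plain string comparison puts them in date order. In Python 2, a post without a date: tag gets the key None, which sorts before any string. Python 3 refuses to compare None with a string and raises TypeError, so a version for Python 3 needs a string fallback. Here is a minimal sketch, assuming an empty string is an acceptable stand-in; the undated entry in the sample data is made up:

```python
def getdate(tags):
    # return the YYYYMMDD part of the first date: tag, or "" if there is none
    for tag in tags:
        if tag.startswith("date:"):
            return tag[5:]
    return ""

# tag lists from the dump shown earlier, plus one made-up entry without a date
tag_lists = [
    ["python", "xml", "elementtree", "effbot:link", "date:20060224"],
    ["python", "pil"],
    ["python", "xml", "elementtree", "effbot:link", "date:20050524"],
]

# "" sorts before any digit string, so undated entries come first
tag_lists.sort(key=getdate)

for tags in tag_lists:
    print(getdate(tags) or "(no date)", tags)
```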

Generating HTML instead is straightforward; running:

import cgi

print "<ul>"
for post in posts:
    # escape the link as well as the title; "&" is common in URLs
    print '<li><a href="%s">%s</a>' % (
        cgi.escape(post.link, True), cgi.escape(post.title)
        )
print "</ul>"

gives us a list of links that renders like this:

Uche Ogbuji: Simple XML Processing With elementtree
David Mertz: Process XML in Python with ElementTree
~~Uche Ogbuji: Python Paradigms for XML~~ (dead link)
Uche Ogbuji: XML Namespaces Support in Python Tools, Part Three
Uche Ogbuji: Practical SAX Notes
Joseph Reagle: XML ElementTree Data Model
Danny Yoo: elementtree mini-tutorial
Martijn Faassen: lxml and (c)ElementTree
Andrew Dalke: PyProtocols for output generation

To simplify even more, you can move the HTML anchor code into the Post class; by adding a __str__ method, you can simply print the post object to get a link:

class Post(object):
    def __init__(self, item):
        self.link = utf8(item["u"])
        self.title = utf8(item["d"])
        self.description = utf8(item.get("n", ""))
        self.tags = map(utf8, item["t"])
    def __str__(self):
        # escape the link as well as the title; "&" is common in URLs
        return '<a href="%s">%s</a>' % (
            cgi.escape(self.link, True), cgi.escape(self.title)
            )

...

print "<ul>"
for post in posts:
    print "<li>", post
print "</ul>"
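
The same idea carries over to Python 3, where cgi.escape is no longer available in recent releases and html.escape takes its place. The following is a small sketch rather than a drop-in replacement: this Post takes the link and title directly instead of a feed item, and the sample post is made up to exercise the escaping:

```python
import html

class Post(object):
    def __init__(self, link, title):
        # simplified constructor for this sketch (no feed item involved)
        self.link = link
        self.title = title
    def __str__(self):
        # quote=True also escapes quote characters, so the escaped link
        # is safe inside a double-quoted href attribute
        return '<a href="%s">%s</a>' % (
            html.escape(self.link, quote=True), html.escape(self.title)
            )

# made-up post, chosen so that both fields need escaping
post = Post("http://example.com/?a=1&b=2", "Tom & Jerry: <XML> notes")

print("<ul>")
print("<li>%s</li>" % post)
print("</ul>")
```

The %s conversion calls __str__ on the post, so the loop body stays as short as in the version above.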