['term', 'extraction']
Fredrik Lundh | November 2005 | Originally posted to online.effbot.org
Erik Stattin linked to this page (dead link), which led me to this page (dead link), which reminded me of this, which inspired me to whip up this little script:
    # File: YahooTermExtraction.py
    #
    # An interface to Yahoo's Term Extraction service:
    #
    # http://developer.yahoo.net/search/content/V1/termExtraction.html
    #
    # "The Term Extraction Web Service provides a list of significant
    # words or phrases extracted from a larger content."
    #

    import urllib

    try:
        from xml.etree import ElementTree # 2.5 and later
    except ImportError:
        from elementtree import ElementTree

    URI = "http://api.search.yahoo.com"
    URI = URI + "/ContentAnalysisService/V1/termExtraction"

    def termExtraction(appid, context, query=None):
        d = dict(
            appid=appid,
            context=context.encode("utf-8")
            )
        if query:
            d["query"] = query.encode("utf-8")
        result = []
        f = urllib.urlopen(URI, urllib.urlencode(d))
        for event, elem in ElementTree.iterparse(f):
            if elem.tag == "{urn:yahoo:cate}Result":
                result.append(elem.text)
        return result
Usage:
    >>> import urllib
    >>> from YahooTermExtraction import termExtraction
    >>> appid = "/your app id/"
    >>> uri = "/some uri/"
    >>> text = urllib.urlopen(uri).read()
    >>> termExtraction(appid, text)[-5:]
    ['horrible picture', 'logo', 'spammer', 'moron', 'cat mouse']
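The two halves of termExtraction, form-encoding the parameters and pulling the Result elements out of the reply, can be tried without touching the network. The sketch below is written for Python 3, where urlencode lives in urllib.parse, rather than the Python 2 of the script. The helper names build_post_body and extract_terms are made up for illustration, and the sample XML is hand-written, not a captured response: only the namespaced Result tag comes from the script, and the ResultSet root element is an assumption.

```python
import io
from urllib.parse import urlencode
from xml.etree import ElementTree

# Hand-written sample, NOT a real Yahoo reply. Only the
# {urn:yahoo:cate}Result tag is taken from the script; the
# ResultSet root element is an assumption.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<ResultSet xmlns="urn:yahoo:cate">
  <Result>horrible picture</Result>
  <Result>logo</Result>
  <Result>spammer</Result>
</ResultSet>
"""

def build_post_body(appid, context, query=None):
    """Form-encode the same parameters the script sends (hypothetical helper)."""
    d = {"appid": appid, "context": context}
    if query:
        d["query"] = query
    # urlencode percent-encodes str values as UTF-8 by default,
    # so no explicit .encode("utf-8") per field is needed here.
    return urlencode(d).encode("ascii")

def extract_terms(stream):
    """Collect the text of every Result element from an XML byte stream."""
    terms = []
    # iterparse yields "end" events by default, so elem.text is complete.
    for event, elem in ElementTree.iterparse(stream):
        if elem.tag == "{urn:yahoo:cate}Result":
            terms.append(elem.text)
    return terms

print(build_post_body("demo", "cat mouse"))
print(extract_terms(io.BytesIO(SAMPLE)))
```

Passing the bytes from build_post_body as the data argument of urllib.request.urlopen would make the request a POST, which is what the two-argument urllib.urlopen call in the script does.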
(For best results, you should probably run the text through an HTML-to-text conversion before you send it to Yahoo. Some variation of this script might be useful.)
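An HTML-to-text pass of the kind suggested above could be sketched with the standard library's html.parser, in Python 3. This is not the script linked above. The TextExtractor class, the html_to_text helper, and the choice of which tags to skip or treat as word breaks are all assumptions made for illustration, and real-world pages will likely need more care.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, dropping the contents of script and style."""

    # Assumed tag sets; extend to taste.
    SKIP = {"script", "style"}
    BLOCK = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.BLOCK:
            # Block-level boundaries should separate words.
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            if self.skip_depth:
                self.skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append(" ")

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)

def html_to_text(html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())

page = ("<html><head><style>p {color: red}</style></head>"
        "<body><h1>Cat &amp; mouse</h1>"
        "<p>A <b>horrible</b> picture.</p></body></html>")
print(html_to_text(page))
```

The resulting string is what would be passed as the context argument in place of the raw page source.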