Extracting plain text from HTML
Fredrik Lundh | August 2003 | Originally posted to online.effbot.org
As some readers may have noticed, my RSS feed no longer includes full articles; instead, each item contains the first 50-100 words from the corresponding article, as plain unstyled text. I may switch back again when I stop posting “standard python library” articles…
If you want to use something similar in your feeds, here’s the code that does the work. Tweak as necessary:
def textify(html_snippet, maxwords=50):
    import formatter, htmllib, StringIO, string
    class Parser(htmllib.HTMLParser):
        def anchor_end(self):
            self.anchor = None
    class Formatter(formatter.AbstractFormatter):
        pass
    class Writer(formatter.DumbWriter):
        def send_label_data(self, data):
            self.send_flowing_data(data)
            self.send_flowing_data(" ")
    o = StringIO.StringIO()
    p = Parser(Formatter(Writer(o)))
    p.feed(html_snippet)
    p.close()
    words = o.getvalue().split()
    if len(words) <= 2*maxwords:
        return string.join(words)
    return string.join(words[:maxwords]) + " ..."
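The HTMLParser subclass overrides anchor_end to disable anchor footnotes; the DumbWriter subclass overrides send_label_data so that HTML list items get proper labels (in other words, it works around a bug in the standard library). Snippets of up to 2*maxwords words are returned whole; longer ones are cut to the first maxwords words, followed by " ...".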
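The code above targets Python 2: htmllib and string.join are gone in Python 3, and the formatter module was removed in Python 3.10. If you need the same idea on a current interpreter, here's a rough sketch built on html.parser instead. It's not a port of the original classes. The names TextExtractor, BLOCK_TAGS and SKIPPED_TAGS are mine, the set of tags treated as block-level is an arbitrary choice, and the "*" list label is an assumption about what a reasonable label looks like. html.parser doesn't emit anchor footnotes, so there's nothing to disable there.

```python
from html.parser import HTMLParser

# Hypothetical choices, not part of the original code: tags whose
# boundaries should separate words, and tags whose contents are dropped.
BLOCK_TAGS = {"p", "div", "br", "ul", "ol", "li", "tr", "td", "th",
              "h1", "h2", "h3", "h4", "h5", "h6", "blockquote", "pre"}
SKIPPED_TAGS = {"script", "style"}


class TextExtractor(HTMLParser):
    """Collect character data from an HTML snippet."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
        elif tag == "li":
            # Assumed label: mark every list item with "*".
            self.chunks.append(" * ")
        elif tag in BLOCK_TAGS:
            self.chunks.append(" ")

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            if self.skip_depth:
                self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.chunks.append(" ")

    def handle_data(self, data):
        if not self.skip_depth:
            self.chunks.append(data)


def textify(html_snippet, maxwords=50):
    parser = TextExtractor()
    parser.feed(html_snippet)
    parser.close()
    words = "".join(parser.chunks).split()
    # Same rule as the original: short snippets are kept whole,
    # longer ones are cut to the first maxwords words.
    if len(words) <= 2 * maxwords:
        return " ".join(words)
    return " ".join(words[:maxwords]) + " ..."


print(textify("<p>Hello <b>world</b>!</p><ul><li>one</li><li>two</li></ul>"))
```

Inline tags such as b or a add no whitespace, so words that are split by markup stay together; only the tags in BLOCK_TAGS act as word boundaries. Tweak both sets as necessary.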