Using Unicode characters in HTML

Fredrik Lundh | October 2003 | Originally posted to online.effbot.org

Jeremy Hylton writes: “I have been struggling with Unicode for my weblog aggregator. There are several feeds that include Unicode data in the title or description. I tried to generate HTML output with a UTF-8 encoding, but that didn’t seem to work.”

Later in that article, Jeremy asks “Isn’t there some way to specify that stdout should use iso-8859-1 to encode all Unicode strings?”

The codecs module provides stream writers that can be used for this purpose (see codecs.getwriter), but it’s probably easier to write your own file wrapper. Here’s one way to do it:

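import sys

class encoder:
    # file-like wrapper that encodes Unicode text before writing it
    def __init__(self, file, encoding="iso-8859-1"):
        self.file = file
        self.encoding = encoding
    def write(self, text):
        # characters that cannot be encoded become question marks
        self.file.write(text.encode(self.encoding, "replace"))

# route everything that is printed through the wrapper
sys.stdout = encoder(sys.stdout)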

The “replace” argument tells the encoder to replace non-ISO characters with question marks, instead of raising an exception.
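To see what the error handler changes, here is a minimal sketch in Python 3 syntax (the code in this article targets Python 2). The sample string is an assumption, chosen because U+0142 has no ISO-8859-1 equivalent:

```python
# Python 3 sketch; the sample text is illustrative, not from the article.
sample = "z\u0142oty"  # U+0142 (l with stroke) is not in ISO-8859-1

# "replace" substitutes a question mark for each unencodable character.
replaced = sample.encode("iso-8859-1", "replace")
print(replaced)

# The default error handler ("strict") raises an exception instead.
try:
    sample.encode("iso-8859-1")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
print(strict_failed)
```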

A better solution would be to encode non-ISO characters using HTML character references. It’s fairly easy to do this using e.g. a regular expression (use re.sub with a callback that maps runs of non-ISO characters to references), but Python 2.3 makes it even easier: simply change “replace” to “xmlcharrefreplace”, and the encoder replaces any character that it cannot encode with the corresponding character reference:

import sys

class encoder:
    def __init__(self, file, encoding="iso-8859-1"):
        self.file = file
        self.encoding = encoding
    def write(self, text):
        self.file.write(text.encode(self.encoding, "xmlcharrefreplace"))

sys.stdout = encoder(sys.stdout)
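For comparison, the regular expression approach mentioned above could look roughly like this. It is a sketch in Python 3 syntax, and the names (encode_latin1, _to_charrefs) and the sample string are made up for illustration:

```python
import re

# Runs of characters outside the ISO-8859-1 range (U+0000 to U+00FF).
_non_latin1 = re.compile(r"[^\x00-\xff]+")

def _to_charrefs(match):
    # Turn each character in the matched run into a decimal reference.
    return "".join("&#%d;" % ord(ch) for ch in match.group())

def encode_latin1(text):
    # After the substitution, every remaining character fits in ISO-8859-1.
    return _non_latin1.sub(_to_charrefs, text).encode("iso-8859-1")

sample = "caf\u00e9 z\u0142oty"
print(encode_latin1(sample))
```

For this input, the result is the same as sample.encode("iso-8859-1", "xmlcharrefreplace"), which is why the built-in error handler is the simpler choice when it is available.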

To make things even more robust, consider writing the document as plain ASCII, using character references for all non-ASCII characters. ISO-8859-1 is the default encoding for HTTP, but HTML character encodings are something of a mess, and it might be better to be safe than sorry:

import sys

class encoder:
    def __init__(self, file):
        self.file = file
    def write(self, text):
        self.file.write(text.encode("ascii", "xmlcharrefreplace"))

sys.stdout = encoder(sys.stdout)

print u"voffo gör di på detta viset?"

This example prints:

voffo g&#246;r di p&#229; detta viset?
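The codecs module route gives the same result without a hand-written wrapper class. The following is a sketch in Python 3 syntax, not the code used above; the io.BytesIO object is an assumption that stands in for a binary output stream such as sys.stdout.buffer:

```python
import codecs
import io

# Stand-in for a binary output stream such as sys.stdout.buffer.
raw = io.BytesIO()

# getwriter returns a StreamWriter class; the second argument is the
# error handler, just like the second argument to the encode method.
out = codecs.getwriter("ascii")(raw, "xmlcharrefreplace")
out.write("voffo g\u00f6r di p\u00e5 detta viset?\n")

print(raw.getvalue())
```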

Unfortunately, this isn’t enough to solve Jeremy’s problem; it handles non-ASCII characters in titles and descriptions, but what if the title or description contains a less-than character (<) or an ampersand (&)?

Maybe the explicit encoding idea wasn’t so bad, after all. Just make sure to encode the text strings, instead of the resulting HTML:

def encode(text):
    text = text.replace("&", "&amp;") # must be first!
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    text = text.replace("'", "&#39;") # &apos; is defined in XML, but not in HTML 4
    text = text.replace('"', "&quot;")
    return text.encode("ascii", "xmlcharrefreplace")

print "<a href='%s'>%s</a>" % (encode(guid), encode(title))

(If you want to speed things up, you can leave out the > line, and perhaps also the " line; leaving out the latter is only safe if you always use single quotes around HTML attribute values, as in the example above.)
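The “must be first” comment matters because the other replacements introduce ampersands of their own. Here is a small sketch in Python 3 syntax; both helper functions and the sample title are hypothetical, and the final byte encoding is left out to keep the intermediate strings readable:

```python
def escape_markup(text):
    # Same order as in encode(): the ampersand is handled first.
    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;")
    return text

def escape_markup_wrong_order(text):
    # Hypothetical broken variant: the ampersand is handled last, so it
    # re-escapes the "&" that the "<" replacement just introduced.
    text = text.replace("<", "&lt;")
    text = text.replace("&", "&amp;")
    return text

title = "R&D <draft>"
print(escape_markup(title))              # R&amp;D &lt;draft>
print(escape_markup_wrong_order(title))  # R&amp;D &amp;lt;draft>
```

With the wrong order, a browser would display the literal text “&lt;draft>” instead of “<draft>”.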

Alternatively, you can wrap all non-HTML strings in a special wrapper class, and do the encoding in the __str__ method:

class text:
    def __init__(self, text):
        self.text = text
    def __str__(self):
        return encode(self.text)

print "<a href='%s'>%s</a>" % (text(guid), text(title))

If you plan to print the same string in multiple places, the wrapper approach allows you to wrap the strings early on, and forget about the encoding when you print the strings:

guid = text(article.getguid())
title = text(article.gettitle())

...

print "<a href='%s'>%s</a>" % (guid, title)

titles.append(title)

...

for title in titles:
    print "%s<br />" % title

(Again, if you want to speed things up, you can encode the string in the constructor instead of doing it over and over again in the __str__ method.)
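A constructor-based variant might look like the following sketch. It is written in Python 3 syntax, where __str__ has to return a text string, and it uses the standard html.escape function in place of the encode function above; the class name Text, the sample strings, and the example.com URL are assumptions made for illustration:

```python
import html

class Text:
    """Wrap a plain-text string; str() yields HTML-safe ASCII."""
    def __init__(self, text):
        # Escape once, up front. html.escape handles &, <, > and both
        # quote characters; the encode/decode round trip then turns
        # non-ASCII characters into character references.
        escaped = html.escape(text, quote=True)
        self.html = escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")
    def __str__(self):
        # Nothing left to do here; the work was done in the constructor.
        return self.html

title = Text("Sm\u00f6rg\u00e5sbord & <more>")
link = "<a href='%s'>%s</a>" % (Text("http://example.com/?a=1&b=2"), title)
print(link)
```

The trade-off is that the wrapper no longer keeps the original string around, so this only fits if the text is never needed in unescaped form.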