Converting Unicode Strings to 8-bit Strings
Fredrik Lundh | January 2006
A Unicode string holds characters from the Unicode character set.
If you want an 8-bit string, you need to decide what encoding you want to use. Common encodings are US-ASCII (which is the default if you convert from Unicode to 8-bit strings in Python), ISO-8859-1 (aka Latin-1), and UTF-8 (a variable-width encoding that can represent all Unicode strings).
For example, if you want Latin-1 strings, you can use one of:
s = u.encode("iso-8859-1") # fail if some character cannot be converted s = u.encode("iso-8859-1", "replace") # instead of failing, replace with ? s = u.encode("iso-8859-1", "ignore") # instead of failing, leave it out
If you want an ASCII string, replace “iso-8859-1” above with “ascii” or “us-ascii”.
If you want to output the data to a web browser or an XML file, you can use:
import cgi s = cgi.escape(u).encode("ascii", "xmlcharrefreplace")
The cgi.escape function converts reserved characters (< > and &) to character entities (<, > and &), and the xmlcharrefreplace flag tells the encoder to use character references (&#nn;) for any character that cannot be encoded in the given encoding. The browser (or XML parser) at the other end will convert things back to Unicode.
Note that cgi.escape doesn’t escape quotes by default. To use the value in an attribute, you need to pass in an extra flag to escape, and put the result in double quotes:
s = 'attr="%s"' % cgi.escape(u,1).encode("ascii", "xmlcharrefreplace")
The unaccent.py (dead link) script shows how to strip off accents from latin characters:
import unicodedata, sys CHAR_REPLACEMENT = { # latin-1 characters that don't have a unicode decomposition 0xc6: u"AE", # LATIN CAPITAL LETTER AE 0xd0: u"D", # LATIN CAPITAL LETTER ETH 0xd8: u"OE", # LATIN CAPITAL LETTER O WITH STROKE 0xde: u"Th", # LATIN CAPITAL LETTER THORN 0xdf: u"ss", # LATIN SMALL LETTER SHARP S 0xe6: u"ae", # LATIN SMALL LETTER AE 0xf0: u"d", # LATIN SMALL LETTER ETH 0xf8: u"oe", # LATIN SMALL LETTER O WITH STROKE 0xfe: u"th", # LATIN SMALL LETTER THORN } ## # Translation dictionary. Translation entries are added to this # dictionary as needed. class unaccented_map(dict): ## # Maps a unicode character code (the key) to a replacement code # (either a character code or a unicode string). def mapchar(self, key): ch = self.get(key) if ch is not None: return ch de = unicodedata.decomposition(unichr(key)) if de: try: ch = int(de.split(None, 1)[0], 16) except (IndexError, ValueError): ch = key else: ch = CHAR_REPLACEMENT.get(key, key) self[key] = ch return ch if sys.version >= "2.5": # use __missing__ where available __missing__ = mapchar else: # otherwise, use standard __getitem__ hook (this is slower, # since it's called for each character) __getitem__ = mapchar if __name__ == "__main__": text = u""" "Jo, når'n da ha gått ett stôck te, så kommer'n te e å, å i åa ä e ö." "Vasa", sa'n. "Å i åa ä e ö", sa ja. "Men va i all ti ä dä ni säjer, a, o?", sa'n. "D'ä e å, vett ja", skrek ja, för ja ble rasen, "å i åa ä e ö, hörer han lite, d'ä e å, å i åa ä e ö." "A, o, ö", sa'n å dämmä geck'en. Jo, den va nôe te dum den. (taken from the short story "Dumt fôlk" in Gustaf Fröding's "Räggler å paschaser på våra mål tå en bonne" (1895). """ print text.translate(unaccented_map()) # note that non-letters are passed through as is; you can use # encode("ascii", "ignore") to get rid of them. alternatively, # you can tweak the translation dictionary to return None for # characters >= "\x80". map = unaccented_map() print repr(u"12\xbd inch".translate(map)) print repr(u"12\xbd inch".translate(map).encode("ascii", "ignore"))