Python Unicode Objects
Some Observations on Working With Non-ASCII Character Sets
This note provides some brief information on best practices for working with non-ASCII data in Python 2.0 and later. Like everything else on this site, this is a work in progress.
Updated June 21, 2004 | February 11, 2002 | Fredrik Lundh
Python’s Unicode string type stores characters from the Unicode character set. In this set, each distinct character has its own number, the code point. Unicode supports more than one million code points. Unicode characters don’t have an encoding; each character is represented by its code point. The Unicode string type uses an internal storage format that your code does not need to know about; in your Python code, Unicode strings simply appear as sequences of characters, just like 8-bit strings appear as sequences of bytes.
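To make the difference between characters and bytes concrete, here is a minimal sketch. The sample word and the variable names are only illustrative, and the `u"..."` literals are used so that the snippet should run unchanged on Python 2 and on Python 3.3 or later.

```python
# A Unicode string is a sequence of characters (code points).
text = u"caf\xe9"              # the word "café"; \xe9 is code point U+00E9

length = len(text)             # counts characters, not bytes
code_point = ord(text[3])      # the number assigned to the last character

# Encoding turns characters into bytes; the byte count depends on the encoding.
as_latin1 = text.encode("iso-8859-1")   # one byte per character for this text
as_utf8 = text.encode("utf-8")          # U+00E9 needs two bytes in UTF-8
```

The same four characters occupy four bytes in ISO-8859-1 and five bytes in UTF-8, which is why the encoding has to be known before the bytes can be interpreted.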
Observations:
-
Text files always contain encoded text, not characters. Each character in the text is encoded as one or more bytes in the file.
-
Most popular encodings (UTF-8, ISO-8859-X, etc) are supersets of ASCII. This means that the first 128 characters have the usual meaning, and that the usual characters are used for line endings. In other words, readline() will work just fine.
-
You can mix Python Unicode strings with 8-bit Python strings, as long as the 8-bit string only contains ASCII characters. A Unicode-aware library may choose to use 8-bit strings for text that only contains ASCII, to save space and time.
-
If you read a line of text from a file, you get bytes, not characters.
-
To decode an encoded string into a string of well-defined characters, you have to know what encoding it uses.
-
To decode a string, use the decode() method on the input string, and pass it the name of the encoding:
    fileencoding = "iso-8859-1"
    raw = file.readline()
    txt = raw.decode(fileencoding)
(The result is a Python Unicode string.)
The decode method was added in Python 2.2. In earlier versions (or if you think it reads better), use the unicode constructor instead:
txt = unicode(raw, fileencoding)
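The read-then-decode steps above can be sketched end to end as follows. This is an illustration under stated assumptions: an in-memory `io.BytesIO` object (available from Python 2.6) stands in for a file opened in binary mode, the sample bytes are made up, and the encoding is assumed to be known in advance.

```python
import io

fileencoding = "iso-8859-1"    # assumed to be known in advance

# Stand-in for a file opened in binary mode: two lines of ISO-8859-1 text.
stream = io.BytesIO(b"caf\xe9\nna\xefve\n")

lines = []
while True:
    raw = stream.readline()    # bytes, split at the ASCII newline
    if not raw:                # an empty result means end of file
        break
    lines.append(raw.decode(fileencoding))   # bytes -> Unicode string
```

Because ISO-8859-1 is a superset of ASCII, `readline()` finds the line endings in the raw bytes, and each line is decoded only after it has been read.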
-
Python’s regular expression engine supports Unicode. You can apply the same pattern to either 8-bit (encoded) or Unicode strings. To create a regular expression pattern that uses Unicode character classes for \w (and \s, and \b), use the “(?u)” flag prefix, or the re.UNICODE flag:
    pattern = re.compile("(?u)pattern")
    pattern = re.compile("pattern", re.UNICODE)
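As a small illustration of the Unicode character classes, the sketch below uses the "(?u)" prefix to match words that contain non-ASCII letters. The sample text and names are made up. Note that on Python 3, patterns given as text strings already use Unicode classes by default, so the flag is redundant there but harmless.

```python
import re

# (?u) asks for Unicode character classes, so \w also matches
# letters outside the ASCII range.
word = re.compile(u"(?u)\\w+")

text = u"caf\xe9 na\xefve"     # "café naïve"
words = word.findall(text)     # each word is returned whole
```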
-
To write a Unicode string to a file or other device, you have to convert it to the encoding used by the file. The encode method converts from Unicode to an encoded string.
out = txt.encode(encoding)
If the string contains characters that cannot be represented in the given encoding, Python raises an exception. You can change this by passing in a second argument to encode:
    # skip bad chars
    out = txt.encode(encoding, "ignore")
    # replace bad chars with "?"
    out = txt.encode(encoding, "replace")
For more on string encoding, see Converting Unicode Strings to 8-bit Strings.
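The three error-handling behaviours described above can be compared side by side. This is a sketch with made-up sample text; it relies on the fact that the euro sign (U+20AC) has no representation in ISO-8859-1, and it catches the general `UnicodeError` class rather than a more specific subclass.

```python
text = u"na\xefve \u20ac"       # the euro sign is not in ISO-8859-1
encoding = "iso-8859-1"

# Default ("strict") handling: an unencodable character raises an exception.
try:
    text.encode(encoding)
    strict_failed = False
except UnicodeError:
    strict_failed = True

skipped = text.encode(encoding, "ignore")     # drops the euro sign
replaced = text.encode(encoding, "replace")   # substitutes "?" for it
```

Both "ignore" and "replace" lose information, so they are best reserved for output where an approximate result is acceptable.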
-
To print a Unicode string to your output device, you have to convert it to the encoding used by your terminal, again using the encode() method. You can use the locale.getdefaultlocale() function to get the encoding of the default locale, which usually matches the one used for output.
    import locale
    language, output_encoding = locale.getdefaultlocale()
    print txt.encode(output_encoding)
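A slightly more defensive variant is sketched below. It is not the approach shown above: it swaps in `locale.getpreferredencoding()` (available since Python 2.3), because `getdefaultlocale()` can return `None` for the encoding when the locale cannot be determined. The helper name `encode_for_output` and the "ascii" fallback are assumptions made for this example.

```python
import locale

def encode_for_output(text, encoding=None):
    """Encode a Unicode string for an output device (hypothetical helper)."""
    if encoding is None:
        # Ask the locale machinery for its preferred text encoding;
        # fall back to ASCII if it reports nothing (an assumed default).
        encoding = locale.getpreferredencoding() or "ascii"
    # "replace" keeps output going even if a character cannot be encoded.
    return text.encode(encoding, "replace")

ascii_bytes = encode_for_output(u"caf\xe9", "ascii")
latin1_bytes = encode_for_output(u"caf\xe9", "iso-8859-1")
```

Passing the encoding explicitly, as in the two calls above, is useful when the target device is known to differ from the locale default.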
There are lots of shortcuts in Python, including coded streams, using default locales for pattern matching, ISO-8859-1 as a subset of Unicode, etc, but that’s outside the scope of this note. At least for the moment.