Using Non-Standard Encodings in cElementTree
Updated December 15 | December 1, 2005 | Fredrik Lundh
Update 2005-12-04: Changed to use codecs.open instead of plain open, to avoid problems with variable-width encodings. Thanks to “mark_m”.
Update 2005-12-15: This has been fixed in cElementTree 1.0.5, which supports all 8-bit encodings provided by Python’s Unicode implementation.
Older versions of cElementTree (1.0.4 and earlier) only support the encodings provided by the expat library itself:
- UTF-8
- UTF-16
- US-ASCII
- ISO-8859-1
Support for more encodings will be added to a future release.
To work around this in the current version, you can use the XMLParser class directly, and “recode” the data stream in Python:
    import cElementTree as ET
    import codecs

    def myparser(file, encoding):
        # decode the file incrementally from its native encoding
        f = codecs.open(file, "r", encoding)
        # override the encoding declared in the file; the parser only sees UTF-8
        p = ET.XMLParser(encoding="utf-8")
        while 1:
            s = f.read(65536)
            if not s:
                break
            p.feed(s.encode("utf-8"))
        return ET.ElementTree(p.close())

    tree = myparser("example.xml", "windows-1252")
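For readers on a current Python, where the standalone cElementTree module is no longer available, the same recoding idea can be sketched with the standard library's xml.etree.ElementTree. This is only an illustration under that assumption, not part of the original recipe; the names parse_recoded and chunk_size are made up for the example.

```python
import xml.etree.ElementTree as ET


def parse_recoded(path, encoding, chunk_size=65536):
    """Parse an XML file stored in `encoding` by feeding it to expat as UTF-8.

    Hypothetical helper: the name and the chunk_size parameter are
    illustrative, not part of any library API.
    """
    # The encoding override tells the parser to treat the input as UTF-8,
    # regardless of the encoding named in the file's XML declaration.
    parser = ET.XMLParser(encoding="utf-8")
    # Text mode decodes incrementally, so a multi-byte sequence is never
    # split across two chunks. newline="" leaves line endings untouched.
    with open(path, "r", encoding=encoding, newline="") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            parser.feed(chunk.encode("utf-8"))
    return ET.ElementTree(parser.close())
```

A call such as parse_recoded("example.xml", "windows-1252") mirrors the myparser call in the original recipe.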
To determine the encoding used in the file, you can use something like Paul Prescod’s Auto-detect XML encoding recipe.
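As a rough idea of what such detection involves, here is a minimal sketch that checks for a byte order mark and then looks for an encoding name in the XML declaration. It is not Paul Prescod's recipe and is much less thorough: it assumes the declaration is written in an ASCII-compatible encoding, so it will miss UTF-16 files without a byte order mark, EBCDIC, and similar cases. The function name sniff_xml_encoding and its default argument are invented for this example.

```python
import codecs
import re

# Byte order marks, with the UTF-32 ones first so that the UTF-32 LE mark
# (FF FE 00 00) is not mistaken for the UTF-16 LE mark (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

# Matches the encoding pseudo-attribute of an XML declaration at the very
# start of the data, e.g. <?xml version="1.0" encoding="windows-1252"?>
_DECLARATION = re.compile(
    rb"""^<\?xml[^>]*?\bencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def sniff_xml_encoding(path, default="utf-8"):
    """Guess the encoding of an XML file (hypothetical helper)."""
    with open(path, "rb") as f:
        head = f.read(1024)
    # A byte order mark takes precedence over the declaration.
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    match = _DECLARATION.match(head)
    if match:
        return match.group(1).decode("ascii").lower()
    # No byte order mark and no declared encoding: fall back to the default.
    return default
```

The returned name can then be passed as the encoding argument of the recoding parser shown earlier.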