I have not yet played with Python 3, which reportedly makes working with Unicode easier.
In Python 2.x, working with Unicode can be annoying, but one rule of thumb makes it much easier and lets you keep your sanity:
Always use unicode strings internally. Decode whatever you read/receive. Encode whatever you write/send.
I've recently done some work on a project to generate ReStructuredText
(as an intermediate form on the way to generating HTML). The input data
has Unicode sprinkled throughout (in UTF-8). I kept getting
"UnicodeDecodeError: 'ascii' codec can't decode byte" exceptions in
various places, until I applied that rule everywhere:
- When reading, decode from UTF-8.
- Use Unicode strings internally -- everywhere. E.g.:
  u'this formatted %s string and %s' % (decoded_string1, decoded_string2)
- When writing, encode to UTF-8.
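
Here is a minimal sketch of the three steps above, assuming the input arrives as UTF-8 encoded bytes. The function names and the sample data are made up for illustration and are not taken from the actual project. It should behave the same under Python 2.6+ and Python 3.3+.

```python
# -*- coding: utf-8 -*-
# Hypothetical helpers illustrating "decode on input, Unicode inside,
# encode on output". None of these names come from a real library.


def read_text(raw_bytes):
    """Decode incoming UTF-8 bytes into a Unicode string."""
    return raw_bytes.decode('utf-8')


def format_heading(title, author):
    """Work purely with Unicode strings internally."""
    return u'%s (by %s)' % (title, author)


def write_text(text):
    """Encode a Unicode string to UTF-8 bytes just before writing."""
    return text.encode('utf-8')


# Sample input: UTF-8 bytes containing non-ASCII characters.
raw_title = b'Caf\xc3\xa9 menu'
raw_author = b'Zo\xc3\xab'

title = read_text(raw_title)            # decode at the boundary
author = read_text(raw_author)
line = format_heading(title, author)    # Unicode everywhere in between
output = write_text(line)               # encode at the boundary
```

The point of keeping the decode and encode calls at the edges is that no byte string ever gets mixed with a Unicode string in the middle, which is where Python 2 would otherwise attempt an implicit ASCII decode.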
Hat tip to nosklo on StackOverflow for mentioning this simple rule.