Preserve Your Sanity When Dealing with Unicode in Python

I have not yet played with python 3, which sounds like it makes working with Unicode easier.

In python 2.x, working with Unicode can be annoying, but when you remember this rule of thumb, it's much easier and allows you to keep your sanity:

Always use unicode strings internally. Decode whatever you read/receive. Encode whatever you write/send.

I've recently done some work on a project to generate ReStructuredText (as an intermediate form on the way to generating HTML). The input data has Unicode sprinkled throughout (in UTF-8). I kept getting UnicodeDecodeError: 'ascii' codec can't decode byte exceptions in various places, until I applied that rule everywhere:

  • When reading, decode from UTF-8.
  • Use Unicode strings u'E.g. this formatted %s string' % (decoded_string1, decoded_string2) internally -- everywhere.
  • When writing, encode to UTF-8.

Hat tip to nosklo on StackOverflow for mentioning this simple rule.

Posted on 2011-10-05 by brian in python .
Comments on this post are closed. If you have something to share, please send me email.