Monday, October 22, 2007

Unicode Madness

I don't think I completely understand Unicode.

>>> x = u'\xbb'
>>> x.encode('ascii', 'ignore')
''
>>> x.decode('ascii', 'ignore')
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\xbb' in position 0: ordinal not in range(128)
>>>


Why does the decode call raise an exception even though I've asked it to 'ignore' Unicode problems?

5 comments:

Tim Golden said...

This is what I consider a wart, but I'm sure it seemed like a good idea at the time.

A "decode" in Python is going *from* bytes *into* unicode. You're starting from unicode, so the code has to get into bytes first in order to get back into unicode (!) and does an *implicit* encode. Which fails, because it's using ascii and your string is non-ascii-encodable.
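The two-step behaviour described above can be sketched roughly as follows. This is only an emulation written in Python 3 syntax, not the interpreter's actual code; `py2_style_unicode_decode` is a hypothetical helper name, and the sketch assumes the default codec is ASCII, as in the traceback from the post.

```python
def py2_style_unicode_decode(text, encoding, errors="strict"):
    """Hypothetical emulation of calling .decode() on a Python 2 unicode object."""
    # Step 1: the implicit encode. It uses the default codec (assumed ASCII)
    # in strict mode; the caller's `errors` argument is not consulted here.
    raw = text.encode("ascii")
    # Step 2: the decode the caller actually asked for, with their `errors`.
    return raw.decode(encoding, errors)


# Pure ASCII text survives both steps.
ascii_result = py2_style_unicode_decode("abc", "ascii", "ignore")
print(ascii_result)

# Non-ASCII text fails in step 1, before 'ignore' ever gets a chance to apply.
failed_step = None
try:
    py2_style_unicode_decode("\xbb", "ascii", "ignore")
except UnicodeEncodeError as exc:
    failed_step = "encode"
    print(exc)
```

Seen this way, the 'ignore' argument only reaches the second step, which is why the exception is a UnicodeEncodeError rather than a decode error.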

Quite why the unicode object even *has* a decode method I'm not sure. It seems to go against the "explicit is better than implicit" bit, but I presume there was some rationale at the time.

Fuzzyman said...

I guess API consistency is one reason that unicode.decode exists. But there are also some pretty weird encodings out there, and if you have a unicode data source then you may want to decode from base64 into a unicode string, I suppose?

Lennart Regebro said...

The decode is there because both unicode and string inherit from a common base class that has the decode and encode methods, I've been told.

I think the unicode.decode and string.encode are going away in Python 3.
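For readers on Python 3, a quick check along these lines shows how the split ended up there: text (`str`) only encodes and `bytes` only decodes. The variable names are just for illustration.

```python
text = "\xbb"                 # a str (text) in Python 3
data = text.encode("utf-8")   # bytes

# str has encode() but no decode(); bytes has decode() but no encode().
text_has_decode = hasattr(text, "decode")
bytes_has_encode = hasattr(data, "encode")
print(text_has_decode, bytes_has_encode)

# The only round trip available is text -> bytes -> text.
round_trip = data.decode("utf-8")
print(round_trip == text)
```

With no decode method on text, the confusing implicit encode from the original post cannot happen.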

In short, you get an error because what you are doing, decoding unicode into unicode, makes no sense. Python here breaks its "explicit" rule and tries to guess what you want, and it guesses that the unicode thingy you gave it is supposed to be a string. So it tries to convert it to a string. That's what fails.

"if you have a unicode data source then you may want to decode from base64 into a unicode string I suppose?"

Well, yeah, but that's base64 encoding, not unicode encoding, so I'd expect you want to use b64decode for that...
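A small sketch of that route, in Python 3 syntax and using the standard library's base64 module. The sample payload is made up for illustration, and UTF-8 is an assumed character encoding for the underlying bytes.

```python
import base64

original = "\xbb"

# Build a base64 payload from the UTF-8 bytes of the text (illustrative data).
payload = base64.b64encode(original.encode("utf-8")).decode("ascii")
print(payload)

# Going back takes two separate, explicit steps:
raw_bytes = base64.b64decode(payload)   # base64 text -> bytes
recovered = raw_bytes.decode("utf-8")   # bytes -> text (assumed UTF-8)
print(recovered == original)
```

Keeping the base64 step and the character-decoding step apart makes it clear which one is turning bytes into text.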

Christian said...

Educational stuff here - thanks for the post and the ensuing discussion.

kumar said...

Did you mean to do this:

>>> x = u'\xbb'
>>> ascii_x = x.encode('ascii', 'ignore')
>>> ascii_x
''
>>> ascii_x.decode('ascii', 'ignore')
u''

? ...which shows that the unicode character is ignored instead of encoded.
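For comparison, here is a short sketch of what the standard error handlers do with the same character, written in Python 3 syntax (so the encoded results are bytes rather than Python 2 strings).

```python
x = "\xbb"

# Encode the same character to ASCII under several standard error handlers.
handlers = ("ignore", "replace", "xmlcharrefreplace", "backslashreplace")
results = {name: x.encode("ascii", name) for name in handlers}
for name in handlers:
    print(name, results[name])

# On the decode side the handlers apply to bytes that are invalid for the codec.
ignored = b"\xbb".decode("ascii", "ignore")
replaced = b"\xbb".decode("ascii", "replace")
print(repr(ignored), repr(replaced))
```

'ignore' silently drops the character, while the other handlers leave some visible trace of it in the output.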

I never understood unicode better than when I posted a blog entry *as if* I understood it, then proceeded to get schooled in the comments (and even hazed by reddit users). Luckily, I corrected the post along the way! :)

here: What I Thought I Knew About Unicode in Python Amounted To Nothing
