Comments on "Entity Crisis: Unicode Madness"

Kumar McMillan (2007-10-23 06:26):

Did you mean to do this:

>>> x = u'\xbb'
>>> ascii_x = x.encode('ascii', 'ignore')
>>> ascii_x
''
>>> ascii_x.decode('ascii', 'ignore')
u''

...which shows that the unicode character is ignored instead of encoded.

I never understood Unicode better than when I posted a blog entry *as if* I understood it, then proceeded to get schooled in the comments (and even hazed by reddit users). Luckily, I corrected the post along the way! :)

Here: What I Thought I Knew About Unicode in Python Amounted To Nothing (http://farmdev.com/thoughts/23/what-i-thought-i-knew-about-unicode-in-python-amounted-to-nothing/)

Christian (2007-10-23 04:43):

Educational stuff here - thanks for the post and the ensuing discussion.

Lennart Regebro (2007-10-22 18:02):

The decode is there because both unicode and str inherit from a common base class that has the decode and encode methods, I've been told.

I think unicode.decode and str.encode are going away in Python 3.

In short, you get an error because what you are doing, decoding unicode into unicode, makes no sense. Python here breaks its own "explicit is better than implicit" rule and tries to guess what you want: it guesses that the unicode object you gave it is supposed to be a byte string, so it tries to convert it to one. That's what fails.

"if you have a unicode data source then you may want to decode from base64 into a unicode string I suppose?"

Well, yeah, but that's base64 encoding, not a Unicode encoding, so I'd expect you'd want to use b64decode for that...

Michael Foord (2007-10-22 17:38):

I guess API consistency is one reason that unicode.decode exists - but there are also some pretty weird encodings out there, and if you have a unicode data source then you may want to decode from base64 into a unicode string, I suppose?

Anonymous (2007-10-22 15:54):

This is what I consider a wart, but I'm sure it seemed like a good idea at the time.

A "decode" in Python goes *from* bytes *into* unicode. You're starting from unicode, so the code has to get into bytes first in order to get back into unicode (!), and it does an *implicit* encode. That fails because it uses ASCII and your string is not ASCII-encodable.

Quite why the unicode object even *has* a decode method I'm not sure. It seems to go against the "explicit is better than implicit" bit, but I presume there was some rationale at the time.
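As the commenters predicted, Python 3 did remove this wart: str grew only an .encode() method and bytes only a .decode(), so the implicit ASCII round-trip can no longer happen. A minimal sketch of the modern behavior (Python 3, standard library only):

```python
# Python 3: text and bytes are strictly separated.
s = "\xbb"  # a str containing U+00BB '»' (the u'' prefix is optional in Py3)

# Encoding to ASCII with 'ignore' silently drops the non-ASCII character,
# just as in Kumar's Python 2 session:
ascii_bytes = s.encode("ascii", "ignore")
assert ascii_bytes == b""

# An explicit encoding round-trips losslessly:
utf8_bytes = s.encode("utf-8")
assert utf8_bytes == b"\xc2\xbb"
assert utf8_bytes.decode("utf-8") == s

# The implicit-encode trap is gone: str simply has no .decode() method,
# so "decoding unicode into unicode" is now a plain AttributeError
# instead of a surprising UnicodeEncodeError.
assert not hasattr(s, "decode")
```

The design choice here matches the thread's diagnosis: rather than guessing an intermediate byte encoding, Python 3 makes every text/bytes conversion explicit.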