Python character encoding conversion

花家地南街八号 2014-03-20


Python 2 represents text strings internally as unicode,

so any conversion between encodings naturally goes through unicode as the "intermediate encoding".

For example:

to convert gbk to utf-8,

gbk --> unicode --> utf-8

which breaks down into two steps:

1. gbk --> unicode

Python syntax: your_string.decode("gbk")

2. unicode --> utf-8

Python syntax: your_string.decode("gbk").encode("utf-8")
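The two steps above can be sketched as a small round trip. This is only an illustration: the sample text and the variable names are assumptions, and the GBK input is produced by encoding a unicode literal so that the example is self-contained. It should run unchanged on Python 2.7 and Python 3.

```python
# Sketch of the gbk --> unicode --> utf-8 chain.
# The sample text (two CJK characters, U+4E2D U+6587) is an assumption.
text = u"\u4e2d\u6587"

# Simulate input that arrives as GBK-encoded bytes.
gbk_bytes = text.encode("gbk")

# Step 1: gbk --> unicode
as_unicode = gbk_bytes.decode("gbk")

# Step 2: unicode --> utf-8
utf8_bytes = as_unicode.encode("utf-8")

print(repr(gbk_bytes))
print(repr(utf8_bytes))
```

The same text takes 4 bytes in GBK and 6 bytes in UTF-8, which is why the bytes cannot simply be relabelled and have to pass through unicode.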

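A string that is already unicode can be encoded directly, but it should not be decoded again. In Python 2, calling decode on a unicode object first encodes it implicitly with the ASCII codec, which fails for non-ASCII characters.

So the code has to check the type first.

The built-in function isinstance() (from Python 2's __builtin__ module)

can test a value against any Python type, including unicode: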

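import chardet  # third-party; chardet.detect guesses the encoding and reports a confidence

if isinstance(yourchar, unicode):
    communicate = yourchar.encode("utf-8")  # already unicode: encode directly
else:
    # byte string: guess its encoding, decode to unicode, then encode as UTF-8
    # (detect may return None for the encoding, so fall back to utf-8)
    type_decode = chardet.detect(yourchar)["encoding"] or "utf-8"
    communicate = yourchar.decode(type_decode, errors='ignore').encode("utf-8")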
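For cases where chardet is not installed, the same check can be wrapped in a function that tries a short list of candidate encodings instead of guessing statistically. This is a different technique from chardet's detection, and everything named here (`to_utf8`, `candidates`, `text_type`) is hypothetical and not from the original post. A minimal sketch, assuming the input is either text or bytes in one of the listed encodings:

```python
import sys

# The text type is `unicode` on Python 2 and `str` on Python 3.
text_type = unicode if sys.version_info[0] == 2 else str  # noqa: F821


def to_utf8(value, candidates=("utf-8", "gbk")):
    """Return `value` as UTF-8 bytes (hypothetical helper)."""
    if isinstance(value, text_type):
        # Already unicode: encode directly, never decode.
        return value.encode("utf-8")
    for encoding in candidates:
        # Byte string: the first candidate that decodes cleanly wins.
        try:
            return value.decode(encoding).encode("utf-8")
        except UnicodeDecodeError:
            continue
    # No candidate matched: decode leniently instead of raising.
    return value.decode(candidates[0], "ignore").encode("utf-8")


print(repr(to_utf8(u"\u4e2d\u6587")))        # text input
print(repr(to_utf8(b"\xd6\xd0\xce\xc4")))    # GBK-encoded bytes
```

The order of `candidates` matters: UTF-8 is tried first because it rejects most byte sequences that are not UTF-8, while GBK accepts many sequences that were never meant as GBK.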

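errors:

When no encoding is given, Python 2 decodes with the ASCII codec, which only covers byte values 0-127, so any byte of 128 or above cannot be decoded. The errors argument sets how such bytes are handled, with three levels:

errors = 'strict' # the default: raise UnicodeDecodeError on any byte that cannot be decoded (here, 128 or above)

errors = 'replace' # substitute U+FFFD, 'REPLACEMENT CHARACTER'

errors = 'ignore' # drop the undecodable bytes, so the result is shorter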

For example:

>>> unicode('\x80abc', errors='strict')

---------------------------------------------------------------------------

UnicodeDecodeError Traceback (most recent call last)

/home/tom/<ipython-input-1-8eef8e091bcd> in <module>()

----> 1 unicode('\x80abc',errors='strict')

UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0: ordinal not in range(128)

>>> unicode('\x80abc', errors='replace')

u'\ufffdabc'

>>> unicode('\x80abc', errors='ignore')

u'abc'
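The session above relies on the Python 2 `unicode()` built-in, which does not exist in Python 3. The same three behaviours can be reproduced with `bytes.decode` and an explicit codec, which should work on both versions. A small sketch, where the variable names are assumptions:

```python
raw = b"\x80abc"  # 0x80 is outside the 0-127 range that ASCII covers

# 'strict': the undecodable byte raises UnicodeDecodeError.
try:
    raw.decode("ascii", "strict")
    strict_result = "decoded"
except UnicodeDecodeError:
    strict_result = "UnicodeDecodeError"

# 'replace': the bad byte becomes U+FFFD, so the length stays 4.
replaced = raw.decode("ascii", "replace")

# 'ignore': the bad byte is dropped, so the result is one character shorter.
ignored = raw.decode("ascii", "ignore")

print(strict_result)
print(repr(replaced))
print(repr(ignored))
```

Note that 'ignore' loses data silently, so 'replace' is usually the easier option to debug when the input encoding is uncertain.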

See:

http://docs./2/howto/unicode.html#the-unicode-type

History of Character Codes:

http://docs./2/howto/unicode.html#history-of-character-codes

In the 1980s, almost all personal computers were 8-bit, meaning that bytes could hold values ranging from 0 to 255. ASCII codes only went up to 127, so some machines assigned values between 128 and 255 to accented characters. Different machines had different codes, however, which led to problems exchanging files. Eventually various commonly used sets of values for the 128-255 range emerged. Some were true standards, defined by the International Standards Organization, and some were de facto conventions that were invented by one company or another and managed to catch on.

255 characters aren’t very many. For example, you can’t fit both the accented characters used in Western Europe and the Cyrillic alphabet used for Russian into the 128-255 range because there are more than 127 such characters.

You could write files using different codes (all your Russian files in a coding system called KOI8, all your French files in a different coding system called Latin1), but what if you wanted to write a French document that quotes some Russian text? In the 1980s people began to want to solve this problem, and the Unicode standardization effort began.

Unicode started out using 16-bit characters instead of 8-bit characters. 16 bits means you have 2^16 = 65,536 distinct values available, making it possible to represent many different characters from many different alphabets; an initial goal was to have Unicode contain the alphabets for every single human language. It turns out that even 16 bits isn’t enough to meet that goal, and the modern Unicode specification uses a wider range of codes, 0-1,114,111 (0x10ffff in base-16).

There’s a related ISO standard, ISO 10646. Unicode and ISO 10646 were originally separate efforts, but the specifications were merged with the 1.1 revision of Unicode.

(This discussion of Unicode’s history is highly simplified. I don’t think the average Python programmer needs to worry about the historical details; consult the Unicode consortium site listed in the References for more information.)