Python encoding error: UnicodeDecodeError: 'utf8' codec can't decode

ProgramBird 2014-07-14

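I ran into this while writing a map script for Hive. The situation was as follows:

The map file calls shared functions written by colleagues. When it ran inside the Hive script, the output step raised an error. When I ran the Python script on its own and redirected the output to a file, there was no error at all, which was very puzzling. The error reported under Hive was: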

Traceback (most recent call last):
  File "search_map_script_py", line 114, in <module>
    dataStreamProcess(line)
  File "search_map_script_py", line 108, in dataStreamProcess
    print '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s' % (pbrand, ptype, psize, res, softv, organic, appfrom, searchtype, cuid, search_tm)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 5: ordinal not in range(128)
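The 'ascii' codec in this message suggests that a byte string containing non-ASCII bytes was implicitly converted to text during the % formatting, which Python 2 does with the ASCII codec when str and unicode values are mixed. The minimal sketch below reproduces the same kind of failure in Python 3 syntax, where the decode has to be written explicitly. The sample value and the helper name try_decode are hypothetical and not taken from the original script.

```python
# A UTF-8 encoded Chinese brand name as raw bytes (hypothetical sample value).
raw_brand = "华为".encode("utf-8")

def try_decode(data, encoding):
    """Return the decoded text, or the error message if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as err:
        return "error: %s" % err

# ASCII only covers bytes 0x00-0x7f, so a UTF-8 lead byte such as 0xe5 fails.
print(try_decode(raw_brand, "ascii"))
# Decoding with the encoding the bytes were actually written in succeeds.
print(try_decode(raw_brand, "utf-8"))
```

Whether the error appears therefore depends on which values reach the format string and what type they have, not on the print statement itself.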

So I imported the sys module in the map script and set the default encoding to UTF-8:

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

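This fixed the initial problem, and the script ran normally most of the time, but it still failed now and then. Running the Python script on its own and writing the output to a text file always worked. The error reported under Hive was: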

Traceback (most recent call last):
  File "searchBox_user_map_script_py", line 212, in <module>
    dataStreamProcess(line)
  File "user_map_script_py", line 201, in dataStreamProcess
    print '%s\t%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s' % (baiduapp_uid, imei, actionType, os, os_ver, net, resolution,event_country,event_province,event_city,softv,app_from, loc_country, loc_province, loc_city, phoneBrand, phoneType, phoneSize, netOper, year, month, day, tstamp)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 25: invalid start byte

I struggled with this problem for about a week and finally solved it with the help of the blog post quoted below:

print '%s\t%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s' % (baiduapp_uid, imei, actionType, os, os_ver, net, resolution,event_country,event_province,event_city,softv,app_from, loc_country, loc_province, loc_city, unicode(phoneBrand).encode('utf-8'), unicode(phoneType).encode('utf-8'), phoneSize, netOper, year, month, day, tstamp)

I had already determined beforehand that the offending fields were phoneBrand and phoneType.
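The fix above converts the two problem fields explicitly before formatting. A more general variant of the same idea is sketched below in Python 3 syntax: every field is normalized to text with a known encoding before the output line is built. The helper names to_text and format_row, the GBK fallback, and the sample values are assumptions for illustration and are not part of the original script. Trying encodings in order is only a heuristic, since some byte sequences are valid in more than one encoding.

```python
def to_text(value, encodings=("utf-8", "gbk")):
    """Convert a field to text, trying each candidate encoding in turn."""
    if isinstance(value, str):
        return value                      # already text, nothing to decode
    if isinstance(value, bytes):
        for encoding in encodings:
            try:
                return value.decode(encoding)
            except UnicodeDecodeError:
                continue
        # Last resort: keep the row and mark undecodable bytes with U+FFFD.
        return value.decode("utf-8", errors="replace")
    return str(value)                     # numbers and other objects

def format_row(fields):
    """Build one tab-separated output line from mixed text/bytes fields."""
    return "\t".join(to_text(field) for field in fields)

# Hypothetical row mixing text, UTF-8 bytes, GBK bytes and a number.
row = ["uid123", "华为".encode("utf-8"), "联想".encode("gbk"), 1080]
print(format_row(row))
```

Normalizing at one place keeps the print statement free of per-field conversions and makes it easier to see which field carried unexpected bytes.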

--------------------Original blog post follows-----------------------------

[Solved] Python script syntax error: SyntaxError: (unicode error) 'utf8' codec can't decode byte 0xc0 in position 0: invalid start byte

[Background]

A Python script failed when I ran it:

D:\tmp\WordPress\Others\to_wp\hi-baidu-mover_v2>hi-baidu-mover_v2011-12-22-office.py -fhttp://hi.baidu.com/recommend_music/blog/item/de233143bd84211a72f05deb.html -l 1
File "D:\tmp\WordPress\Others\to_wp\hi-baidu-mover_v2\hi-baidu-mover_v2011-12-22-office.py", line 869
cat_no_unicode = opt_no_unicode.replace(u'类别：', '')
SyntaxError: (unicode error) 'utf8' codec can't decode byte 0xc0 in position 0: invalid start byte

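[Solution process]

1. What struck me as odd was that this script had always run fine. Why would it suddenly have a syntax error?

I then noticed that the errors occurred in code containing Chinese characters.

My first reaction was that installing the Python Imaging Library (PIL) might have affected the syntax of my Python 2.7.2 installation, making it behave like Python 3.x.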

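This was because I had found earlier that Python 3.x did not support the form above, that is, u followed by Chinese text, for example:

u'这是中文' (this held for Python 3.0 through 3.2; the u prefix was accepted again from Python 3.3). Python 2.x does support it: the prefix marks the literal as Unicode, and the interpreter handles the decoding automatically to produce a unicode object.

So I tried uninstalling the PIL-1.1.7.win32-py2.7.exe package I had installed earlier. The problem remained.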

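2. There was an earlier version of this script that contained the same piece of code. That version ran normally, while the current one failed with this SyntaxError: (unicode error), which made things even stranger.

I wanted to use Beyond Compare to find out exactly how the two scripts differed, but it was not convenient to download and install it at the time, so I gave up on that idea.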

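3. I reinstalled Python 2.7.2. The problem remained.

4. I rebooted the computer. The problem remained.

5. Later, I replaced the u prefix in front of the Chinese string in the code above with a call to the unicode function, that is: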

cat_no_unicode = opt_no_unicode.replace(unicode('类别：'), '')

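With that change the script ran normally and the syntax error was gone.

6. Only then did it occur to me that the encoding of the script file itself might not match. I checked the encoding Notepad++ was using for the file, and sure enough it was the default ANSI rather than UTF-8, which is why u"类别：" could not be decoded. After choosing Format -> Convert to UTF-8 in Notepad++, saving the file, and running the script again, u"类别：" worked.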
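The diagnosis in step 6 can be reproduced without an editor. The sketch below (Python 3 syntax) encodes the same literal the way a GBK save and a UTF-8 save would, then tries to decode both as UTF-8, which is roughly what the interpreter attempts for a u"..." literal in a source file it treats as UTF-8. That ANSI meant GBK on the author's machine is taken from the summary further down; the helper name decodes_as_utf8 is hypothetical.

```python
text = "类别："

# What ends up on disk depends on the encoding chosen when saving the file.
saved_as_gbk = text.encode("gbk")      # "ANSI" on a Simplified Chinese Windows
saved_as_utf8 = text.encode("utf-8")

def decodes_as_utf8(data):
    """Report whether the bytes form valid UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError as err:
        print("decode failed:", err)
        return False

print(decodes_as_utf8(saved_as_utf8))  # the UTF-8 save reads back cleanly
print(decodes_as_utf8(saved_as_gbk))   # the GBK save is not valid UTF-8
# Reading the GBK bytes with the matching codec recovers the original text.
print(saved_as_gbk.decode("gbk") == text)
```

The bytes on disk are only meaningful together with the encoding used to write them, so the file encoding and the encoding Python assumes for the file have to agree.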

Of course, the unicode("类别：") approach is actually the safest, because it is not affected by the encoding of the script file.

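[Summary]

1. If the strings your Python code handles contain Chinese, you should understand which encoding they use, whether gbk/gb2312/gb18030 or utf-8. Otherwise you will easily get garbled text, or a syntax error like the one here.

To be on the safe side, it is best to work with Chinese strings in the form unicode("中文字符").

And if you need to, and know what you are doing, it is best to convert to utf-8 with unicode("中文字符").encode("utf-8"), since this encoding is the most widely supported.

2. For files created in Notepad++, it is also best to save in the widely used UTF-8 format rather than the default ANSI. Otherwise, because ANSI itself cannot represent Chinese characters, they are encoded with the local code page (GBK in my case), which easily leads to encoding errors.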

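[Postscript 1]

I had thought the understanding above was entirely correct.

It turned out to be wrong.

A real example:

With unicode("下午").encode("utf-8"), the Python script runs through, but the result is wrong: it does not match the UTF-8 "下午" that I scraped from the web page, so the code does not behave as intended.

Only the result of u"下午".encode("utf-8") equals the UTF-8 "下午" scraped from the page, and only then does the code run as expected.

So the conclusion is: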

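[Summary]

1. Save Python script files as UTF-8 whenever possible.

2. For Chinese strings used in a script, whether you need unicode("中文").encode("utf-8") or u"中文".encode("utf-8") is something you have to try for yourself. In my case, at least, only the latter gave the correct result.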
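One way to take the guesswork out of this choice is to compare text with text rather than bytes with bytes: decode the scraped bytes with the encoding of the page, then compare against a text literal. The sketch below uses Python 3 syntax. The variable names and the assumption that the page is UTF-8 are illustrative and do not come from the original script.

```python
# Bytes as they might arrive from a UTF-8 encoded web page (assumed encoding).
scraped = "下午".encode("utf-8")

# The same two characters encoded with a different codec are different bytes,
# so a byte-level comparison only works when both sides share one encoding.
print(scraped == "下午".encode("gbk"))     # False
print(scraped == "下午".encode("utf-8"))   # True

# Decoding first makes the comparison independent of the byte encoding.
print(scraped.decode("utf-8") == "下午")   # True

# Decoding with the wrong codec yields mojibake instead of an error here,
# which is how a script can run through and still produce the wrong result.
mojibake = scraped.decode("gbk")
print(mojibake == "下午")                  # False
```

A mismatch of this kind does not raise an exception, so it tends to surface only as a comparison that never succeeds.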

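[Postscript 2]

1. After testing:

The Chinese string I scraped from the web page, moring_afternoon_zhCN, gives False for isinstance(moring_afternoon_zhCN, unicode), so it is not a unicode object. This is inconsistent with what I described earlier, namely that page content is automatically converted to utf-8 after being processed by BeautifulSoup. I have not yet figured out why.

Moreover, actually comparing a UTF-8 string with this string: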


afternoon_zhCN_utf8 = u"下午".encode("utf-8")
#if unicode(moring_afternoon_zhCN).encode("utf-8") == afternoon_zhCN_utf8:
if unicode(moring_afternoon_zhCN).encode("utf-8") == unicode("下午").encode("utf-8"):  # this line can not execute!!!
    hour = str(int(hour) + 8)

also fails:


 
    File "D:\tmp\WordPress\Others\to_wp\hi-baidu-mover_v2\hi-baidu-mover_v2011-12-22-office.py", line 1169, in parseAndSetEntryDatetime
    if unicode(moring_afternoon_zhCN).encode("utf-8") == unicode("涓嬪崍").encode("utf-8") :  # this line can excute, but result is worng !!!
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

2. Note that isinstance(u"下午".encode("utf-8"), unicode) above evaluates to False, so the result is not a unicode object. u"下午" itself is of type unicode, but once it has been encoded with encode("utf-8") it is an ordinary byte string of type str.
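The type change described in point 2 carries over to Python 3 under different names: Python 2's unicode corresponds to str, and Python 2's str corresponds to bytes. A short sketch in Python 3 syntax:

```python
text = "下午"                      # text (Python 2: unicode)
encoded = text.encode("utf-8")     # encoded bytes (Python 2: str)

print(isinstance(text, str))       # True
print(isinstance(encoded, str))    # False: encode() returns bytes, not text
print(isinstance(encoded, bytes))  # True
print(len(text), len(encoded))     # 2 characters become 6 bytes in UTF-8
```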

Reposted from: http://againinput4.blog.163.com/blog/static/1727994912011112224749861/