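Typical text analysis, such as sentiment analysis, mainly computes positive and negative sentiment scores for a text.
When a text is rich in distinct emotions such as joy, anger, and sorrow, a finer-grained emotion analysis becomes possible. I previously shared the NRC word-emotion lexicon and the word-colour lexicon, but never showed how to use them.
Today we work with two pieces of data: a Weibo dataset and the NRC lexicon. The full simplifyweibo_4_moods.csv is too large, so we use the small sample small_simplifyweibo_4_moods.csv.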
import pandas as pd
df = pd.read_csv('small_simplifyweibo_4_moods.csv')
df.head()
Check how the four moods are distributed:
import matplotlib.pyplot as plt
df.label.value_counts().plot(kind='pie')
plt.show()
The NRC lexicon was compiled by the Institute for Information Technology, National Research Council Canada, from crowdsourced annotations.
https://www./WebPages/NRC-Emotion-Lexicon.htm
Reference: Mohammad, Saif M., and Peter D. Turney. "Crowdsourcing a word–emotion association lexicon." Computational Intelligence 29, no. 3 (2013): 436-465.
Next we read NRC-Emotion-Lexicon-v0.92-InManyLanguages-web.xlsx:
import pandas as pd
lexion_df = pd.read_excel('NRC-Emotion-Lexicon-v0.92-InManyLanguages-web.xlsx')
lexion_df.head()
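The lexicon covers 41 languages. Only the English words were annotated; the entries for the other languages were produced by running the English words through Google Translate.
lexion_df.columns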
Index(['English Word', 'Arabic Translation (Google Translate)',
'Basque Translation (Google Translate)',
'Bengali Translation (Google Translate)',
'Catalan Translation (Google Translate)',
'Chinese (simplified) Translation (Google Translate)',
'Chinese (traditional) Translation (Google Translate)',
'Danish Translation (Google Translate)',
'Dutch Translation (Google Translate)',
'Esperanto Translation (Google Translate)',
'Finnish Translation (Google Translate)',
'French Translation (Google Translate)',
'German Translation (Google Translate)',
'Greek Translation (Google Translate)',
'Gujarati Translation (Google Translate)',
'Hebrew Translation (Google Translate)',
'Hindi Translation (Google Translate)',
'Irish Translation (Google Translate)',
'Italian Translation (Google Translate)',
'Japanese Translation (Google Translate)',
'Latin Translation (Google Translate)',
'Marathi Translation (Google Translate)',
'Persian Translation (Google Translate)',
'Portuguese Translation (Google Translate)',
'Romanian Translation (Google Translate)',
'Russian Translation (Google Translate)',
'Somali Translation (Google Translate)',
'Spanish Translation (Google Translate)',
'Sudanese Translation (Google Translate)',
'Swahili Translation (Google Translate)',
'Swedish Translation (Google Translate)',
'Tamil Translation (Google Translate)',
'Telugu Translation (Google Translate)',
'Thai Translation (Google Translate)',
'Turkish Translation (Google Translate)',
'Ukranian Translation (Google Translate)',
'Urdu Translation (Google Translate)',
'Vietnamese Translation (Google Translate)',
'Welsh Translation (Google Translate)',
'Yiddish Translation (Google Translate)',
'Zulu Translation (Google Translate)', 'Positive', 'Negative', 'Anger',
'Anticipation', 'Disgust', 'Fear', 'Joy', 'Sadness', 'Surprise',
'Trust'],
dtype='object')
Keep only the Simplified Chinese column and the ten label columns:
chinese_df = lexion_df[['Chinese (simplified) Translation (Google Translate)', 'Positive', 'Negative',
'Anger','Anticipation', 'Disgust', 'Fear', 'Joy', 'Sadness', 'Surprise', 'Trust']]
chinese_df.head()
Build the emotion word lists:
Positive = []
Negative = []
Anger = []
Anticipation = []
Disgust = []
Fear = []
Joy = []
Sadness = []
Surprise = []
Trust = []
# Name of the column holding the Simplified Chinese translation
zh_col = 'Chinese (simplified) Translation (Google Translate)'
for idx, row in chinese_df.iterrows():
    word = row[zh_col]
    # A word may carry several labels, so every label is checked independently
    if row['Positive'] == 1:
        Positive.append(word)
    if row['Negative'] == 1:
        Negative.append(word)
    if row['Anger'] == 1:
        Anger.append(word)
    if row['Anticipation'] == 1:
        Anticipation.append(word)
    if row['Disgust'] == 1:
        Disgust.append(word)
    if row['Fear'] == 1:
        Fear.append(word)
    if row['Joy'] == 1:
        Joy.append(word)
    if row['Sadness'] == 1:
        Sadness.append(word)
    if row['Surprise'] == 1:
        Surprise.append(word)
    if row['Trust'] == 1:
        Trust.append(word)
print('Word lists built')
Word lists built
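The ten lists above could also be built in one pass as a dictionary of sets, which makes the later `word in ...` lookups faster and drops rows with a missing translation. This is only a sketch: `build_lexicon`, `ZH_COL`, `EMOTIONS`, and the toy rows are hypothetical names and data introduced for illustration, not part of the original lexicon or code.

```python
import pandas as pd

ZH_COL = 'Chinese (simplified) Translation (Google Translate)'
EMOTIONS = ['Positive', 'Negative', 'Anger', 'Anticipation', 'Disgust',
            'Fear', 'Joy', 'Sadness', 'Surprise', 'Trust']

def build_lexicon(frame, word_col=ZH_COL, labels=EMOTIONS):
    """Return {label: set of words flagged with 1 for that label}."""
    lexicon = {}
    for label in labels:
        # Boolean mask selects the rows flagged for this label
        words = frame.loc[frame[label] == 1, word_col].dropna()
        lexicon[label] = set(words.astype(str))
    return lexicon

# Toy stand-in for chinese_df (hypothetical rows, not copied from the real file)
toy_df = pd.DataFrame(0, index=range(4), columns=EMOTIONS)
toy_df.insert(0, ZH_COL, ['快乐', '害怕', '信任', None])
toy_df.loc[0, ['Positive', 'Joy']] = 1
toy_df.loc[1, ['Negative', 'Fear']] = 1
toy_df.loc[2, ['Positive', 'Trust']] = 1
toy_df.loc[3, 'Positive'] = 1  # missing translation, removed by dropna()

lexicon = build_lexicon(toy_df)
print(sorted(lexicon['Positive']))
```

With the real data one would call `build_lexicon(chinese_df)` instead of using the toy frame.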
import jieba
import time
def emotion_caculate(text):
    # Counters for the two sentiment labels and the eight emotion labels
    positive = 0
    negative = 0
    anger = 0
    anticipation = 0
    disgust = 0
    fear = 0
    joy = 0
    sadness = 0
    surprise = 0
    trust = 0
    # Segment the text with jieba, then count each distinct word once
    wordlist = jieba.lcut(text)
    wordset = set(wordlist)
    for word in wordset:
        freq = wordlist.count(word)
        # A word can belong to several lists, so each list is checked separately
        if word in Positive:
            positive += freq
        if word in Negative:
            negative += freq
        if word in Anger:
            anger += freq
        if word in Anticipation:
            anticipation += freq
        if word in Disgust:
            disgust += freq
        if word in Fear:
            fear += freq
        if word in Joy:
            joy += freq
        if word in Sadness:
            sadness += freq
        if word in Surprise:
            surprise += freq
        if word in Trust:
            trust += freq
    emotion_info = {
        'positive': positive,
        'negative': negative,
        'anger': anger,
        'anticipation': anticipation,
        'disgust': disgust,
        'fear': fear,
        'joy': joy,
        'sadness': sadness,
        'surprise': surprise,
        'trust': trust,
        'length': len(wordlist)
    }
    # Fix the order of the fields in the returned Series
    indexs = ['length', 'positive', 'negative', 'anger', 'anticipation', 'disgust', 'fear', 'joy', 'sadness', 'surprise', 'trust']
    return pd.Series(emotion_info, index=indexs)
emotion_caculate(text='这个国家再对这些制造假冒伪劣食品药品的人手软的话,那后果真的会相当糟糕。坐牢?从快判个死刑')
length 25
positive 1
negative 2
anger 1
anticipation 0
disgust 1
fear 1
joy 0
sadness 1
surprise 0
trust 2
dtype: int64
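The counting logic of the function above can be sketched more compactly with `collections.Counter`. The version below takes an already segmented token list (assumed to come from something like `jieba.lcut`) so that it runs without jieba installed. `score_tokens` and `toy_lexicon` are hypothetical names with made-up entries, not the NRC data.

```python
from collections import Counter

def score_tokens(tokens, lexicon):
    """Count how many tokens fall into each label's word set."""
    counts = Counter(tokens)            # word -> number of occurrences
    scores = {'length': len(tokens)}    # total token count, as in the original
    for label, words in lexicon.items():
        scores[label.lower()] = sum(n for w, n in counts.items() if w in words)
    return scores

# Hypothetical mini lexicon: label -> set of words
toy_lexicon = {'Positive': {'快乐'}, 'Fear': {'害怕', '糟糕'}}

# Tokens as a segmenter might return them (assumed, not real jieba output)
tokens = ['我', '很', '害怕', ',', '真的', '害怕', ',', '后果', '糟糕']

print(score_tokens(tokens, toy_lexicon))
```

Looping over labels rather than writing one `if` per emotion keeps the function the same length no matter how many labels the lexicon has.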
start = time.time()
# df['review'] as a whole is a Series,
# and a Series has an apply method
emotion_df = df['review'].apply(emotion_caculate)
end = time.time()
print(end-start)
emotion_df.head()
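Because the function returns a Series for every review, apply turns the Series into a DataFrame. For details on apply, see the article "Understanding the roles and differences of apply and map in pandas".
Merge the original data with the analysis results and write them to a new CSV file:
output_df = pd.concat([df, emotion_df], axis=1)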
output_df.to_csv('output.csv', index=False)
output_df.head()
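The Series-to-DataFrame step can be seen in isolation on a tiny example. This is a sketch on made-up strings; `toy_features` is a hypothetical helper, not the emotion function used above.

```python
import pandas as pd

def toy_features(text):
    # Returning a Series makes apply build one column per Series key
    return pd.Series({'length': len(text), 'exclaims': text.count('!')})

reviews = pd.Series(['good!', 'bad', 'wow!!'], name='review')

features = reviews.apply(toy_features)               # DataFrame, one row per review
combined = pd.concat([reviews, features], axis=1)    # align on the shared index

print(type(features).__name__)
print(combined)
```

`pd.concat(..., axis=1)` aligns rows by index, so it relies on the original data and the apply result sharing the same index, which holds here because apply preserves it.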
Let's spot-check the results and see which reviews score
highest on fear, positive, and negative.
fear = output_df.sort_values(by='fear',ascending=False)
fear.head()
# What on earth is this?
fear = output_df.sort_values(by='fear',ascending=False)
print(fear.iloc[0, :]['review'])
神哪,为什么咧~ ~ ~ 难道我要一辈子工作狂下去么。。。
【12星座内心依赖症】★白羊—工作依赖★金牛—味觉依赖★
双子—用脑依赖★巨蟹—收藏依赖★狮子—争夺依赖★处女—清洁依赖
★天秤—交友依赖★天蝎—身体依赖★射手—跳槽依赖★魔羯
—自我批评依赖★水瓶—友情依赖★双鱼—伤情依赖
negative = output_df.sort_values(by='negative',ascending=False)
print(negative.iloc[0, :]['review'])
这个图原来这么熟悉!我一条都没有。原来有一种病叫“恋爱恐惧症”,你有吗?
症状一:怕爱上别人
症状二:怕爱上别人后会深陷
症状三:怕受伤
症状四:怕被拒绝
症状五:怕在最爱的当下失去
症状六:怕恋爱让人失去自我
症状七:怕伤害别人
症状八:怕自己丢失一颗爱自由的心
症状九:怕恋爱后再也回不到以前
症状十:怕自己爱对方比对方爱自己还多
positive = output_df.sort_values(by='positive',ascending=False)
print(positive.iloc[0, :]['review'])
《劳动合同法》确有荒诞之处,但就此认为中国经济必须继续依靠廉价劳动力,就更荒诞了。
提高劳动力价格,会使不少企业不堪成本破产,继而减少工作岗位,最终对劳动者不利。
但关键不在继续压低劳动力价格,而在减轻企业别的负担:沉重税收各项收费政策不公承担本该ZF
承担的责“《劳动合同法》实施之前,集团公司请了一位劳动法专家给我们讲课,这位专家自己也有一家小公司。
专家痛心疾首地说,劳动力价格便宜是中国最大的核心竞争力,都像劳动合同法这样瞎搞,
这个优势很快就会不复存在,中国的经济发展必将受到阻碍。”
The most positive review does not look right, so let's check the second most positive one:
positive = output_df.sort_values(by='positive',ascending=False)
print(positive.iloc[1, :]['review'])
真正懂得欣赏美味的人,一定是懂得生活的人。谢谢老师的分享,遇见一本优雅诚恳的好书是我们的幸运,
也欢迎你多分享旅途中的精神食粮给大家!中信书店正策划一系列美味文化的活动,有机会真诚邀请叶老师一起合作哦!
呵~發 現 有書 店微博今天選 讀 我的《極 致之味》。感謝 了!美味。
葡萄酒要伊甘酒庄还是玛歌酒庄?火腿是西班牙的伊比利火腿,还是意大利吉贝罗火腿?
盐用布列塔尼盖朗地区的盐之花,还是冲绳的粟国之盐?
《极致之味》中会告诉你如何一层层抽丝剥茧地追索视觉嗅觉味觉触觉上每一分毫的微妙变化。
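That wraps up the analysis.
The NRC lexicon works best on English data; after all, it was annotated by native English speakers on English words in English contexts. Other languages can still be analysed with it, but because their entries are translated from English, some problems are to be expected.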
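One way to get a rough feel for such translation problems is to look for entries that appear to be untranslated, or for translations shared by several English words. The sketch below runs on a hypothetical four-row frame, not the real lexicon, and these two checks are only assumptions about what might go wrong, not an exhaustive audit.

```python
import pandas as pd

EN_COL = 'English Word'
ZH_COL = 'Chinese (simplified) Translation (Google Translate)'

# Hypothetical rows for illustration only
toy = pd.DataFrame({
    EN_COL: ['happy', 'glad', 'abacus', 'afraid'],
    ZH_COL: ['高兴', '高兴', 'abacus', '害怕'],
})

# Translation identical to the English word: probably left untranslated
untranslated = toy[toy[ZH_COL] == toy[EN_COL]]

# Same translation for several English words: their labels get merged
shared = toy[toy.duplicated(ZH_COL, keep=False)]

print(untranslated[EN_COL].tolist())
print(shared[EN_COL].tolist())
```

On the real data the same two filters could be applied to `lexion_df` to decide which entries deserve a manual look.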