[Original] How to run emotion analysis (fine-grained sentiment analysis) on Weibo posts

大邓的Python 2021-02-23

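Emotion analysis of text

Common text analyses such as sentiment analysis mainly compute a positive score and a negative score for each text.

When a text carries several distinct emotions, such as joy, anger or sadness, a finer-grained emotion analysis becomes possible. I previously shared the NRC word-emotion lexicon and the word-colour lexicon, but never showed how to use them.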

Today we use two resources:

Dataset: simplifyweibo_4_moods.csv
Lexicon: the NRC lexicon, which covers 8 emotions such as joy, anger and sadness

Reading the Weibo data

The full simplifyweibo_4_moods.csv file is too large, so we work with a small sample, small_simplifyweibo_4_moods.csv.

import pandas as pd
df = pd.read_csv('small_simplifyweibo_4_moods.csv')
df.head()
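
The post does not say how the small sample was produced. One plausible way, shown here as a minimal sketch, is to draw a reproducible random sample with pandas. The toy data, the `review` column name, the 10% fraction and the temporary file names are all assumptions made for illustration; only the `label` column appears in the article.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the full file; "review" is an illustrative column name.
full = pd.DataFrame({
    "label": [0, 1, 2, 3] * 25,
    "review": [f"text {i}" for i in range(100)],
})

with tempfile.TemporaryDirectory() as tmp:
    full_path = os.path.join(tmp, "full.csv")
    small_path = os.path.join(tmp, "small.csv")
    full.to_csv(full_path, index=False)

    # Draw a reproducible 10% random sample and save it as the small file.
    big_df = pd.read_csv(full_path)
    small_df = big_df.sample(frac=0.1, random_state=42)
    small_df.to_csv(small_path, index=False)

    # Read the small file back, as the article does with its own sample.
    reloaded = pd.read_csv(small_path)

print(len(reloaded), list(reloaded.columns))
```

Fixing `random_state` keeps the sample identical between runs, which makes later results easier to compare.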

Check how the four emotion labels are distributed

import matplotlib.pyplot as plt
df.label.value_counts().plot(kind='pie')
plt.show()
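
A pie chart shows proportions but not exact numbers. As a small complement, the sketch below tabulates counts and shares per label. The toy labels are made up for illustration; on the real data the same calls would be applied to `df.label`.

```python
import pandas as pd

# Made-up labels standing in for df.label.
toy = pd.DataFrame({"label": [0, 0, 0, 1, 1, 2, 3, 3]})

# Absolute counts and relative shares, ordered by label value.
counts = toy["label"].value_counts().sort_index()
shares = toy["label"].value_counts(normalize=True).sort_index()

summary = pd.DataFrame({"count": counts, "share": shares})
print(summary)
```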

The NRC emotion lexicon

The NRC lexicon was compiled by the Institute for Information Technology, National Research Council Canada, with annotations collected through crowdsourcing.

https://www./WebPages/NRC-Emotion-Lexicon.htm

Reference
Mohammad, Saif M., and Peter D. Turney. "Crowdsourcing a word–emotion association lexicon." Computational Intelligence 29, no. 3 (2013): 436-465.

Next we read NRC-Emotion-Lexicon-v0.92-InManyLanguages-web.xlsx

import pandas as pd
lexion_df = pd.read_excel('NRC-Emotion-Lexicon-v0.92-InManyLanguages-web.xlsx')
lexion_df.head()

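Supported languages

The lexicon covers 41 languages, including:

English
French
Arabic
German
Russian
Chinese (Simplified and Traditional)

Only the English entries were annotated; the entries for the other languages were produced by translating the corresponding English words with Google Translate.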

lexion_df.columns

Index(['English Word', 'Arabic Translation (Google Translate)',
'Basque Translation (Google Translate)',
'Bengali Translation (Google Translate)',
'Catalan Translation (Google Translate)',
'Chinese (simplified) Translation (Google Translate)',
'Chinese (traditional) Translation (Google Translate)',
'Danish Translation (Google Translate)',
'Dutch Translation (Google Translate)',
'Esperanto Translation (Google Translate)',
'Finnish Translation (Google Translate)',
'French Translation (Google Translate)',
'German Translation (Google Translate)',
'Greek Translation (Google Translate)',
'Gujarati Translation (Google Translate)',
'Hebrew Translation (Google Translate)',
'Hindi Translation (Google Translate)',
'Irish Translation (Google Translate)',
'Italian Translation (Google Translate)',
'Japanese Translation (Google Translate)',
'Latin Translation (Google Translate)',
'Marathi Translation (Google Translate)',
'Persian Translation (Google Translate)',
'Portuguese Translation (Google Translate)',
'Romanian Translation (Google Translate)',
'Russian Translation (Google Translate)',
'Somali Translation (Google Translate)',
'Spanish Translation (Google Translate)',
'Sudanese Translation (Google Translate)',
'Swahili Translation (Google Translate)',
'Swedish Translation (Google Translate)',
'Tamil Translation (Google Translate)',
'Telugu Translation (Google Translate)',
'Thai Translation (Google Translate)',
'Turkish Translation (Google Translate)',
'Ukranian Translation (Google Translate)',
'Urdu Translation (Google Translate)',
'Vietnamese Translation (Google Translate)',
'Welsh Translation (Google Translate)',
'Yiddish Translation (Google Translate)',
'Zulu Translation (Google Translate)', 'Positive', 'Negative', 'Anger',
'Anticipation', 'Disgust', 'Fear', 'Joy', 'Sadness', 'Surprise',
'Trust'],
dtype='object')
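
The column names follow a regular pattern, so the language columns and the label columns can be separated programmatically rather than typed out by hand. The sketch below works on a shortened, hand-copied subset of the column list. `SUFFIX` and `split_columns` are hypothetical names introduced here for illustration.

```python
# A shortened subset of the columns printed above.
columns = [
    "English Word",
    "Chinese (simplified) Translation (Google Translate)",
    "French Translation (Google Translate)",
    "Positive", "Negative", "Anger",
]

SUFFIX = " Translation (Google Translate)"


def split_columns(columns):
    """Return (translated language names, label column names)."""
    translation_cols = [c for c in columns if c.endswith(SUFFIX)]
    # Strip the shared suffix to recover the bare language name.
    languages = [c[: -len(SUFFIX)] for c in translation_cols]
    label_cols = [
        c for c in columns
        if c not in translation_cols and c != "English Word"
    ]
    return languages, label_cols


languages, label_cols = split_columns(columns)
print(languages)
print(label_cols)
```

Applied to the full column list, the same function should return the 40 translation columns shown in the output above, which together with English gives the 41 languages.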

Building the Chinese emotion word lists

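From the full table we keep only the Simplified Chinese translation column and the ten label columns (two polarities plus eight emotions).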

chinese_df = lexion_df[['Chinese (simplified) Translation (Google Translate)', 'Positive', 'Negative',
'Anger','Anticipation', 'Disgust', 'Fear', 'Joy', 'Sadness', 'Surprise', 'Trust']]
chinese_df.head()

Building the word lists

Positive = []
Negative = []
Anger = []
Anticipation = []
Disgust = []
Fear = []
Joy = []
Sadness = []
Surprise = []
Trust = []
# Add each Chinese word to the list of every category it is flagged for (value 1).
for idx, row in chinese_df.iterrows():
    if row['Positive']==1:
        Positive.append(row['Chinese (simplified) Translation (Google Translate)'])
    if row['Negative']==1:
        Negative.append(row['Chinese (simplified) Translation (Google Translate)'])
    if row['Anger']==1:
        Anger.append(row['Chinese (simplified) Translation (Google Translate)'])
    if row['Anticipation']==1:
        Anticipation.append(row['Chinese (simplified) Translation (Google Translate)'])
    if row['Disgust']==1:
        Disgust.append(row['Chinese (simplified) Translation (Google Translate)'])
    if row['Fear']==1:
        Fear.append(row['Chinese (simplified) Translation (Google Translate)'])
    if row['Joy']==1:
        Joy.append(row['Chinese (simplified) Translation (Google Translate)'])
    if row['Sadness']==1:
        Sadness.append(row['Chinese (simplified) Translation (Google Translate)'])
    if row['Surprise']==1:
        Surprise.append(row['Chinese (simplified) Translation (Google Translate)'])
    if row['Trust']==1:
        Trust.append(row['Chinese (simplified) Translation (Google Translate)'])
print('Word lists built')

Word lists built
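
The ten lists can also be built in one pass with boolean filtering, storing each category as a set. This is an alternative sketch, not the article's method. `build_lexicon` is a hypothetical helper, and the toy rows below are made up rather than taken from the NRC file. Sets also drop duplicate and missing translations, and they make the later `word in ...` checks faster.

```python
import pandas as pd

COL = "Chinese (simplified) Translation (Google Translate)"
LABELS = ["Positive", "Negative", "Joy"]  # subset of the ten label columns

# Made-up rows for illustration; one translation is missing on purpose.
toy_df = pd.DataFrame({
    COL: ["快乐", "愤怒", None, "快乐"],
    "Positive": [1, 0, 1, 1],
    "Negative": [0, 1, 0, 0],
    "Joy": [1, 0, 0, 1],
})


def build_lexicon(frame, word_col, label_cols):
    """Map each label to the set of words flagged with 1 for that label."""
    lexicon = {}
    for label in label_cols:
        words = frame.loc[frame[label] == 1, word_col].dropna()
        lexicon[label] = set(words)
    return lexicon


lexicon = build_lexicon(toy_df, COL, LABELS)
print(lexicon)
```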

Writing the text emotion scoring function

import jieba
import time

def emotion_caculate(text):
    # One counter per NRC category.
    positive = 0
    negative = 0
    anger = 0
    anticipation = 0
    disgust = 0
    fear = 0
    joy = 0
    sadness = 0
    surprise = 0
    trust = 0
    # Segment the text into words with jieba.
    wordlist = jieba.lcut(text)
    wordset = set(wordlist)
    for word in wordset:
        # Number of times this word occurs in the text.
        freq = wordlist.count(word)
        if word in Positive:
            positive+=freq
        if word in Negative:
            negative+=freq
        if word in Anger:
            anger+=freq
        if word in Anticipation:
            anticipation+=freq
        if word in Disgust:
            disgust+=freq
        if word in Fear:
            fear+=freq
        if word in Joy:
            joy+=freq
        if word in Sadness:
            sadness+=freq
        if word in Surprise:
            surprise+=freq
        if word in Trust:
            trust+=freq
    emotion_info = {
        'positive': positive,
        'negative': negative,
        'anger': anger,
        'anticipation': anticipation,
        'disgust': disgust,
        'fear':fear,
        'joy':joy,
        'sadness':sadness,
        'surprise':surprise,
        'trust':trust,
        'length':len(wordlist)
    }
    # Return the counts as a Series, with the word count of the text first.
    indexs = ['length', 'positive', 'negative', 'anger', 'anticipation','disgust','fear','joy','sadness','surprise','trust']
    return pd.Series(emotion_info, index=indexs)
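
The same counting idea can be written more compactly with `collections.Counter` and a dictionary of word sets. The sketch below takes an already tokenized text so that it runs without jieba. `score_tokens`, `toy_lexicon` and the sample tokens are hypothetical and made up for illustration, not part of the article's code.

```python
from collections import Counter

import pandas as pd

# Made-up mini lexicon: category name -> set of words.
toy_lexicon = {
    "positive": {"开心", "喜欢"},
    "joy": {"开心"},
    "anger": {"生气"},
}


def score_tokens(tokens, lexicon):
    """Count how many tokens fall into each lexicon category."""
    freq = Counter(tokens)  # word -> number of occurrences
    scores = {"length": len(tokens)}
    for emotion, words in lexicon.items():
        scores[emotion] = sum(n for word, n in freq.items() if word in words)
    return pd.Series(scores)


# Tokens as a word segmenter might return them for a short sentence.
tokens = ["我", "今天", "很", "开心", "，", "真的", "开心"]
result = score_tokens(tokens, toy_lexicon)
print(result)
```

In the article's setting the tokens would come from `jieba.lcut(text)`. Dividing each count by `length` would give a share that is comparable across texts of different lengths.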