Classification with Naive Bayes, Linear SVM, Logistic Regression, and Random Forest

niudp 2018-12-24


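In this digital age, data is everywhere, and on the internet most of it comes in the form of text. In this article we explore natural language processing (NLP).

Unlike numeric data, text is hard to work with: mathematical models cannot be applied to it directly. Let's see how NLP and a few basic machine learning techniques get around that. This article concentrates on the practical implementation rather than on the theory or mathematics behind the techniques.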

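Step 1: Choose a dataset

As with most machine learning programs, we first need data. You can get text data from any website, such as a movie review site or Amazon product reviews.

Here I will use a labeled text dataset, which can be downloaded here (https://www./wp-content/uploads/2016/07/text_emotion.csv).

We import the basic libraries and then read the dataset.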

#%%
import pandas as pd
import numpy as np

data = pd.read_csv('text_emotion.csv')


Step 2: Understand what is in the dataset

This is a simple dataset with only four columns: the tweet ID, the emotion depicted by the tweet, the author, and the text content of the tweet.

We do not need the author column, so we can drop it.

data = data.drop('author', axis=1)

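The dataset contains 40,000 tweets in total, labeled with 13 different emotions. Our task is to build a machine learning model that, once trained, can accurately identify the emotion a tweet depicts.

To keep this tutorial simple, we consider only two of these emotions, "happiness" and "sadness" (about 10,000 tweets in total across the whole sample). We can therefore drop the rows that carry any other label.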

# Dropping rows with other emotion labels
data = data.drop(data[data.sentiment == 'anger'].index)
data = data.drop(data[data.sentiment == 'boredom'].index)
data = data.drop(data[data.sentiment == 'enthusiasm'].index)
data = data.drop(data[data.sentiment == 'empty'].index)
data = data.drop(data[data.sentiment == 'fun'].index)
data = data.drop(data[data.sentiment == 'relief'].index)
data = data.drop(data[data.sentiment == 'surprise'].index)
data = data.drop(data[data.sentiment == 'love'].index)
data = data.drop(data[data.sentiment == 'hate'].index)
data = data.drop(data[data.sentiment == 'neutral'].index)
data = data.drop(data[data.sentiment == 'worry'].index)


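Remember that if you are missing any library used in this tutorial, you can always install it with the pip install command from the Anaconda Prompt.

Step 3: Preprocess the data

Clearly, we cannot do arithmetic on text, yet machine learning models are all mathematical models.

So how do we turn all this text into numeric data? Bear in mind that we have to cope with countless word combinations and special characters, not to mention jargon and slang for which even a dictionary is no reference.

First, let's bring some uniformity to the text by lowercasing everything and removing punctuation and stop words (such as prepositions).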

# Making all letters lowercase
data['content'] = data['content'].apply(lambda x: ' '.join(w.lower() for w in x.split()))

# Removing punctuation and symbols (regex=True is needed in recent pandas versions)
data['content'] = data['content'].str.replace(r'[^\w\s]', ' ', regex=True)

# Removing stop words using NLTK (run nltk.download('stopwords') once if the corpus is missing)
from nltk.corpus import stopwords
stop = stopwords.words('english')
data['content'] = data['content'].apply(lambda x: ' '.join(w for w in x.split() if w not in stop))
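
To make the three cleaning steps concrete without needing NLTK, here is a minimal sketch in plain Python. The function name clean_text and the tiny STOP_WORDS set are hypothetical stand-ins chosen for illustration; the pipeline above uses NLTK's much longer English stop word list.

```python
import re

# Hypothetical, deliberately tiny stop word list (NLTK's English list is far longer).
STOP_WORDS = {"i", "am", "the", "a", "is", "so", "of", "to"}

def clean_text(text):
    """Lowercase, replace punctuation with spaces, and drop stop words."""
    text = text.lower()
    # Anything that is neither a word character nor whitespace becomes a space.
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

print(clean_text("I am SO happy, today!"))  # happy today
```

Lowercasing has to come before the stop word filter, otherwise "I" and "SO" would not match the lowercase entries in the list.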


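To get any meaningful insight, every word has to be reduced to its root form: inflected variants in the text (plurals, past tense, and so on) must all be converted to the base word they represent. This is called lemmatisation. In addition, I added code that collapses repeated letters within a word, on the assumption that hardly any word has the same letter more than twice in a row. It is not very accurate, but it helps with some corrections.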

# Lemmatisation (TextBlob relies on the WordNet corpus)
from textblob import Word
data['content'] = data['content'].apply(lambda x: ' '.join([Word(word).lemmatize() for word in x.split()]))

# Correcting letter repetitions
import re

def de_repeat(text):
    # Collapse any character repeated three or more times down to two
    pattern = re.compile(r'(.)\1{2,}')
    return pattern.sub(r'\1\1', text)

#%%
data['content'] = data['content'].apply(lambda x: ' '.join(de_repeat(w) for w in x.split()))
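
A quick way to see what the repetition rule does, and where it falls short, is to run it on a few words. The sketch below repeats the same regular expression in a standalone form; the sample words are made up for illustration.

```python
import re

def de_repeat(text):
    # A character followed by two or more copies of itself (three or more in a row)
    # is collapsed to exactly two copies.
    return re.sub(r"(.)\1{2,}", r"\1\1", text)

for word in ["soooo", "haaappy", "good", "yessss"]:
    print(word, "->", de_repeat(word))
# soooo -> soo
# haaappy -> haappy
# good -> good
# yessss -> yess
```

Legitimate double letters such as the "oo" in "good" survive, but "haaappy" becomes "haappy" rather than "happy", which is the kind of inaccuracy mentioned above.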


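The next consideration: if a word appears only once in the whole data sample, it probably has no bearing on the emotion of the text. We can therefore remove all rarely occurring words from the dataset; these are usually proper nouns and other words that are irrelevant in the current context.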

# Find the 10,000 rarest words appearing in the data
freq = pd.Series(' '.join(data['content']).split()).value_counts()[-10000:]

# Remove those rarely appearing words from the data (a set keeps the lookup fast)
freq = set(freq.index)
data['content'] = data['content'].apply(lambda x: ' '.join(w for w in x.split() if w not in freq))
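
The code above cuts a fixed number of words from the tail of the frequency table. A variant that follows the stated idea more literally is to drop every word below a minimum count. The sketch below does that in plain Python; remove_rare_words and its min_count parameter are hypothetical names, and this is an alternative to the code above, not a description of it.

```python
from collections import Counter

def remove_rare_words(docs, min_count=2):
    """Drop every word that occurs fewer than min_count times across all docs."""
    counts = Counter(word for doc in docs for word in doc.split())
    return [" ".join(w for w in doc.split() if counts[w] >= min_count) for doc in docs]

docs = ["happy day today", "sad day", "happy birthday bob"]
print(remove_rare_words(docs))  # ['happy day', 'day', 'happy']
```

Note that on such a tiny sample the rule also removes "sad", which shows why a frequency cut only makes sense on a corpus large enough for emotion words to recur.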


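The next big challenge with natural text is spelling mistakes, especially in tweets. And how do we handle sarcasm? Given how complex these problems are, let's ignore them for now.

As a further extension, one could replace words with their most common synonyms, which can help build a better model. That is also left out here.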



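Step 4: Feature extraction

Once the text data is clean, accurate, and error-free, each tweet is represented by a set of keywords. Now we need to perform "feature extraction", that is, derive numeric parameters from the data. In this article we consider two of them, TF-IDF and count vectors (remember, we need numeric data!).

Before extracting features, split the data into a training set and a test set.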

# Encoding output labels: 'sadness' as 1 and 'happiness' as 0
from sklearn import preprocessing
lbl_enc = preprocessing.LabelEncoder()
y = lbl_enc.fit_transform(data.sentiment.values)

# Splitting into training and testing data in a 90:10 ratio
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(data.content.values, y, stratify=y, random_state=42, test_size=0.1, shuffle=True)
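
Why does "happiness" end up as 0 and "sadness" as 1? LabelEncoder numbers the classes in sorted order. The plain-Python sketch below imitates that behaviour; encode_labels is a hypothetical helper written for illustration, not part of scikit-learn.

```python
def encode_labels(labels):
    """Number the distinct labels in sorted order and encode the input with them."""
    classes = sorted(set(labels))
    mapping = {label: index for index, label in enumerate(classes)}
    return mapping, [mapping[label] for label in labels]

mapping, encoded = encode_labels(["sadness", "happiness", "happiness", "sadness"])
print(mapping)  # {'happiness': 0, 'sadness': 1}
print(encoded)  # [1, 0, 0, 1]
```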


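TF-IDF: this measure expresses the relative importance of a term in the data by weighing how often it occurs in a text against how common it is across all texts. It can be extracted directly in Python as follows: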

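# Extracting TF-IDF parameters
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(max_features=1000, analyzer='word', ngram_range=(1,3))
X_train_tfidf = tfidf.fit_transform(X_train)
# Reuse the vocabulary learned from the training set: transform, not fit_transform.
# Refitting on X_val would build a different vocabulary and misalign the feature columns.
# The TF-IDF accuracies quoted in this article came from a run that refit on X_val.
X_val_tfidf = tfidf.transform(X_val)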
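
To show what the numbers in a TF-IDF matrix mean, here is a simplified sketch in plain Python. It uses the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalisation, which to my understanding matches the TfidfVectorizer defaults; it handles single words only and ignores the n-gram and max_features settings used above. The names fit_idf and transform are hypothetical.

```python
import math
from collections import Counter

def fit_idf(docs):
    """Learn a vocabulary and smoothed IDF weights from the training documents."""
    n = len(docs)
    # Document frequency: in how many documents does each word appear?
    df = Counter(word for doc in docs for word in set(doc.split()))
    return {word: math.log((1 + n) / (1 + d)) + 1 for word, d in df.items()}

def transform(doc, idf):
    """Turn one document into an L2-normalised TF-IDF dict, using only known words."""
    tf = Counter(w for w in doc.split() if w in idf)
    weights = {w: count * idf[w] for w, count in tf.items()}
    norm = math.sqrt(sum(v * v for v in weights.values()))
    return {w: v / norm for w, v in weights.items()} if norm else {}

train = ["happy day", "sad day", "happy happy smile"]
idf = fit_idf(train)
print(idf["day"] < idf["smile"])            # True: the rarer word gets the higher weight
print(transform("sad rainy day", idf))      # "rainy" was never seen in training and is ignored
```

The weights are learned once from the training documents and then reused for any new text, which is the same fit-then-transform split the vectorizer expects.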


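Count vectors: this is the other feature we consider. As the name suggests, each tweet is converted into an array holding the number of occurrences of each word. The intuition is that texts conveying similar emotions tend to repeat the same words over and over.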

# Extracting count vector parameters
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer(analyzer='word')
# The vocabulary is built from the whole dataset, training and validation rows alike
count_vect.fit(data['content'])
X_train_count = count_vect.transform(X_train)
X_val_count = count_vect.transform(X_val)
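
A count vector is simple enough to build by hand. The sketch below is a plain-Python illustration with hypothetical helper names (fit_vocabulary, count_vector); CountVectorizer does the same job on sparse matrices and with its own tokenisation rules.

```python
from collections import Counter

def fit_vocabulary(docs):
    """Map each distinct word to a column index, in alphabetical order."""
    words = sorted({w for doc in docs for w in doc.split()})
    return {w: i for i, w in enumerate(words)}

def count_vector(doc, vocab):
    """Count occurrences of each vocabulary word; unknown words are skipped."""
    vec = [0] * len(vocab)
    for word, count in Counter(doc.split()).items():
        if word in vocab:
            vec[vocab[word]] = count
    return vec

vocab = fit_vocabulary(["happy happy day", "sad day"])
print(vocab)                                   # {'day': 0, 'happy': 1, 'sad': 2}
print(count_vector("happy happy day", vocab))  # [1, 2, 0]
print(count_vector("sad sad night", vocab))    # [0, 0, 2]
```

Every tweet becomes a vector of the same length, one column per vocabulary word, which is what lets a classifier compare them.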


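Step 5: Train our machine learning models

With a numeric representation of the tweets ready, we can feed them directly into some classic machine learning models.

Here we train four different machine learning models, as shown in the code below. We focus only on the implementation. In fact, these four methods can be applied to any kind of classification problem. In our case, we want to classify a given tweet as happy or sad.

That said, I will not go into the inner workings of these algorithms (a quick Google search should help if you want to learn more). For now, it is enough to know that they exist. Note also that the syntax for implementing these models is standard.

First, let's build some models using TF-IDF: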

from sklearn.metrics import accuracy_score

# Model 1: Multinomial Naive Bayes Classifier
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(X_train_tfidf, y_train)
y_pred = nb.predict(X_val_tfidf)
print('naive bayes tfidf accuracy %s' % accuracy_score(y_val, y_pred))
# Output reported in the original run: naive bayes tfidf accuracy 0.5289017341040463

# Model 2: Linear SVM (SGDClassifier with its default hinge loss)
from sklearn.linear_model import SGDClassifier
lsvm = SGDClassifier(alpha=0.001, random_state=5, max_iter=15, tol=None)
lsvm.fit(X_train_tfidf, y_train)
y_pred = lsvm.predict(X_val_tfidf)
print('svm tfidf accuracy %s' % accuracy_score(y_val, y_pred))
# Output reported in the original run: svm tfidf accuracy 0.5404624277456648

# Model 3: Logistic Regression
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=1)
logreg.fit(X_train_tfidf, y_train)
y_pred = logreg.predict(X_val_tfidf)
print('log reg tfidf accuracy %s' % accuracy_score(y_val, y_pred))
# Output reported in the original run: log reg tfidf accuracy 0.5443159922928709

# Model 4: Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=500)
rf.fit(X_train_tfidf, y_train)
y_pred = rf.predict(X_val_tfidf)
print('random forest tfidf accuracy %s' % accuracy_score(y_val, y_pred))
# Output reported in the original run: random forest tfidf accuracy 0.5385356454720617
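
The accuracy figure used throughout is just the share of predictions that match the true labels. A minimal plain-Python equivalent, with a hypothetical function name and made-up labels, looks like this:

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(accuracy([0, 0, 1, 1, 1], [0, 1, 1, 1, 0]))  # 0.6
```

With two roughly balanced classes, a model that guesses at random lands near 0.5, so that is the baseline any score here has to be compared against.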


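The best model reaches an accuracy of only 54.43% (logistic regression), which on a two-class problem is barely better than guessing. That is not good. The complexity of the text data may play a part, but a score this close to 50% more often points to validation features that do not line up with the training features, which is what happens if the TF-IDF vectorizer is fitted again on the validation set instead of reusing the vocabulary learned from the training set.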
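
An accuracy this close to 50% is worth a sanity check of the feature pipeline. The plain-Python sketch below, using a hypothetical fit_vocabulary helper and made-up sentences, shows what goes wrong when a vocabulary is learned separately for training and validation text: the same word lands in a different column, so a model trained on one layout reads noise from the other.

```python
def fit_vocabulary(docs):
    """Map each distinct word to a column index, in alphabetical order."""
    words = sorted({w for doc in docs for w in doc.split()})
    return {w: i for i, w in enumerate(words)}

train_vocab = fit_vocabulary(["happy day", "sad night"])
val_vocab = fit_vocabulary(["awful sad morning"])

print(train_vocab)  # {'day': 0, 'happy': 1, 'night': 2, 'sad': 3}
print(val_vocab)    # {'awful': 0, 'morning': 1, 'sad': 2}
print(train_vocab["sad"], val_vocab["sad"])  # 3 2
```

Fitting once on the training text and only transforming the validation text keeps "sad" in the same column for both.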

Now let's build models using count vectors:

# Model 1: Multinomial Naive Bayes Classifier
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(X_train_count, y_train)
y_pred = nb.predict(X_val_count)
print('naive bayes count vectors accuracy %s' % accuracy_score(y_val, y_pred))
# Output reported in the original run: naive bayes count vectors accuracy 0.7764932562620424

# Model 2: Linear SVM (SGDClassifier with its default hinge loss)
from sklearn.linear_model import SGDClassifier
lsvm = SGDClassifier(alpha=0.001, random_state=5, max_iter=15, tol=None)
lsvm.fit(X_train_count, y_train)
y_pred = lsvm.predict(X_val_count)
print('lsvm using count vectors accuracy %s' % accuracy_score(y_val, y_pred))
# Output reported in the original run: lsvm using count vectors accuracy 0.7928709055876686

# Model 3: Logistic Regression
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=1)
logreg.fit(X_train_count, y_train)
y_pred = logreg.predict(X_val_count)
print('log reg count vectors accuracy %s' % accuracy_score(y_val, y_pred))
# Output reported in the original run: log reg count vectors accuracy 0.7851637764932563

# Model 4: Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=500)
rf.fit(X_train_count, y_train)
y_pred = rf.predict(X_val_count)
print('random forest with count vectors accuracy %s' % accuracy_score(y_val, y_pred))
# Output reported in the original run: random forest with count vectors accuracy 0.7524084778420038


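Using count vectors improves performance significantly. The best model, the linear SVM, reaches 79.28% accuracy.

This may be down to the nature of this particular dataset, in which the emotion of a text depends heavily on the presence of certain important adjectives.

Now let's test how the model does in practice by giving it some arbitrary text inputs.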

# Below are 8 random statements.
# The first 4 depict happiness
# The last 4 depict sadness
tweets = pd.DataFrame(['I am very happy today! The atmosphere looks cheerful',
                       'Things are looking great. It was such a good day',
                       'Success is right around the corner. Lets celebrate this victory',
                       'Everything is more beautiful when you experience them with a smile!',
                       'Now this is my worst, okay? But I am gonna get better.',
                       'I am tired, boss. Tired of being on the road, lonely as a sparrow in the rain. I am tired of all the pain I feel',
                       'This is quite depressing. I am filled with sorrow',
                       'His death broke my heart. It was a sad day'])

# Doing some preprocessing on these tweets as done before
# (regex=True is needed in recent pandas versions)
tweets[0] = tweets[0].str.replace(r'[^\w\s]', ' ', regex=True)
from nltk.corpus import stopwords
stop = stopwords.words('english')
tweets[0] = tweets[0].apply(lambda x: ' '.join(w for w in x.split() if w not in stop))
from textblob import Word
tweets[0] = tweets[0].apply(lambda x: ' '.join([Word(word).lemmatize() for word in x.split()]))

# Extracting count vector features from our tweets
tweet_count = count_vect.transform(tweets[0])

# Predicting the emotion of the tweets using our already trained linear SVM
tweet_pred = lsvm.predict(tweet_count)
print(tweet_pred)
# Output reported in the original run: [0 0 0 0 1 1 1 1]


Recall how we encoded the output: 0 stands for happiness and 1 for sadness. Our model correctly detected the emotion of all 8 sentences!
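
Reading raw 0s and 1s gets tedious, so it can help to map predictions back to label names. The sketch below assumes the alphabetical encoding described above; CLASSES and decode are hypothetical names. With the fitted encoder from Step 4, lbl_enc.inverse_transform(tweet_pred) should give the same result.

```python
# Index = encoded label; order follows the alphabetical encoding (assumption).
CLASSES = ["happiness", "sadness"]

def decode(predictions):
    """Map encoded predictions back to their label names."""
    return [CLASSES[p] for p in predictions]

print(decode([0, 0, 0, 0, 1, 1, 1, 1]))
```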

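But then why is our best accuracy only 79.28%? Note that the sentences I used for testing are grammatically standard: no spelling mistakes, no slang, sarcasm, or other complex figures of speech, which makes them easy for the model to classify.

Real Twitter data can be hard to preprocess. Still, we can conclude that the model works well on ordinary, grammatically correct tweets. With it we could gauge in real time whether a group of people feels happiness or sadness about a particular event or topic. We could also train the model to detect other specific emotions.

There are several ways to improve accuracy further, such as better preprocessing techniques or more relevant features. Some parameters of the model functions can also be tuned for a higher score.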