Clustering News Documents with Python (Latent Semantic Analysis)

东西二王 2019-05-04

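In this article, I explain how to use latent semantic analysis (LSA) to cluster a set of news articles and find similar documents among them.

LSA is an NLP technique for uncovering the hidden concepts, or topics, in a set of documents.

Loading the data

First, import the Python libraries we need: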

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sys
import nltk
# nltk.download('stopwords')  # uncomment on the first run to fetch the stop-word list
from nltk.corpus import stopwords
# from bs4 import BeautifulSoup as Soup
import json

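My sample data is a text file with one JSON object per line; the article text sits in the 'content' fields of each object's 'contents' list.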

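The following Python code loads the data and stores it in a list of strings. This part depends entirely on the format of your data. The later snippets continue the body of this function.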

def parseLog(file):
    # each line of the input file is one JSON object (one article)
    with open(file) as f:
        data = [json.loads(line.strip()) for line in f]

    # preprocessing: join every 'content' field of an article into one string
    content_list = []
    for item in data:
        string_content = ''
        if 'contents' in item:
            for part in item['contents']:
                if 'content' in part:
                    string_content = string_content + str(part['content'])
            content_list.append(string_content)

content_list now holds the full data set as a list of strings, one per article. So with 45,000 articles, content_list has 45,000 strings.
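The loading step can be tried without a data file. The sketch below is only an illustration of the same idea: the helper name extract_texts, the sample records, and the "id", "type", and "title" fields are made up here, and the isinstance check is an extra guard that the article's code does not have.

```python
import json

def extract_texts(lines):
    """Concatenate every 'content' field found under 'contents' in each JSON line."""
    texts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if 'contents' not in record:
            continue  # records without a 'contents' list are skipped
        # hypothetical guard: ignore entries that are not dicts (e.g. null)
        parts = [str(part['content'])
                 for part in record['contents']
                 if isinstance(part, dict) and 'content' in part]
        texts.append(''.join(parts))
    return texts

# made-up sample records, one JSON object per line
sample_lines = [
    '{"id": 1, "contents": [{"content": "Senate passes "}, {"content": "budget bill"}]}',
    '{"id": 2, "contents": [{"type": "image"}, {"content": "Storm hits coast"}]}',
    '{"id": 3, "title": "no body"}',
]
texts = extract_texts(sample_lines)
print(texts)  # ['Senate passes budget bill', 'Storm hits coast']
```

The third record has no 'contents' key, so it contributes no string, which matches how the loop in the article only appends inside the 'contents' check.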

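Data preprocessing

We now use pandas to apply some standard text preprocessing steps. First, we clean the text as much as possible. The idea is to remove punctuation, digits, and special characters in one pass with the regex replacement replace('[^a-zA-Z#]', ' '), which substitutes a space for every character that is not a letter or '#'. Next we drop short words (three characters or fewer), since they usually carry little useful information. Finally, we convert all text to lowercase.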

    news_df = pd.DataFrame({'document': content_list})
    # keep only letters and '#'; regex=True is needed because pandas >= 2.0
    # treats the pattern as a literal string by default
    news_df['clean_doc'] = news_df['document'].str.replace('[^a-zA-Z#]', ' ', regex=True)
    # removing null fields
    news_df = news_df[news_df['clean_doc'].notnull()]
    # removing short words (three characters or fewer)
    news_df['clean_doc'] = news_df['clean_doc'].apply(lambda x: ' '.join([w for w in x.split() if len(w) > 3]))
    # make all text lowercase
    news_df['clean_doc'] = news_df['clean_doc'].apply(lambda x: x.lower())
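To see what these three cleaning steps do to a single string, here is a minimal sketch that uses the standard re module instead of pandas. The function name clean_text and the sample sentence are illustrative assumptions; the regex and the length threshold are the ones described above.

```python
import re

def clean_text(text, min_len=4):
    """Apply the same three steps to one string: strip non-letters,
    drop words shorter than min_len, lowercase."""
    letters_only = re.sub(r'[^a-zA-Z#]', ' ', text)
    kept = [w for w in letters_only.split() if len(w) >= min_len]
    return ' '.join(kept).lower()

raw = 'In 2019, the U.S. Senate passed 3 budget bills!'
cleaned = clean_text(raw)
print(cleaned)  # senate passed budget bills
```

Note that abbreviations such as "U.S." are lost entirely, because the dots split them into one-letter words that the length filter then removes.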

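Next we remove stop words from the data. First, I load NLTK's English stop-word list. Stop words are words such as "a", "the", or "in" that carry little meaning on their own. I also add a few tokens left over from HTML markup and the site name.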

    stop_words = stopwords.words('english')
    stop_words.extend(['span', 'class', 'spacing', 'href', 'html', 'http', 'title', 'stats', 'washingtonpost'])
    # tokenization
    tokenized_doc = news_df['clean_doc'].apply(lambda x: x.split())
    # remove stop-words
    tokenized_doc = tokenized_doc.apply(lambda x: [item for item in x if item not in stop_words])
    # print(tokenized_doc)
    # de-tokenization: join each token list back into one string per document
    detokenized_doc = [' '.join(tokens) for tokens in tokenized_doc]
    # print(detokenized_doc)

Building the document-term matrix with TF-IDF

The data is now ready. We use sklearn's TfidfVectorizer to build a document-term matrix with at most 10,000 terms.

    from sklearn.feature_extraction.text import TfidfVectorizer
    # tfidf vectorizer of scikit-learn
    vectorizer = TfidfVectorizer(stop_words=stop_words, max_features=10000,
                                 max_df=0.5, use_idf=True, ngram_range=(1, 3))
    X = vectorizer.fit_transform(detokenized_doc)
    print(X.shape)  # check shape of the document-term matrix
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    terms = vectorizer.get_feature_names_out()
    # print(terms)

ngram_range=(1, 3) means the vocabulary includes unigrams, bigrams, and trigrams.
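As a rough illustration of what that setting produces, the sketch below generates word n-grams for one tokenized document. The helper word_ngrams is hypothetical and only approximates a word-level analyzer; it is not TfidfVectorizer's internal code.

```python
def word_ngrams(tokens, ngram_range=(1, 3)):
    """Return all contiguous word n-grams with lengths in ngram_range (inclusive)."""
    lo, hi = ngram_range
    grams = []
    for n in range(lo, hi + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[start:start + n]))
    return grams

tokens = 'senate passed budget bill'.split()
grams = word_ngrams(tokens)
print(len(grams))  # 9: four unigrams, three bigrams, two trigrams
print(grams)
```

Because every n-gram becomes a candidate column, the vocabulary grows quickly, which is one reason to cap it with max_features.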

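This document-term matrix is used for LSA and as the input to k-means clustering.

Clustering the text documents with k-means

In this step, we cluster the text documents with the k-means algorithm.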

    from sklearn.cluster import KMeans
    num_clusters = 10
    km = KMeans(n_clusters=num_clusters)
    km.fit(X)
    # one integer cluster label per document
    clusters = km.labels_.tolist()

clusters is used later for plotting. It is a list of integer labels from 0 to 9, one per document, assigning each document to one of the 10 clusters.
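For readers who want to see where such labels come from, here is a minimal NumPy sketch of Lloyd's algorithm, the classic iteration behind k-means. It is a simplified stand-in, not sklearn's KMeans (which adds smarter initialization and convergence checks); the function name kmeans and the six toy points are assumptions made for this example.

```python
import numpy as np

def kmeans(points, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm: assign each point to its nearest center,
    then move each center to the mean of its members."""
    rng = np.random.default_rng(seed)
    # start from k distinct data points chosen at random (fancy indexing copies)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # distances from every point to every center, shape (n_points, k)
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # leave a center in place if it lost all points
                centers[j] = members.mean(axis=0)
    return labels, centers

# two well-separated toy groups
points = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                   [10.0, 10.0], [10.2, 9.9], [9.8, 10.1]])
labels, centers = kmeans(points, k=2)
clusters = labels.tolist()
print(clusters)
```

The labels are zero-based, so with k clusters they run from 0 to k-1; which group receives label 0 depends on the random initialization.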

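Topic modeling

The next step is to represent each term and each document as a vector. We take the document-term matrix and decompose it into several matrices.

We use sklearn's randomized_svd for the matrix factorization. Some familiarity with LSA and singular value decomposition (SVD) helps in following the next part.

In the definition of SVD, the original matrix A ≈ UΣV*, where U and V have orthonormal columns and Σ is a diagonal matrix with non-negative entries.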

    from sklearn.decomposition import TruncatedSVD
    from sklearn.utils.extmath import randomized_svd
    U, Sigma, VT = randomized_svd(X, n_components=10, n_iter=100, random_state=122)
    # SVD represent documents and terms in vectors
    # svd_model = TruncatedSVD(n_components=2, algorithm='randomized', n_iter=100, random_state=122)
    # svd_model.fit(X)
    # print(U.shape)
    # each row of VT is one concept; print its seven highest-weighted terms
    for i, comp in enumerate(VT):
        terms_comp = zip(terms, comp)
        sorted_terms = sorted(terms_comp, key=lambda x: x[1], reverse=True)[:7]
        print('Concept ' + str(i) + ': ')
        for t in sorted_terms:
            print(t[0])
        print(' ')

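Here U, Sigma, and VT are the three factors obtained by decomposing the matrix X. U is the document-concept matrix (one row per document), Sigma holds the singular values that weight each concept (returned as a 1-D array rather than a full concept-by-concept diagonal matrix), and VT is the concept-term matrix (one row per concept, one column per term).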
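The shapes and properties of the three factors are easy to check on a small dense matrix. The sketch below swaps in NumPy's exact np.linalg.svd for randomized_svd so that it needs nothing beyond NumPy; the 5 x 4 toy matrix A is made up for illustration.

```python
import numpy as np

# toy document-term matrix: 5 documents (rows) x 4 terms (columns)
A = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 3.0],
    [0.0, 0.0, 3.0, 1.0],
    [1.0, 1.0, 0.0, 0.0],
])

# exact SVD; Sigma comes back as a 1-D array of singular values, largest first
U, Sigma, VT = np.linalg.svd(A, full_matrices=False)

# keep only the k strongest concepts, as n_components does
k = 2
U_k, Sigma_k, VT_k = U[:, :k], Sigma[:k], VT[:k, :]
print(U_k.shape, Sigma_k.shape, VT_k.shape)  # (5, 2) (2,) (2, 4)

# the columns of U_k are orthonormal, and so are the rows of VT_k
print(np.allclose(U_k.T @ U_k, np.eye(k)), np.allclose(VT_k @ VT_k.T, np.eye(k)))

# all factors together rebuild A; the truncated ones give a rank-k approximation
A_full = (U * Sigma) @ VT
A_k = (U_k * Sigma_k) @ VT_k
print(np.allclose(A_full, A), np.linalg.norm(A - A_k))
```

Multiplying U by the 1-D Sigma scales each column of U by its singular value, which is equivalent to multiplying by the diagonal matrix Σ.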

The code above extracts 10 concepts/topics (n_components=10) and prints the seven highest-weighted terms of each concept.

Visualizing the topics

To see how distinct our topics are, we should plot them. We cannot visualize more than three dimensions directly, but techniques such as PCA and t-SNE can project high-dimensional data down to fewer dimensions. Here we use a relatively new technique called UMAP (Uniform Manifold Approximation and Projection).

    import umap
    # scale each document's concept coordinates by the singular values
    X_topics = U * Sigma
    embedding = umap.UMAP(n_neighbors=100, min_dist=0.5, random_state=12).fit_transform(X_topics)
    plt.figure(figsize=(7, 5))
    plt.scatter(embedding[:, 0], embedding[:, 1],
                c=clusters,        # color each point by its k-means cluster
                s=10,              # size
                edgecolor='none')
    plt.show()

if __name__ == '__main__':
    parseLog(sys.argv[1])
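The X_topics = U * Sigma step gives every document its coordinates in concept space, and those coordinates can also be used to look for similar documents. The sketch below is an illustration under stated assumptions: it uses NumPy's exact SVD on a made-up 5 x 4 matrix, and cosine similarity is one common choice of similarity measure that the article itself does not prescribe.

```python
import numpy as np

# toy document-term matrix: documents 0, 1, 4 share terms 0-1; documents 2, 3 share terms 2-3
X = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 3.0],
    [0.0, 0.0, 3.0, 1.0],
    [1.0, 1.0, 0.0, 0.0],
])
U, Sigma, VT = np.linalg.svd(X, full_matrices=False)

k = 2
X_topics = U[:, :k] * Sigma[:k]   # document coordinates in concept space
projected = X @ VT[:k].T          # the same coordinates, obtained by projecting X onto the concepts
print(np.allclose(X_topics, projected))

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_01 = cosine(X_topics[0], X_topics[1])  # two documents on the same topic
sim_02 = cosine(X_topics[0], X_topics[2])  # two documents on different topics
print(sim_01, sim_02)
```

In this toy case documents 0 and 1 point in the same direction in concept space while documents 0 and 2 are nearly orthogonal, which is the behavior one hopes for when ranking neighbours of a news article.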
