NLP Text Vectorization (with Python Code)

520jefferson · Posted 2023-01-18 from Beijing

Author: Ctrl CV. Originally published on Zhihu: https://zhuanlan.zhihu.com/p/597088538

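Human language is highly ambiguous: a single sentence can carry several meanings or metaphors, and computers cannot yet truly understand the meaning of language or text. The main approach today is therefore to convert speech and text into vectors first, and then analyze those vectors or model them with deep learning.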

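Contents:
I. Common text vectorization methods
(1) One-hot word vectors
(2) Bag-of-words model (BOW)
(3) Term frequency-inverse document frequency (TF-IDF)
(4) N-gram model
(5) Word-to-vector model: Word2vec
(6) Document-to-vector model: Doc2vec
(7) GloVe model
II. TensorFlow word embedding visualization tool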

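I. Common text vectorization methods

(1) One-hot word vectors

Also known as one-hot encoding. Each word is represented as a vector of n elements in which exactly one element is 1 and all others are 0. Different words have the 1 in different positions, and n is the total number of distinct words in the corpus.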

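# Import the Keras Tokenizer, which maps words to integer indices
from tensorflow.keras.preprocessing.text import Tokenizer
# Assume vocab holds every distinct word in the corpus
# (a list rather than a set, so the index assignment is reproducible)
vocab = ['我', '爱', '北京', '天安门', '升国旗']
# Instantiate a tokenizer
t = Tokenizer(num_words=None, char_level=False)
# Fit the tokenizer on the existing text data
t.fit_on_texts(vocab)

for token in vocab:
    zero_list = [0]*len(vocab)
    # Map the word to its index; indices are natural numbers starting from 1
    # The result looks like [[2]], so use [0][0] to extract the number
    token_index = t.texts_to_sequences([token])[0][0] - 1
    zero_list[token_index] = 1
    print(token, 'one-hot encoding:', zero_list)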

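Drawbacks of one-hot encoding: it completely severs the relationships between words, and with a large corpus each vector becomes very long and consumes a great deal of memory.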
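
To make the first drawback concrete, here is a minimal sketch in plain Python (no Keras). The helper names `one_hot` and `dot` are made up for this illustration. It shows that the dot product of any two different one-hot vectors is 0, so the encoding cannot express that two words are related.

```python
# Toy vocabulary, reusing the words from the example above.
vocab = ['我', '爱', '北京', '天安门', '升国旗']
word_to_index = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """Return the one-hot vector of `word` as a list of 0/1 integers."""
    vec = [0] * len(vocab)
    vec[word_to_index[word]] = 1
    return vec


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Two different words are always orthogonal, however related they are.
print(dot(one_hot('北京'), one_hot('天安门')))  # 0
# A word only "matches" itself.
print(dot(one_hot('北京'), one_hot('北京')))  # 1
```

The vector length equals the vocabulary size, which is why memory use grows with the corpus.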

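(2) Bag-of-words model (BOW)

Bag of words means breaking an article into words, counting how many times each word appears, and guessing the gist of the article from the most frequent words.

The procedure is:

Tokenization: split the whole article into individual words and compile them into a vocabulary or dictionary. English is usually split on whitespace or periods; Chinese requires a dedicated tool such as jieba.
Preprocessing: lemmatize the words and convert them to lowercase, so that different forms of the same word are not counted separately.
Stop-word removal: forms of "be", auxiliary verbs, prepositions, articles, and other words that carry no particular meaning are called stop words. They occur in large numbers and must be removed, otherwise they dominate the counts.
Word-frequency counting: count how many times each word appears in the article and sort from high to low.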

# coding=utf-8
import collections

stop_words = ['\n', 'or', 'are', 'they', 'i', 'some', 'by', '—',
              'even', 'the', 'to', 'a', 'and', 'of', 'in', 'on', 'for',
              'that', 'with', 'is', 'as', 'could', 'its', 'this', 'other',
              'an', 'have', 'more', 'at', 'don’t', 'can', 'only', 'most']

maxlen = 1000
word_freqs = collections.Counter()
# word_freqs = {}
# print(word_freqs)
with open('../data/NLP_data/news.txt', 'r', encoding='utf8') as f:
    for line in f:
        words = line.lower().split(' ')
        if len(words) > maxlen:
            maxlen = len(words)

        for word in words:
            if not (word in stop_words):
                word_freqs[word] += 1
                # Word-frequency count (alternative using a plain dict)
                # count = word_freqs.get(word, 0)
                # print(count)
                # word_freqs[word] = count + 1

# print(word_freqs)
print(word_freqs.most_common(20))

# Sort by dictionary value
# a1 = sorted(word_freqs.items(), key=lambda x: x[1], reverse=True)
# print(a1[:20])
'''
[('stores', 15), ('convenience', 14), ('korean', 6), ('these', 6), ('one', 6), ('it’s', 6), ('from', 5), ('my', 5), ('you', 5), ('their', 5), ('just', 5), ('has', 5), ('new', 4), ('do', 4), ('also', 4), ('which', 4), ('find', 4), ('would', 4), ('like', 4), ('up', 4)]
'''
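
The script above only ranks words by frequency. To turn documents into actual vectors, each document can be mapped to its word counts over a shared vocabulary. The following is a minimal sketch in plain Python; the two documents and the stop-word set are made-up toy data, not the news file used above.

```python
from collections import Counter

# Hypothetical toy documents and stop-word list, for illustration only.
docs = ['the cat sat on the mat', 'the dog sat on the log']
stop_words = {'the', 'on'}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in stop_words]


tokenized = [tokenize(d) for d in docs]
# The shared vocabulary fixes the meaning of each vector position.
vocabulary = sorted({w for tokens in tokenized for w in tokens})


def bow_vector(tokens):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(tokens)
    return [counts[w] for w in vocabulary]


vectors = [bow_vector(t) for t in tokenized]
print(vocabulary)  # ['cat', 'dog', 'log', 'mat', 'sat']
print(vectors)     # [[1, 0, 0, 1, 1], [0, 1, 1, 0, 1]]
```

Word order is discarded: only the counts survive, which is where the "bag" in the name comes from.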

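(3) Term frequency-inverse document frequency (TF-IDF)

BOW is very simple and works reasonably well, but it has a weakness: some words are not stop words, yet appear frequently without being important to the article, such as "only" and "most", and they do little to help guess the gist. TF-IDF was proposed as an improvement. It gives lower scores to words that appear across many documents; if "only" appears in every document, its TF-IDF score will be very low.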

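Step 1: compute the term frequency (TF), the number of times a word appears in a document.

Because documents differ in length, the term frequency is normalized so that documents can be compared:

$$\mathrm{TF} = \frac{\text{occurrences of the word in the document}}{\text{total number of words in the document}}$$

or

$$\mathrm{TF} = \frac{\text{occurrences of the word in the document}}{\text{occurrences of the most frequent word in the document}}$$

Step 2: compute the inverse document frequency (IDF).

This requires a corpus that models how the language is actually used.

$$\mathrm{IDF} = \log\left(\frac{\text{total number of documents in the corpus}}{\text{number of documents containing the word} + 1}\right)$$

The more common a word is, the larger the denominator and the smaller the inverse document frequency, approaching 0. The 1 is added to the denominator so that it is never 0 (the case where no document contains the word). log means taking the logarithm of the resulting value.

Step 3: compute TF-IDF.

$$\mathrm{TF\text{-}IDF} = \mathrm{TF} \times \mathrm{IDF}$$

As you can see, TF-IDF is proportional to the number of times a word appears in a document and inversely related to how common the word is across the whole corpus. The algorithm for automatically extracting keywords is therefore clear: compute the TF-IDF value of every word in the document, sort in descending order, and take the top few words.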
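
As a worked example of the three steps, the sketch below normalizes TF by document length and computes IDF as log(N / (df + 1)), as described above. The four tokenized documents are made-up toy data, and the function names are chosen only for this illustration.

```python
import math

# Hypothetical toy corpus, already tokenized.
docs = [
    ['only', 'convenience', 'stores', 'stores'],
    ['only', 'korean', 'food'],
    ['only', 'new', 'stores'],
    ['only', 'korean', 'drama'],
]


def tf(word, doc):
    """Term frequency, normalized by document length."""
    return doc.count(word) / len(doc)


def idf(word, docs):
    """Inverse document frequency with +1 in the denominator."""
    containing = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / (containing + 1))


def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)


# Score every distinct word of the first document, highest first.
scores = {w: tf_idf(w, docs[0], docs) for w in set(docs[0])}
for word, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(word, round(score, 4))
# convenience 0.1733
# stores 0.1438
# only -0.0558
```

"only" appears in every document, so it ranks last. With the +1 in the denominator its score even drops slightly below zero, because the denominator (5) exceeds the number of documents (4).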

# TF-IDF for matching question-answer pairs
# coding=utf-8
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import numpy as np

corpus = [
    'This is the first document.',
    'This is the second document.',
    'And the third document.',
    'Is this the first document?'
]

vectorizer = CountVectorizer()
x = vectorizer.fit_transform(corpus)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
word = vectorizer.get_feature_names_out()
print('Vocabulary:', word)

print(x.toarray())

# TF-IDF transform
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(x)
print(np.around(tfidf.toarray(), 4))

from sklearn.metrics.pairwise import cosine_similarity
# Compare the last sentence with each of the other sentences
print(cosine_similarity(tfidf[-1], tfidf[:-1], dense_output=False))
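
The script above relies on `cosine_similarity` from scikit-learn without saying what it computes. As a rough sketch, cosine similarity is the dot product of two vectors divided by the product of their lengths. The plain-Python helper `cosine_sim` and the three small vectors below are invented for illustration and are not the vectors produced above.

```python
import math


def cosine_sim(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


# Hypothetical count vectors over the vocabulary ['document', 'first', 'second'].
a = [1, 1, 0]
b = [1, 0, 1]
c = [2, 2, 0]
print(round(cosine_sim(a, b), 4))  # 0.5
print(round(cosine_sim(a, c), 4))  # 1.0
```

Vectors `a` and `c` point in the same direction, so their similarity is 1 even though `c` is twice as long. This is why cosine similarity suits documents of different lengths.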

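Note that sklearn computes TF-IDF with a slightly different formula:

$$\mathrm{IDF} = \ln\left(\frac{1 + N}{1 + \mathrm{df}}\right) + 1$$

where N is the total number of documents and df is the number of documents containing the word.

A complete manual implementation of TF-IDF:

Note: 1 is added to both the numerator and the denominator for smoothing, and each document's TF-IDF vector is then normalized by dividing by the square root of the sum of its squared values (its L2 norm).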

# coding=utf-8
import math
import numpy

corpus = [
    'what is the weather like today',
    'what is for dinner tonight',
    'this is a question worth pondering',
    'it is a beautiful day today'
]
words = []
# Tokenize the corpus
for i in corpus:
    words.append(i.split())


# Count word frequencies per document
def Counter(word_list):
    wordcount = []
    for i in word_list:
        count = {}
        for j in i:
            if not count.get(j):
                count.update({j: 1})
            elif count.get(j):
                count[j] += 1
        wordcount.append(count)
    return wordcount


wordcount = Counter(words)

print(wordcount)


# Compute TF (word is the word being scored; word_list is the word-count dict of the document that contains it)
def tf(word, word_list):
    return word_list.get(word) / sum(word_list.values())


# Count the sentences that contain the word
def count_sentence(word, wordcount):
    return sum(1 for i in wordcount if i.get(word))


# Compute IDF
def idf(word, wordcount):
    # Unsmoothed variants (math.log and numpy.log are both natural logarithms):
    # return math.log(len(wordcount) / (count_sentence(word, wordcount) + 1))
    # return numpy.log(len(wordcount) / (count_sentence(word, wordcount) + 1))
    # Smoothed variant, matching the sklearn formula
    return math.log((1 + len(wordcount)) / (count_sentence(word, wordcount) + 1)) + 1


# Compute TF-IDF
def tfidf(word, word_list, wordcount):
    # print(word, idf(word, wordcount))
    return tf(word, word_list) * idf(word, wordcount)


p = 1

for i in wordcount:
    tf_idfs = 0
    print('part:{}'.format(p))
    p = p + 1
    for j, k in i.items():
        print('word: {} ---- TF-IDF:{}'.format(j, tfidf(j, i, wordcount)))

        # Accumulate squared scores for L2 normalization
        tf_idfs += (tfidf(j, i, wordcount) ** 2)

    tf_idfs = tf_idfs ** 0.5
    print(tf_idfs)

    for j, k in i.items():
        print('Normalized: word: {} ---- TF-IDF:{}'.format(j, tfidf(j, i, wordcount) / tf_idfs))

    # break

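# Note: the sample output below was produced with the unsmoothed formula
# log(N / (df + 1)), which is why "is" (present in every document) scores below zero.
'''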

part:1
word: what ---- TF-IDF:0.04794701207529681
word: is ---- TF-IDF:-0.03719059188570162
word: the ---- TF-IDF:0.11552453009332421
word: weather ---- TF-IDF:0.11552453009332421
word: like ---- TF-IDF:0.11552453009332421
word: today ---- TF-IDF:0.04794701207529681
part:2
word: what ---- TF-IDF:0.05753641449035617
word: is ---- TF-IDF:-0.044628710262841945
word: for ---- TF-IDF:0.13862943611198905
word: dinner ---- TF-IDF:0.13862943611198905
word: tonight ---- TF-IDF:0.13862943611198905
part:3
word: this ---- TF-IDF:0.11552453009332421
word: is ---- TF-IDF:-0.03719059188570162
word: a ---- TF-IDF:0.04794701207529681
word: question ---- TF-IDF:0.11552453009332421
word: worth ---- TF-IDF:0.11552453009332421
word: pondering ---- TF-IDF:0.11552453009332421
part:4
word: it ---- TF-IDF:0.11552453009332421
word: is ---- TF-IDF:-0.03719059188570162
word: a ---- TF-IDF:0.04794701207529681
word: beautiful ---- TF-IDF:0.11552453009332421
word: day ---- TF-IDF:0.11552453009332421
word: today ---- TF-IDF:0.04794701207529681

'''

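(4) N-gram model

Given a text sequence, the co-occurrence of n adjacent words or characters is an n-gram feature. The most commonly used are bi-gram and tri-gram features, corresponding to n = 2 and n = 3.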

# n is usually 2 or 3 for n-grams; this example uses 3
ngram_range = 3


def create_ngram_set(input_list):
    '''
    description: extract all n-gram features from a list of integers
    :param input_list: list of integers, e.g. a sentence after word-to-index mapping,
                       where each number lies in the range [1, 25000]
    :return: the set of n-gram features

    eg (with ngram_range = 3):
    # >>> create_ngram_set([1, 4, 9, 4, 1, 4])
    {(1, 4, 9), (4, 9, 4), (9, 4, 1), (4, 1, 4)}
    '''
    return set(zip(*[input_list[i:] for i in range(ngram_range)]))


if __name__ == '__main__':
    input_list = [1, 3, 2, 1, 5, 3]
    res = create_ngram_set(input_list)
    print(res)
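
The same zip trick works directly on word tokens and for any n. The sketch below is a small variation that takes n as a parameter and keeps the n-grams in order of appearance instead of collecting them into a set. The function name `ngrams` and the sample sentence are chosen only for this illustration.

```python
def ngrams(tokens, n):
    """Return the n-grams of `tokens` as tuples, in order of appearance."""
    # Zipping n shifted copies of the list pairs each token with its n-1 successors.
    return list(zip(*[tokens[i:] for i in range(n)]))


tokens = 'i love natural language processing'.split()
print(ngrams(tokens, 2))
# [('i', 'love'), ('love', 'natural'), ('natural', 'language'), ('language', 'processing')]
print(ngrams(tokens, 3))
# [('i', 'love', 'natural'), ('love', 'natural', 'language'), ('natural', 'language', 'processing')]
```

A sequence of m tokens yields m - n + 1 n-grams, so a sequence shorter than n yields none.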

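(5) Word-to-vector model: Word2vec

BOW and TF-IDF only look at how many times a word appears in a document and ignore the contextual relationships in language and text. To capture context, a Google research team proposed Word2vec, which represents each word in terms of its context and converts it to a vector. This is word embedding. Unlike the sparse vectors produced by TF-IDF, word embeddings are dense vectors.

There are two ways to build the word vectors: CBOW (Continuous Bag of Words), which predicts a word from its surrounding context, and Skip-gram, which predicts the surrounding context from a word.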
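
The two approaches differ mainly in how training samples are formed from a sentence. The sketch below is an illustration in plain Python, not gensim's internal implementation, and all function names are made up. CBOW (Continuous Bag of Words) takes the context words as input and the center word as target, while Skip-gram takes the center word as input and each context word as a target.

```python
def context_windows(tokens, window):
    """Yield (center, context_words) for every position in the sentence."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield center, left + right


def cbow_samples(tokens, window):
    # CBOW: the context words are the input, the center word is the target.
    return [(context, center) for center, context in context_windows(tokens, window)]


def skipgram_samples(tokens, window):
    # Skip-gram: the center word is the input, each context word is a target.
    return [(center, word)
            for center, context in context_windows(tokens, window)
            for word in context]


sentence = ['cat', 'say', 'meow']
print(cbow_samples(sentence, window=1))
# [(['say'], 'cat'), (['cat', 'meow'], 'say'), (['say'], 'meow')]
print(skipgram_samples(sentence, window=1))
# [('cat', 'say'), ('say', 'cat'), ('say', 'meow'), ('meow', 'say')]
```

The `window` argument plays the same role as the `window` parameter passed to gensim's `Word2Vec`.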

# coding=utf-8
import gzip
import gensim

from gensim.test.utils import common_texts
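# vector_size: dimensionality of the word vectors; window: number of context words considered on each side
# min_count: minimum number of occurrences for a word to be kept; workers: number of threads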
model_simple = gensim.models.Word2Vec(sentences=common_texts, window=1,
                                      min_count=1, workers=4)
# Returns the number of effective words and the total number of words processed
print(model_simple.train([['hello', 'world', 'michael']], total_examples=1, epochs=2))

sentences = [['cat', 'say', 'meow'], ['dog', 'say', 'woof']]

model_simple = gensim.models.Word2Vec(min_count=1)
model_simple.build_vocab(sentences)  # build the vocabulary
print(model_simple.train(sentences, total_examples=model_simple.corpus_count
                         , epochs=model_simple.epochs))


# Load the OpinRank corpus: reviews of cars and hotels
data_file='../nlp-in-practice-master/word2vec/reviews_data.txt.gz'

with gzip.open (data_file, 'rb') as f:
    for i,line in enumerate (f):
        print(line)
        break


# Read the OpinRank corpus and preprocess it
def read_input(input_file):
    with gzip.open (input_file, 'rb') as f:
        for i, line in enumerate (f):
            # Preprocess: lowercase and tokenize
            yield gensim.utils.simple_preprocess(line)

# Load the OpinRank corpus and tokenize it
documents = list(read_input(data_file))
# print(documents)


print(len(documents))

# Train the Word2Vec model (takes about 10 minutes)
model = gensim.models.Word2Vec(documents,
                               vector_size=150, window=10,
                               min_count=2, workers=10)
print(model.train(documents, total_examples=len(documents), epochs=10))


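# Words most similar to "dirty"
w1 = 'dirty'
print(model.wv.most_similar(positive=w1))
# positive: words that contribute positively to the similarity


# Words most similar to "polite"
w1 = ['polite']
print(model.wv.most_similar(positive=w1, topn=6))
# topn: return only the top n results


# Words most similar to "france"
w1 = ['france']
print(model.wv.most_similar(positive=w1, topn=6))
# topn: return only the top n results


# Words similar to "bed", "sheet", "pillow" and dissimilar to "couch"
w1 = ['bed','sheet','pillow']
w2 = ['couch']
print(model.wv.most_similar(positive=w1, negative=w2, topn=10))
# negative: words that contribute negatively to the similarity

# Compare the similarity of two words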
print(model.wv.similarity(w1='dirty', w2='smelly'))
print(model.wv.similarity(w1='dirty', w2='dirty'))

print(model.wv.similarity(w1='dirty', w2='clean'))

# Pick the word that does not belong with the others
print(model.wv.doesnt_match(['cat', 'dog', 'france']))

# Keyword extraction
# Note: gensim.summarization was removed in gensim 4.0, so the code below needs gensim 3.8.x
# https:///gensim_3.8.3/summarization/keywords.html
# from gensim.summarization import keywords


# # Test text
# text = '''Challenges in natural language processing frequently involve
# speech recognition, natural language understanding, natural language
# generation (frequently from formal, machine-readable logical forms),
# connecting language and machine perception, dialog systems, or some
# combination thereof.'''

# Extract the keywords
# print(''.join(keywords(text)))

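(6) Document-to-vector model: Doc2vec

Doc2vec was inspired by Word2Vec. The word vectors that Word2Vec learns while predicting words carry semantic information, and Doc2vec builds on the same structure, so it overcomes the bag-of-words model's lack of semantics. Treating each sentence as a training sample, Doc2vec, like Word2Vec, has two training modes: the Distributed Memory Model of Paragraph Vectors (PV-DM), which is analogous to CBOW in Word2Vec, and the Distributed Bag of Words version of Paragraph Vector (PV-DBOW), which is analogous to Skip-gram.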
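
To illustrate the difference between the two modes, the sketch below builds the training samples each one would see for a single tagged sentence. It is a simplified illustration in plain Python with made-up function names, not gensim's implementation. In particular, real PV-DBOW samples words from the document at random, whereas this sketch simply lists every word.

```python
def pv_dm_samples(doc_tag, tokens, window):
    """PV-DM: (document tag + context words) -> target word, like CBOW plus a tag."""
    samples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        samples.append(((doc_tag, context), target))
    return samples


def pv_dbow_samples(doc_tag, tokens):
    """PV-DBOW: the document tag alone -> a word of the document, like Skip-gram."""
    return [(doc_tag, word) for word in tokens]


tokens = ['mobile', 'pay', 'code']
print(pv_dm_samples(0, tokens, window=1))
# [((0, ['pay']), 'mobile'), ((0, ['mobile', 'code']), 'pay'), ((0, ['pay']), 'code')]
print(pv_dbow_samples(0, tokens))
# [(0, 'mobile'), (0, 'pay'), (0, 'code')]
```

In both modes the document tag gets its own trainable vector, and that vector serves as the document embedding after training.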

# coding=utf-8
import numpy as np
import nltk
import gensim
from gensim.models import word2vec
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.metrics.pairwise import cosine_similarity

# Use a context manager so the file is closed after reading
with open('../data/FAQ/starbucks_faq.txt', 'r', encoding='utf8') as f:
    corpus = f.readlines()

print(corpus)

MAX_WORDS_A_LINE = 30
import string

print(string.punctuation)

# The NLTK stop-word list must be available (run nltk.download('stopwords') once)
stopword_list = set(nltk.corpus.stopwords.words('english')
                    + list(string.punctuation) + ['\n'])


# Tokenization function
def tokenize(text, stopwords, max_len=MAX_WORDS_A_LINE):
    return [token for token in gensim.utils.simple_preprocess(text
                                                              , max_len=max_len) if token not in stopwords]


# Tokenize
document_tokens = []  # cleaned tokens for each line
for line in corpus:
    document_tokens.append(tokenize(line, stopword_list))

# Convert to gensim's TaggedDocument format
tagged_corpus = [TaggedDocument(doc, [i]) for i, doc in
                 enumerate(document_tokens)]

# Train the Doc2Vec model
model_d2v = Doc2Vec(tagged_corpus, vector_size=MAX_WORDS_A_LINE, epochs=200)
model_d2v.train(tagged_corpus, total_examples=model_d2v.corpus_count,
                      epochs=model_d2v.epochs)

# Test: infer a vector for each line of the corpus
questions = []
for i in range(len(document_tokens)):
    questions.append(model_d2v.infer_vector(document_tokens[i]))
questions = np.array(questions)
# print(questions.shape)

# Test query
# text = 'find allergen information'
# text = 'mobile pay'
text = 'verification code'
filtered_tokens = tokenize(text, stopword_list)
# print(filtered_tokens)

# Compare the query with every line of the corpus
similarity = cosine_similarity(model_d2v.infer_vector(
    filtered_tokens).reshape(1, -1), questions, dense_output=False)

# Pick the top 10
top_n = np.argsort(np.array(similarity[0]))[::-1][:10]
print(f'Top 10 indices: {top_n}\n')
for i in top_n:
    print(round(similarity[0][i], 4), corpus[i].rstrip('\n'))

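(7) GloVe model

GloVe is another word embedding model, proposed at Stanford University. Its authors argued that Word2vec does not consider the global probability distribution: it samples only the words inside a moving window and never sees the text as a whole. They therefore introduced a word co-occurrence matrix that accounts for the probability of words appearing together, addressing both Word2vec's purely local view and the sparse vector space of BOW.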
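
The co-occurrence matrix is the piece that distinguishes GloVe, so here is a minimal sketch of how such counts could be collected over a whole corpus. It uses plain symmetric counts within a window. The function name and toy sentences are made up for illustration, and the actual GloVe tooling additionally down-weights distant word pairs.

```python
from collections import defaultdict


def cooccurrence_counts(sentences, window):
    """Count how often two words appear within `window` positions of each other."""
    counts = defaultdict(float)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            # Look back at most `window` words and count the pair in both directions.
            for j in range(max(0, i - window), i):
                counts[(word, tokens[j])] += 1.0
                counts[(tokens[j], word)] += 1.0
    return counts


sentences = [['cat', 'say', 'meow'], ['dog', 'say', 'woof']]
counts = cooccurrence_counts(sentences, window=2)
print(counts.get(('cat', 'say'), 0.0))   # 1.0
print(counts.get(('say', 'meow'), 0.0))  # 1.0
print(counts.get(('cat', 'dog'), 0.0))   # 0.0
```

The counts are accumulated over every sentence before any vector is trained, which is the "global" information the paragraph above refers to. GloVe then fits word vectors whose dot products approximate the logarithm of these counts.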

# coding=utf-8
# Load the required packages
import numpy as np

# Load the GloVe word-vector file glove.6B.300d.txt
'''
https://github.com/stanfordnlp/GloVe
'''
embeddings_dict = {}
with open('../data/glove/glove.6B.300d.txt', 'r', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.asarray(values[1:], 'float32')
        embeddings_dict[word] = vector

# Look up the GloVe vector of an arbitrary word (love)
# print(embeddings_dict['love'])

# Number of words
# print(len(embeddings_dict.keys()))

# Measure similarity by Euclidean distance
from scipy.spatial.distance import euclidean


def find_closest_embeddings(embedding):
    return sorted(embeddings_dict.keys(),
                  key=lambda word: euclidean(embeddings_dict[word], embedding))


print(find_closest_embeddings(embeddings_dict['king'])[1:10])

# Pick 100 arbitrary words
# words = list(embeddings_dict.keys())[100:200]
# print(words)
words = find_closest_embeddings(embeddings_dict['king'])[1:10]

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

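# Reduce to two features with t-SNE
# perplexity must be smaller than the number of samples (9 here); the default of 30 is too large
tsne = TSNE(n_components=2, perplexity=5)
vectors = np.array([embeddings_dict[word] for word in words])
Y = tsne.fit_transform(vectors)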

# Draw a scatter plot to inspect word similarity
plt.figure(figsize=(12, 8))
plt.axis('off')
plt.scatter(Y[:, 0], Y[:, 1])
for label, x, y in zip(words, Y[:, 0], Y[:, 1]):
    plt.annotate(label, xy=(x, y), xytext=(0, 0), textcoords='offset points')

plt.show()