Natural Language Processing in R: text2vec

脑系科数据科学 2020-04-29

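1. text2vec: Background and Basic Principles

text2vec is an R package written by Dmitriy Selivanov in October 2016. It provides a simple and efficient API framework for text analysis and natural language processing.

Its core is written in C++, and many parts (GloVe, for example) are parallelized with packages such as RcppParallel, which speeds up processing. It also uses a streaming API, so the data does not have to be loaded into memory in full before analysis. This makes efficient use of memory and shows that the package was designed with the large data volumes typical of NLP in mind.

text2vec can also be seen as an ecosystem for text analysis. It covers four main areas: vectorization, GloVe word embeddings (an "upgraded" take on Word2Vec), topic modeling, and similarity measures, which makes it powerful and practical. See the official website for details.

The basic text mining workflow

Build a document-term matrix (DTM) or a term co-occurrence matrix (TCM), optionally weighted with TF-IDF.
Fit a model on the DTM, such as text (sentiment) classification, a topic model, or a similarity measure, then tune and validate it.
Finally, apply the fitted model to new data.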
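To make the term co-occurrence matrix concrete, here is a minimal base-R sketch that counts how often two terms appear next to each other. The token sequence and the window size of 1 are made-up illustrations, and text2vec's own create_tcm() may weight and store the counts differently.

```r
# Toy token sequence (illustrative only, not from the original post)
tokens <- c("the", "cat", "sat", "on", "the", "mat")
window <- 1L  # assumed context window: direct neighbours only

vocab <- sort(unique(tokens))
tcm <- matrix(0L, nrow = length(vocab), ncol = length(vocab),
              dimnames = list(vocab, vocab))

for (i in seq_along(tokens)) {
  for (offset in seq_len(window)) {
    j <- i + offset
    if (j > length(tokens)) break
    a <- tokens[i]
    b <- tokens[j]
    # Count the pair symmetrically
    tcm[a, b] <- tcm[a, b] + 1L
    if (a != b) tcm[b, a] <- tcm[b, a] + 1L
  }
}

print(tcm)
```

Each cell counts how many times the row term and the column term occur within the window of each other.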

The most important step is therefore turning text into numbers. Several ways to do this were mentioned above and are introduced next.

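2. DTM and TF-IDF: Principles and Implementation

2.1 The ideas behind DTM and TF-IDF

What is a DTM? Wikipedia gives a detailed explanation of the document-term matrix that is simple and easy to follow. Its example uses two documents, named D1 and D2.

In other words, a DTM records how many times each term (a word, i.e. an entry of the vocabulary) occurs in each document. It is a very intuitive way to turn sentences of text into numbers. However, if we count every single word, the result is a very large sparse matrix.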
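As a small illustration of the idea, the following base-R sketch builds a DTM by hand. The two sentences are made up for this example and are not taken from the original post; text2vec builds the same kind of matrix in sparse form.

```r
# Two toy documents (illustrative, not taken from the original post)
docs <- c(D1 = "I like databases", D2 = "I dislike databases")

# Lower-case and split on spaces
tokens <- strsplit(tolower(docs), " ", fixed = TRUE)

# The vocabulary is the set of unique terms
vocab <- sort(unique(unlist(tokens)))

# Count each vocabulary term in each document
dtm <- t(vapply(tokens,
                function(tk) as.integer(table(factor(tk, levels = vocab))),
                integer(length(vocab))))
dimnames(dtm) <- list(names(docs), vocab)

print(dtm)
```

Each row is a document and each column a term, so a cell holds the number of times that term occurs in that document.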

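This raises a question: do we really need to count every term? Some words occur so rarely that counting them is not meaningful. Others occur very frequently but carry little meaning, such as "的" ("of") and "是" ("is").

This is where TF-IDF comes in.

TF-IDF evaluates how important a word is to one document within a collection or corpus. The importance of a word increases in proportion to the number of times it occurs in the document, but decreases with how frequently it occurs across the corpus.

Put simply, if a word is highly representative of a document, it will occur many times in that document but will not occur as often in the other documents.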

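TF: within a given document, the term frequency (TF) is the number of times a given term occurs in that document. This count is usually normalized to prevent a bias towards long documents:

$$\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$

Here $n_{i,j}$ is the number of occurrences of term $t_i$ in document $d_j$, and the denominator is the total number of occurrences of all terms in document $d_j$.

IDF: the inverse document frequency (IDF) measures how much general importance a term carries:

$$\mathrm{idf}_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|}$$

where:

the numerator $|D|$ is the total number of documents in the corpus;

the denominator is the number of documents that contain the term $t_i$. If the term does not occur in the corpus at all, the denominator would be zero, so in practice 1 is usually added to it, giving $1 + |\{j : t_i \in d_j\}|$.

TF-IDF is then the product:

$$\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_i$$

A high term frequency within a particular document, combined with a low document frequency of that term across the whole collection, yields a high TF-IDF weight. TF-IDF therefore tends to filter out common terms and keep the important ones.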
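The definitions above can be checked on a tiny count matrix. This is a base-R sketch with made-up counts and the unsmoothed IDF; text2vec's TfIdf class may use a smoothed or differently normalized variant, so its values need not match these exactly.

```r
# Toy term counts: rows are documents, columns are terms (illustrative values)
counts <- matrix(c(2, 1, 0,
                   1, 0, 3),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("d1", "d2"), c("the", "cat", "dog")))

# TF: counts normalized by document length
tf <- counts / rowSums(counts)

# IDF: log of (number of documents / number of documents containing the term)
df <- colSums(counts > 0)
idf <- log(nrow(counts) / df)

# TF-IDF: multiply every column of tf by the matching idf value
tfidf <- sweep(tf, 2, idf, `*`)

print(round(tfidf, 3))
```

The term "the" occurs in both documents, so its IDF is log(2/2) = 0 and its weight vanishes, while "cat" and "dog" each occur in only one document and keep a positive weight.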

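That is the idea behind DTM and TF-IDF.

2.2 Implementing the DTM

Building a DTM takes the following steps:

Set up the token iterator
Define the stop words to remove
Tokenize and build the vocabulary
Prune low-frequency terms
Build the corpus vectorizer
Build the DTM

The data used here is the movie review dataset that ships with the package.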

# Data preparation
require(tidyverse)

## Loading required package: tidyverse

## ── Attaching packages ───────────────────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 3.1.0       ✔ purrr   0.3.2  
## ✔ tibble  2.1.1       ✔ dplyr   0.8.0.1
## ✔ tidyr   0.8.3       ✔ stringr 1.4.0  
## ✔ readr   1.3.1       ✔ forcats 0.4.0

## ── Conflicts ──────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

require(text2vec)

## Loading required package: text2vec

require(data.table)

## Loading required package: data.table

## 
## Attaching package: 'data.table'

## The following objects are masked from 'package:dplyr':
## 
##     between, first, last

## The following object is masked from 'package:purrr':
## 
##     transpose

data("movie_review")  
setDT(movie_review)  
setkey(movie_review, id)  
set.seed(2016L)  
all_ids = movie_review$id  
train_ids = sample(all_ids, 4000)  
test_ids = setdiff(all_ids, train_ids)  
train = movie_review[J(train_ids)]  
test = movie_review[J(test_ids)] 

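# Start building
prep_fun = tolower   # convert to lower case
# tok_fun controls how finely the text is split
tok_fun = word_tokenizer   # function used to split strings into tokens
# Step 1. Set up the token iterator
it_train = itoken(train$review,    # the corpus
                  preprocessor = prep_fun,
                  tokenizer = tok_fun,
                  ids = train$id,   # ids are optional
                  progressbar = FALSE)
# Step 2. Tokenization
# Stop words to remove
stop_words = c("i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours")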

# create_vocabulary builds the vocabulary from an iterator and a stop word list
vocab = create_vocabulary(it_train, stopwords = stop_words)
head(vocab)

## Number of docs: 4000 
## 11 stopwords: i, me, my, myself, we, our ... 
## ngram_min = 1; ngram_max = 1 
## Vocabulary: 
##           term term_count doc_count
## 1:  injections          1         1
## 2:     everone          1         1
## 3:       argie          1         1
## 4:   naturists          1         1
## 5:         zag          1         1
## 6: koenekamp's          1         1

# Prune low-frequency terms
pruned_vocab = prune_vocabulary(vocab,
                                term_count_min = 10,   # drop terms that occur fewer than 10 times
                                doc_proportion_max = 0.5,
                                doc_proportion_min = 0.001)
head(pruned_vocab)

## Number of docs: 4000 
## 11 stopwords: i, me, my, myself, we, our ... 
## ngram_min = 1; ngram_max = 1 
## Vocabulary: 
##            term term_count doc_count
## 1: accompanying         10        10
## 2:        react         10        10
## 3:      pressed         10        10
## 4:        walsh         10         8
## 5:       unsure         10        10
## 6:        trace         10        10
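To show what the three pruning thresholds do, here is a base-R sketch on a made-up count matrix. It only mimics the filtering logic described above (minimum term count, minimum and maximum share of documents containing the term); the threshold of 2 and the toy terms are assumptions, and this is not text2vec's implementation.

```r
# Toy counts: 4 documents, 3 terms (illustrative values)
counts <- matrix(c(1, 2, 0,
                   1, 0, 0,
                   2, 1, 0,
                   1, 0, 1),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("doc", 1:4), c("the", "alien", "zag")))

term_count <- colSums(counts)            # total occurrences of each term
doc_proportion <- colMeans(counts > 0)   # share of documents containing the term

# Keep terms that are frequent enough but not present in most documents
keep <- term_count >= 2 &
  doc_proportion <= 0.5 &
  doc_proportion >= 0.001

pruned <- counts[, keep, drop = FALSE]
print(colnames(pruned))
```

Here "the" is dropped because it appears in every document, "zag" because it occurs only once, and "alien" survives.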

# Step 3. Build the vectorizer from the pruned vocabulary
vectorizer = vocab_vectorizer(pruned_vocab)
head(vectorizer)

##                                                                           
## 1 function (iterator, grow_dtm, skip_grams_window_context, window_size,   
## 2     weights)                                                            
## 3 {                                                                       
## 4     vocab_corpus_ptr = cpp_vocabulary_corpus_create(vocabulary$term,    
## 5         attr(vocabulary, "ngram")[[1]], attr(vocabulary, "ngram")[[2]], 
## 6         attr(vocabulary, "stopwords"), attr(vocabulary, "sep_ngram"))

# Step 4. Build the DTM by passing in the token iterator and the vectorizer
dtm_train = create_dtm(it_train, vectorizer)
head(dtm_train)

## [1] 0 0 0 0 0 0

The key piece here is the vectorizer: once it exists, new data can be converted straight into a DTM with the same columns.

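prep_fun = tolower   # convert to lower case
# tok_fun controls how finely the text is split
tok_fun = word_tokenizer   # function used to split strings into tokens
# Step 1. Set up the token iterator for the test set
it_test = itoken(test$review,    # the corpus
                  preprocessor = prep_fun,
                  tokenizer = tok_fun,
                  progressbar = FALSE)

# Build the test DTM from it_test (not it_train), reusing the training vectorizer
dtm_test = create_dtm(it_test, vectorizer)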
head(dtm_test)

## [1] 0 0 0 0 0 0

The DTM for the test data is now built as well.
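The reason a fixed vectorizer matters can be sketched in base R: new text is counted against the vocabulary learned from the training data, so the columns always line up and unseen words are simply dropped. The helper vectorize() and the toy sentences below are hypothetical and only illustrate the idea.

```r
# Vocabulary assumed to have been learned from training data
vocab <- c("bad", "good", "movie")

# Hypothetical helper: count only the terms that are in the fixed vocabulary
vectorize <- function(docs, vocab) {
  tokens <- strsplit(tolower(docs), "[^a-z]+")
  dtm <- t(vapply(tokens,
                  function(tk) as.integer(table(factor(tk, levels = vocab))),
                  integer(length(vocab))))
  colnames(dtm) <- vocab
  dtm
}

dtm_train_toy <- vectorize(c("Good movie", "Bad movie"), vocab)
dtm_new_toy <- vectorize("Good, GOOD film!", vocab)

print(dtm_train_toy)
print(dtm_new_toy)
```

The word "film" is not in the vocabulary, so it contributes nothing, and the new matrix has exactly the same columns as the training matrix.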

2.3 Implementing TF-IDF

TF-IDF is built on top of the DTM in two steps:

1. Set up the TF-IDF transformer
2. Convert the DTM to TF-IDF weights

tfidf = TfIdf$new()  

tm_train_tfidf = fit_transform(dtm_train, tfidf)
head(tm_train_tfidf)

## [1] 0 0 0 0 0 0

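# Build the TF-IDF matrix for the test set
prep_fun = tolower   # convert to lower case
# tok_fun controls how finely the text is split
tok_fun = word_tokenizer   # function used to split strings into tokens
# Step 1. Set up the token iterator
it_test = itoken(test$review,    # the corpus
                  preprocessor = prep_fun,
                  tokenizer = tok_fun,
                  progressbar = FALSE)
# transform() reuses the IDF weights fitted on the training set
# (fit_transform() would refit them on the test set)
dtm_test_tfidf = create_dtm(it_test, vectorizer) %>%
  transform(tfidf)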

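With the DTM and TF-IDF matrices built, we can move on to modeling.

3. Sentiment Analysis

Sentiment modeling first requires a label that marks the sentiment each text expresses. The data used so far already contains such labels: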

train[1,]

##         id sentiment
## 1: 11912_2         0
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          review
## 1: The story behind this movie is very interesting, and in general the plot is not so bad... but the details: writing, directing, continuity, pacing, action sequences, stunts, and use of CG all cheapen and spoil the film.<br /><br />First off, action sequences. They are all quite unexciting. Most consist of someone standing up and getting shot, making no attempt to run, fight, dodge, or whatever, even though they have all the time in the world. The sequences just seem bland for something made in 2004.<br /><br />The CG features very nicely rendered and animated effects, but they come off looking cheap because of how they are used.<br /><br />Pacing: everything happens too quickly. For example, \\"Elle\\" is trained to fight in a couple of hours, and from the start can do back-flips, etc. Why is she so acrobatic? None of this is explained in the movie. As Lilith, she wouldn't have needed to be able to do back flips - maybe she couldn't, since she had wings.<br /><br />Also, we have sequences like a woman getting run over by a car, and getting up and just wandering off into a deserted room with a sink and mirror, and then stabbing herself in the throat, all for no apparent reason, and without any of the spectators really caring that she just got hit by a car (and then felt the secondary effects of another, exploding car)... \\"Are you okay?\\" asks the driver \\"yes, I'm fine\\" she says, bloodied and disheveled.<br /><br />I watched it all, though, because the introduction promised me that it would be interesting... but in the end, the poor execution made me wish for anything else: Blade, Vampire Hunter D, even that movie with vampires where Jackie Chan was comic relief, because they managed to suspend my disbelief, but this just made me want to shake the director awake, and give the writer a good talking to.

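Once the DTM or TF-IDF matrix is built, a sentiment model can be fitted on it, so the first step is to build the DTM and the TF-IDF matrix.

1. Build the DTM

# Data preparation
data("movie_review")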
setDT(movie_review)  
setkey(movie_review, id)  
set.seed(2016L)  
all_ids = movie_review$id  
train_ids = sample(all_ids, 4000)  
test_ids = setdiff(all_ids, train_ids)  
train = movie_review[J(train_ids)]  
test = movie_review[J(test_ids)] 

# Start building
prep_fun = tolower   # convert to lower case
# tok_fun controls how finely the text is split
tok_fun = word_tokenizer   # function used to split strings into tokens
# Step 1. Set up the token iterator
it_train = itoken(train$review,    # the corpus
                  preprocessor = prep_fun,
                  tokenizer = tok_fun,
                  ids = train$id,   # ids are optional
                  progressbar = FALSE)
# Step 2. Tokenization
# Stop words to remove
stop_words = c("i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours")

# create_vocabulary builds the vocabulary from an iterator and a stop word list
vocab = create_vocabulary(it_train, stopwords = stop_words)
head(vocab)

## Number of docs: 4000 
## 11 stopwords: i, me, my, myself, we, our ... 
## ngram_min = 1; ngram_max = 1 
## Vocabulary: 
##           term term_count doc_count
## 1:  injections          1         1
## 2:     everone          1         1
## 3:       argie          1         1
## 4:   naturists          1         1
## 5:         zag          1         1
## 6: koenekamp's          1         1

# Prune low-frequency terms
pruned_vocab = prune_vocabulary(vocab,
                                term_count_min = 10,   # drop terms that occur fewer than 10 times
                                doc_proportion_max = 0.5,
                                doc_proportion_min = 0.001)
head(pruned_vocab)

## Number of docs: 4000 
## 11 stopwords: i, me, my, myself, we, our ... 
## ngram_min = 1; ngram_max = 1 
## Vocabulary: 
##            term term_count doc_count
## 1: accompanying         10        10
## 2:        react         10        10
## 3:      pressed         10        10
## 4:        walsh         10         8
## 5:       unsure         10        10
## 6:        trace         10        10

# Step 3. Build the vectorizer from the pruned vocabulary
vectorizer = vocab_vectorizer(pruned_vocab)
head(vectorizer)

##                                                                           
## 1 function (iterator, grow_dtm, skip_grams_window_context, window_size,   
## 2     weights)                                                            
## 3 {                                                                       
## 4     vocab_corpus_ptr = cpp_vocabulary_corpus_create(vocabulary$term,    
## 5         attr(vocabulary, "ngram")[[1]], attr(vocabulary, "ngram")[[2]], 
## 6         attr(vocabulary, "stopwords"), attr(vocabulary, "sep_ngram"))

# Step 4. Build the DTM by passing in the token iterator and the vectorizer
dtm_train = create_dtm(it_train, vectorizer)
head(dtm_train)

## [1] 0 0 0 0 0 0

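prep_fun = tolower   # convert to lower case
# tok_fun controls how finely the text is split
tok_fun = word_tokenizer   # function used to split strings into tokens
# Step 1. Set up the token iterator for the test set
it_test = itoken(test$review,    # the corpus
                  preprocessor = prep_fun,
                  tokenizer = tok_fun,
                  ids = test$id,
                  progressbar = FALSE)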

dtm_test = create_dtm(it_test, vectorizer)
head(dtm_test)

## [1] 0 0 0 0 0 0

2. Build the sentiment model

We use logistic regression as the sentiment model: fit it, then validate it.

library(glmnet)

## Loading required package: Matrix

## 
## Attaching package: 'Matrix'

## The following object is masked from 'package:tidyr':
## 
##     expand

## Loading required package: foreach

## 
## Attaching package: 'foreach'

## The following objects are masked from 'package:purrr':
## 
##     accumulate, when

## Loaded glmnet 2.0-16

NFOLDS = 4
glmnet_classifier = cv.glmnet(x = dtm_train, y = train[['sentiment']],
                              family = 'binomial',
                              # L1 penalty
                              alpha = 1,
                              # interested in the area under ROC curve
                              type.measure = "auc",
                              # 4-fold cross-validation
                              nfolds = NFOLDS,
                              # high value is less accurate, but has faster training
                              thresh = 1e-3,
                              # again lower number of iterations for faster training
                              maxit = 1e3)
plot(glmnet_classifier)

preds = predict(glmnet_classifier, dtm_test, type = 'response')[,1] 
glmnet:::auc(test$sentiment, preds)

## [1] 0.917145
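For readers who want to see what the AUC figure measures, here is a base-R sketch that computes it from ranks (the Mann-Whitney formulation). The helper auc_rank() and the toy labels and scores are illustrative assumptions, not the internals of glmnet.

```r
# Hypothetical helper: AUC as the probability that a random positive
# example receives a higher score than a random negative one
auc_rank <- function(labels, scores) {
  r <- rank(scores)                      # average ranks handle ties
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy data (illustrative values)
labels <- c(0, 0, 1, 1)
scores <- c(0.1, 0.4, 0.35, 0.8)

auc_toy <- auc_rank(labels, scores)
print(auc_toy)
```

Of the four positive-negative pairs in the toy data, three are ordered correctly, which gives an AUC of 0.75.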

# Threshold the predicted probabilities at 0.5 to get class labels
preds[preds <= 0.5] = 0
preds[preds > 0.5] = 1
preds <- as.integer(preds)

caret::confusionMatrix(table(preds,test$sentiment))

## Confusion Matrix and Statistics
## 
##      
## preds   0   1
##     0 409  64
##     1  94 433
##                                           
##                Accuracy : 0.842           
##                  95% CI : (0.8179, 0.8641)
##     No Information Rate : 0.503           
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.6841          
##  Mcnemar's Test P-Value : 0.02105         
##                                           
##             Sensitivity : 0.8131          
##             Specificity : 0.8712          
##          Pos Pred Value : 0.8647          
##          Neg Pred Value : 0.8216          
##              Prevalence : 0.5030          
##          Detection Rate : 0.4090          
##    Detection Prevalence : 0.4730          
##       Balanced Accuracy : 0.8422          
##                                           
##        'Positive' Class : 0               
##
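The headline statistics in this report follow directly from the four cells of the confusion matrix. The base-R sketch below recomputes them from the counts printed above, taking class 0 as the "positive" class as the report does; the variable names are chosen for this illustration.

```r
# Confusion matrix from the report: rows are predictions, columns are true labels
cm <- matrix(c(409, 94, 64, 433), nrow = 2,
             dimnames = list(preds = c("0", "1"), actual = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)
# With class 0 as the positive class:
sensitivity <- cm["0", "0"] / sum(cm[, "0"])   # true 0s predicted as 0
specificity <- cm["1", "1"] / sum(cm[, "1"])   # true 1s predicted as 1
pos_pred_value <- cm["0", "0"] / sum(cm["0", ])

print(round(c(accuracy = accuracy,
              sensitivity = sensitivity,
              specificity = specificity,
              pos_pred_value = pos_pred_value), 4))
```

The rounded values agree with the Accuracy, Sensitivity, Specificity and Pos Pred Value lines of the report.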

3. Build the TF-IDF matrix

tfidf = TfIdf$new()  

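tm_train_tfidf = fit_transform(dtm_train, tfidf)
# Build the TF-IDF matrix for the test set
prep_fun = tolower   # convert to lower case
# tok_fun controls how finely the text is split
tok_fun = word_tokenizer   # function used to split strings into tokens
# Step 1. Set up the token iterator
it_test = itoken(test$review,    # the corpus
                  preprocessor = prep_fun,
                  tokenizer = tok_fun,
                  progressbar = FALSE)
# transform() reuses the IDF weights fitted on the training set
# (fit_transform() would refit them on the test set)
dtm_test_tfidf = create_dtm(it_test, vectorizer) %>%
  transform(tfidf)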

4. Build the sentiment model on TF-IDF

library(glmnet)  
NFOLDS = 4
glmnet_classifier = cv.glmnet(x = tm_train_tfidf, y = train[['sentiment']],
                              family = 'binomial',
                              # L1 penalty
                              alpha = 1,
                              # interested in the area under ROC curve
                              type.measure = "auc",
                              # 4-fold cross-validation
                              nfolds = NFOLDS,
                              # high value is less accurate, but has faster training
                              thresh = 1e-3,
                              # again lower number of iterations for faster training
                              maxit = 1e3)
plot(glmnet_classifier)

preds = predict(glmnet_classifier, dtm_test_tfidf, type = 'response')[,1] 
glmnet:::auc(test$sentiment, preds)

## [1] 0.9111048

# Threshold the predicted probabilities at 0.5 to get class labels
preds[preds <= 0.5] = 0
preds[preds > 0.5] = 1
preds <- as.integer(preds)

caret::confusionMatrix(table(preds,test$sentiment))

## Confusion Matrix and Statistics
## 
##      
## preds   0   1
##     0 412  71
##     1  91 426
##                                           
##                Accuracy : 0.838           
##                  95% CI : (0.8137, 0.8603)
##     No Information Rate : 0.503           
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.6761          
##  Mcnemar's Test P-Value : 0.1355          
##                                           
##             Sensitivity : 0.8191          
##             Specificity : 0.8571          
##          Pos Pred Value : 0.8530          
##          Neg Pred Value : 0.8240          
##              Prevalence : 0.5030          
##          Detection Rate : 0.4120          
##    Detection Prevalence : 0.4830          
##       Balanced Accuracy : 0.8381          
##                                           
##        'Positive' Class : 0               
##

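4. The LDA Topic Model and Its Implementation

What is LDA? LDA is an unsupervised machine learning technique that can identify the latent topics in a large document collection or corpus. It takes a bag-of-words approach, treating each document as a vector of word counts, which turns text into numerical information that is easy to model. The bag-of-words approach ignores word order, which simplifies the problem but also leaves room for improving the model. Each document is represented as a probability distribution over topics, and each topic as a probability distribution over words. Because the components of a Dirichlet-distributed random vector are only weakly correlated (they are correlated at all only because they must sum to 1), the latent topics we posit are also nearly uncorrelated. This does not match many real problems and is another known limitation of LDA.

For each document in the corpus, LDA defines the following generative process:

1. For each document, draw a topic from the document's topic distribution;

2. Draw a word from the word distribution of the topic just drawn;

Repeat this process until every word in the document has been generated.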
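The generative process can be simulated in a few lines of base R. This is a sketch under stated assumptions: the vocabulary, the number of topics, the Dirichlet parameter 0.5 and the helper rdirichlet1() are all made up for illustration, and a topic is drawn separately for each word position, as LDA is commonly formulated.

```r
set.seed(1)

# Hypothetical helper: one draw from a Dirichlet distribution via gamma variates
rdirichlet1 <- function(alpha) {
  g <- rgamma(length(alpha), shape = alpha)
  g / sum(g)
}

vocab <- c("plot", "actor", "scene", "goal", "team", "match")
n_topics <- 2
doc_len <- 8

# Each topic is a probability distribution over the vocabulary (topics x terms)
phi <- t(replicate(n_topics, rdirichlet1(rep(0.5, length(vocab)))))

# The document is a probability distribution over topics
theta <- rdirichlet1(rep(0.5, n_topics))

# For every word position: draw a topic, then draw a word from that topic
z <- sample(n_topics, doc_len, replace = TRUE, prob = theta)
words <- vapply(z, function(k) sample(vocab, 1, prob = phi[k, ]), character(1))

print(words)
```

Fitting LDA runs this story in reverse: given only the words, it estimates the topic mixture of each document and the word distribution of each topic.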

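A simple way to think about it is as clustering documents: documents are grouped by topic.

Building an LDA model also starts by converting the text into a DTM or TF-IDF matrix, on which the model is then fitted.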

data("movie_review")  
setDT(movie_review)  
setkey(movie_review, id)  
set.seed(2016L)  
all_ids = movie_review$id  
train_ids = sample(all_ids, 4000)  
test_ids = setdiff(all_ids, train_ids)  
train = movie_review[J(train_ids)]  
test = movie_review[J(test_ids)] 

# Start building
prep_fun = tolower   # convert to lower case
# tok_fun controls how finely the text is split
tok_fun = word_tokenizer   # function used to split strings into tokens
# Step 1. Set up the token iterator
it_train = itoken(train$review,    # the corpus
                  preprocessor = prep_fun,
                  tokenizer = tok_fun,
                  ids = train$id,   # ids are optional
                  progressbar = FALSE)
# Step 2. Tokenization
# Stop words to remove
stop_words = c("i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours")

# create_vocabulary builds the vocabulary from an iterator and a stop word list
vocab = create_vocabulary(it_train, stopwords = stop_words)
# Prune low-frequency terms
pruned_vocab = prune_vocabulary(vocab,
                                term_count_min = 10,   # drop terms that occur fewer than 10 times
                                doc_proportion_max = 0.5,
                                doc_proportion_min = 0.001)
# Step 3. Build the vectorizer from the pruned vocabulary
vectorizer = vocab_vectorizer(pruned_vocab)
vectorizer

## function (iterator, grow_dtm, skip_grams_window_context, window_size, 
##     weights) 
## {
##     vocab_corpus_ptr = cpp_vocabulary_corpus_create(vocabulary$term, 
##         attr(vocabulary, "ngram")[[1]], attr(vocabulary, "ngram")[[2]], 
##         attr(vocabulary, "stopwords"), attr(vocabulary, "sep_ngram"))
##     setattr(vocab_corpus_ptr, "ids", character(0))
##     setattr(vocab_corpus_ptr, "class", "VocabCorpus")
##     corpus_insert(vocab_corpus_ptr, iterator, grow_dtm, skip_grams_window_context, 
##         window_size, weights)
## }
## <bytecode: 0x7fa976ef9e50>
## <environment: 0x7fa979d67d20>

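# Step 4. Build the DTM by passing in the token iterator and the vectorizer
dtm_train = create_dtm(it_train, vectorizer)

prep_fun = tolower   # convert to lower case
# tok_fun controls how finely the text is split
tok_fun = word_tokenizer   # function used to split strings into tokens
# Step 1. Set up the token iterator for the test set
it_test = itoken(test$review,    # the corpus
                  preprocessor = prep_fun,
                  tokenizer = tok_fun,
                  ids = test$id,
                  progressbar = FALSE)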

dtm_test = create_dtm(it_test, vectorizer)


lda_model = LDA$new(n_topics = 10)

doc_topic_distr = lda_model$fit_transform(dtm_train, n_iter = 20)

## INFO [2019-06-04 17:01:25] iter 10 loglikelihood = -3748832.642
## INFO [2019-06-04 17:01:25] iter 20 loglikelihood = -3635602.774

lda_model$plot()

## Loading required namespace: servr

In the result (doc_topic_distr), each row gives one document's probabilities over the topics.
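A common next step is to assign each document its most probable topic. The base-R sketch below does this on a small made-up matrix that has the same layout as a document-topic matrix (rows are documents, columns are topics); the numbers are illustrative, not output of the model above.

```r
# Toy document-topic matrix: each row sums to 1 (illustrative values)
doc_topic <- matrix(c(0.7, 0.2, 0.1,
                      0.1, 0.1, 0.8),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(c("doc1", "doc2"), paste0("topic", 1:3)))

# Index of the most probable topic for every document
dominant_topic <- max.col(doc_topic, ties.method = "first")
names(dominant_topic) <- rownames(doc_topic)

print(dominant_topic)
```

Grouping documents by this index gives the clustering view of LDA described earlier.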

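5. Similarity Measures

text2vec provides two sets of functions for measuring distance/similarity:

sim2(x, y, method): computes the similarity between every row of x and every row of y, i.e. nrow(x) * nrow(y) similarities;
psim2(x, y, method): computes "parallel" similarities between corresponding rows of x and y, i.e. nrow(x) similarities;
dist2(x, y, method): the distance counterpart of sim2, i.e. nrow(x) * nrow(y) distances;
pdist2(x, y, method): the "parallel" distance between corresponding rows of x and y, i.e. nrow(x) distances.

sim2 is the most commonly used.

As before, the first step is to build a DTM, so we simply reuse the DTMs built earlier.

Jaccard similarity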

d1_d2_jac_sim = sim2(dtm_test, dtm_train, method = "jaccard", norm = "none")

Cosine similarity

d1_d2_cos_sim = sim2(dtm_train, dtm_test, method = "cosine", norm = "l2")
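To make the difference between the pairwise and the "parallel" variants concrete, here is a base-R sketch on tiny dense matrices. The helper functions and the toy rows are assumptions for illustration and are not text2vec's implementation, which works on sparse matrices.

```r
# Toy count matrices: rows are documents, columns are terms (illustrative values)
x <- rbind(a = c(1, 1, 0), b = c(0, 2, 2))
y <- rbind(q = c(1, 1, 0), r = c(0, 0, 1))

# Scale every row to unit L2 length
l2_normalize <- function(m) m / sqrt(rowSums(m^2))

# Pairwise cosine: every row of x against every row of y (nrow(x) x nrow(y))
cosine_pairwise <- l2_normalize(x) %*% t(l2_normalize(y))

# Parallel cosine: row i of x against row i of y (vector of length nrow(x))
cosine_parallel <- rowSums(l2_normalize(x) * l2_normalize(y))

# Pairwise Jaccard on term presence: |intersection| / |union|
jaccard_pairwise <- function(x, y) {
  xb <- (x > 0) * 1
  yb <- (y > 0) * 1
  inter <- xb %*% t(yb)
  uni <- outer(rowSums(xb), rowSums(yb), "+") - inter
  inter / uni
}
jaccard <- jaccard_pairwise(x, y)

print(cosine_pairwise)
print(cosine_parallel)
print(jaccard)
```

The pairwise result is a full matrix, which is what the question-matching use case needs, while the parallel result only compares rows at the same position.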

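Next, let's work through a case study: building an automatic question-answering system.

6. An Automatic Question-Answering System

The approach is:

Prepare the corpus
Map questions to answers
Build the DTM
Compute the similarity between the user's question and the known questions
Retrieve the answer

First we build the corpus, which here is about mathematics: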

math = data.frame(c("子矩阵","线性方程组","线性变换","方阵","单位阵"),
                  c("子矩陣是在矩陣選取部份行、列所組成的新矩陣。 它亦可用A(3;2)表示，顯示除掉第3行和第2列的餘下的矩陣。 這兩種方法比較常用，但還是沒有標準的方法表示子矩陣。","线性方程组是数学方程组的一种，它符合以下的形式： 其中的以及等等是已知的常数，而等等则是要求的未知数。 如果用线性代数中的概念来表达，则线性方程组可以写成： 这里的A是m×n 矩阵，x是含有n个元素列向量，b是含有m 个元素列向量。","在数学中，线性映射是在两个向量空间之间的一种保持向量加法和标量乘法的特殊映射。线性映射从抽象代数角度看是向量空间的同态，从范畴论角度看是在给定的域上的向量空间所构成的范畴中的态射。","方塊矩陣，或简称方阵，是行數及列數皆相同的矩陣。 由 矩陣組成的集合，連同矩陣加法和矩陣乘法，构成環。 除了 ，此環並不是交换環。 M(n, R)，即實方塊矩陣環，是個實有单位的結合代數","单位阵是单位矩阵的简称，它指的是主对角线上都是1，其余元素皆为0的矩阵。 在矩阵的乘法中,有一种矩阵起着特殊的作用，如同数的乘法中的1，我们称这种矩阵为单位矩阵，简称单位阵"))

names(math) <- c("V1","V2")

QA = function(question, math) {
  library(text2vec)
  # Build the vocabulary from the known questions (column V1)
  math_sample = math
  it = itoken(as.character(math_sample$V1),
              tokenizer = word_tokenizer)
  # Create a vocabulary of unique terms
  v = create_vocabulary(it)
  # Filter the vocabulary, dropping very common and very rare terms
  pruned_vocab = prune_vocabulary(v, term_count_min = 1,
                                  doc_proportion_max = 0.5, doc_proportion_min = 0.001)
  # Create the vectorizer used to build a DTM / TCM / corpus
  vectorizer = vocab_vectorizer(pruned_vocab)
  # DTM of the known questions
  it = itoken(as.character(math_sample$V1), preprocessor = tolower,
              tokenizer = word_tokenizer)
  dtm_raw = create_dtm(it, vectorizer)
  # DTM of the incoming question
  it = itoken(question, preprocessor = tolower,
              tokenizer = word_tokenizer)
  dtm_question = create_dtm(it, vectorizer)
  # Find the most similar known question (similarities are computed once)
  sims = as.matrix(sim2(dtm_raw, dtm_question))
  n = which(sims == max(sims))

  print(paste("The answer you need is:", math_sample$V2[n]))

}

We can then get answers through the QA function:

QA("子矩阵是什么",math = math)

## [1] "The answer you need is: 子矩陣是在矩陣選取部份行、列所組成的新矩陣。 它亦可用A(3;2)表示，顯示除掉第3行和第2列的餘下的矩陣。 這兩種方法比較常用，但還是沒有標準的方法表示子矩陣。"

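So we can see that in building a system with natural language capabilities, the algorithm is one side of the story; the other is whether a sufficient corpus is available.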
