CN108984526A - A deep learning-based document topic vector extraction method - Google Patents

A deep learning-based document topic vector extraction method Download PDF

Info

Publication number
CN108984526A
CN108984526A
Authority
CN
China
Prior art keywords
vector
word
representing
time
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810748564.1A
Other languages
Chinese (zh)
Other versions
CN108984526B (en)
Inventor
高扬
黄河燕
陆池
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN201810748564.1A priority Critical patent/CN108984526B/en
Publication of CN108984526A publication Critical patent/CN108984526A/en
Application granted granted Critical
Publication of CN108984526B publication Critical patent/CN108984526B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/258Heading extraction; Automatic titling; Numbering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

本发明涉及一种基于深度学习的文档主题向量抽取方法，属于自然语言处理技术领域。本发明方法利用卷积神经网络抽取出具有局部的深层的语义信息，利用LSTM模型将时序信息学习出来，使得向量的语义更加全面，选用上下文短语和文档主题的隐含的共现关系，避免了一些基于句子的主题向量模型对于短文本的缺点，利用注意力机制将CNN和LSTM模型有机地结合起来，学习了上下文的深层语义、时序信息和显著信息，更有效地构建了文档主题向量抽取的模型。

The invention relates to a method for extracting document topic vectors based on deep learning, and belongs to the technical field of natural language processing. The method uses a convolutional neural network to extract local, deep semantic information and an LSTM model to learn sequential information, making the semantics of the vector more comprehensive; it exploits the implicit co-occurrence relationship between context phrases and document topics, avoiding the weakness that some sentence-based topic vector models have on short texts; and it uses an attention mechanism to organically combine the CNN and LSTM models, learning the deep semantics, sequential information and salient information of the context, so that the model for document topic vector extraction is built more effectively.

Description

一种基于深度学习的文档主题向量抽取方法A Method of Document Topic Vector Extraction Based on Deep Learning

技术领域technical field

本发明涉及一种基于深度学习的文档主题向量抽取方法,属于自然语言处理技术领域。The invention relates to a method for extracting document topic vectors based on deep learning, and belongs to the technical field of natural language processing.

背景技术Background technique

在如今的大数据时代，如何发现海量互联网文本数据的主题是一个研究重点。对文本数据的主题进行分析，文档主题向量本质上是表示文档的深层语义，是主题和语义的内在结合。抽取出文档主题向量可以广泛地应用于自然语言处理任务中，包括社交网络和新媒体的舆情分析、新闻热点的及时获取等等。因此，如何高效地抽取出文档主题向量是一个重要研究课题。In today's big data era, how to discover the topics of massive Internet text data is a key research focus. When analyzing the topics of text data, the document topic vector essentially represents the deep semantics of the document and is an intrinsic combination of topic and semantics. Extracted document topic vectors can be widely applied to natural language processing tasks, including public opinion analysis for social networks and new media, timely detection of trending news, and so on. Therefore, how to extract document topic vectors efficiently is an important research topic.

对于文本数据而言，其主题并不一定直接体现在具体的文字内容上，这就使得挖掘文本隐含的主题变得困难，需要根据文本的单词、句子、段落等关系来提取出文档所包含的主题意义，并结合文档的篇章关系从而提取出文档的主题。近些年随着统计自然语言处理方法和语料库的丰富，基于“词语-主题”“文档-主题”的文本主题建模方法也相继被提出，其基本思想在于假设每个词语和文档的主题是服从一个统计概率分布，通过对语料数据的训练，计算出其文档主题的概率分布，然后再根据这个文档主题进行聚类。For text data, the topic is not necessarily reflected directly in the literal content, which makes mining the implicit topics of a text difficult: the topical meaning carried by the document has to be extracted from the relationships among its words, sentences and paragraphs, and combined with the discourse structure of the document to derive its topic. In recent years, with the enrichment of statistical natural language processing methods and corpora, text topic modeling methods based on "word-topic" and "document-topic" relations have been proposed one after another. Their basic idea is to assume that the topic of each word and document obeys a statistical probability distribution; by training on the corpus data, the probability distribution of the document topics is computed, and the documents are then clustered according to these topics.

要正确分析出每个文档的主题，传统方法是对文本的每个词都进行主题分析，但是这种方法存在一个很大的问题：真正决定文本主题的词语其实只占该文本词语的少部分，因此传统方法会对与主题无关的词语进行大量的分析，这一方面无关词语导致实现起来计算量大，另一方面也存在着对于文本主题提取不精确，不能结合文本内在关联度关系挖掘文本深层语义的问题。To correctly analyze the topic of each document, the traditional approach is to perform topic analysis on every word of the text. This approach has a major problem: the words that truly determine the topic of a text account for only a small portion of its words, so the traditional approach spends a large amount of analysis on topic-irrelevant words. On the one hand, these irrelevant words make the computation expensive; on the other hand, topic extraction becomes imprecise, and the deep semantics of the text cannot be mined in combination with its internal relatedness.

随着硬件性能的提升以及数据规模的不断扩大，深度学习亦被广泛应用于各个领域之中，在其原有基础上大幅度提升了实验结果。深度学习以其优雅的模型、灵活的架构等特点，近些年结合单词Embedding和文档Embedding的方法中，得到了广泛的运用。在所有的深度学习方法中，CNN(Convolutional Neural Network，卷积神经网络)和LSTM模型(Long Short-Term Memory，长短期记忆网络模型)是最主流的两个。在自然语言处理任务中，基于CNN和LSTM模型的文本分析方法能够很好地发现文本的潜在语义特征，在语义分析计算上给予诸如自动文摘、情感分析、机器翻译等自然语言处理任务极大的帮助。With the improvement of hardware performance and the continuous growth of data scale, deep learning has been widely applied in many fields and has substantially improved experimental results over previous baselines. With its elegant models and flexible architectures, deep learning has in recent years been widely used in methods that combine word embeddings and document embeddings. Among deep learning methods, CNN (Convolutional Neural Network) and the LSTM model (Long Short-Term Memory network) are the two most mainstream. In natural language processing tasks, text analysis methods based on CNN and LSTM models can effectively discover the latent semantic features of text and greatly assist semantic analysis in tasks such as automatic summarization, sentiment analysis and machine translation.

发明内容Contents of the invention

本发明的目的是为了克服现有技术的缺陷,解决如何结合文本内在关联度关系挖掘文本深层语义的问题,提出一种基于深度学习的文档主题向量抽取方法。本发明把文档主题向量建模更多的聚焦在对文档主题特征向量的分析上,挖掘出文本特征和主题向量隐含的相关性,从而学习文档主题向量。The purpose of the present invention is to overcome the defects of the prior art, solve the problem of how to mine the deep semantics of the text in combination with the internal relevance relationship of the text, and propose a document topic vector extraction method based on deep learning. The present invention focuses more on the analysis of document topic feature vectors in the modeling of document topic vectors, digs out the implicit correlation between text features and topic vectors, and thereby learns document topic vectors.

本发明的核心思想为：利用CNN提取上下文短语的语义，将提取出来的语义输入到LSTM模型中，利用注意力机制提取文本的不同位置和不同意义词语的重要性，从而保留了重要信息，也完成了CNN和LSTM模型的有机结合，挖掘出上下文之间的内在关联，学习了具有深层语义和显著的文档主题向量。The core idea of the present invention is to use a CNN to extract the semantics of context phrases, feed the extracted semantics into the LSTM model, and use an attention mechanism to capture the importance of words at different positions and with different meanings, thereby retaining the important information. This also achieves an organic combination of the CNN and LSTM models, mines the intrinsic associations within the context, and learns document topic vectors with deep semantics and salience.

本发明方法是通过下述技术方案实现的。The method of the present invention is realized through the following technical solutions.

一种基于深度学习的文档主题向量抽取方法,包括以下步骤:A method for extracting document topic vectors based on deep learning, comprising the following steps:

步骤一、进行相关定义,具体如下:Step 1. Define the relevant definitions, as follows:

定义1:文档D,D=[w1,w2,...,wi,...,wn],wi表示文档D的第i个单词;Definition 1: Document D, D=[w 1 ,w 2 ,..., wi ,...,w n ], w i represents the i-th word of document D;

定义2:预测单词wd+1,表示需要学习的目标单词;Definition 2: The predicted word w d+1 represents the target word that needs to be learned;

定义3:窗口单词,由文本中几个连续出现的单词构成,窗口单词之间存在隐藏的内在关联;Definition 3: Window words are composed of several words that appear consecutively in the text, and there are hidden internal associations between window words;

定义4:上下文短语,表示预测单词所在位置之前出现的窗口单词,窗口长度为l,上下文短语记为wd-l,wd-l+1,...,wdDefinition 4: context phrase, which means the window word that appears before the position of the predicted word, the window length is l, and the context phrase is recorded as w dl , w d-l+1 ,...,w d ;

定义5:文档主题映射矩阵,通过LDA算法(Latent Dirichlet Allocation)学习得到,每一行代表一个文档的主题;Definition 5: Document topic mapping matrix, learned by LDA algorithm (Latent Dirichlet Allocation), each row represents the topic of a document;

定义6:Nd和docid,Nd表示语料中文档的个数,docid表示文档的位置;每一个文档对应唯一的一个docid,其中,1≤docid≤NdDefinition 6: N d and doc id , N d represents the number of documents in the corpus, and doc id represents the location of the document; each document corresponds to a unique doc id , where 1≤doc id≤N d ;

步骤二、利用CNN,学习得到上下文短语的语义向量。Step 2. Use CNN to learn the semantic vector of the context phrase.

步骤三、利用LSTM模型学习上下文短语的语义,获得隐含层向量hd-l,hd-l+1,...,hdStep 3: Use the LSTM model to learn the semantics of contextual phrases, and obtain hidden layer vectors h dl , h d-l+1 ,...,h d .

步骤四、通过注意力机制，将CNN和LSTM模型有机结合，获得上下文短语语义向量的平均值。Step 4. Through the attention mechanism, organically combine the CNN and LSTM models to obtain the average of the context phrase semantic vectors.

步骤五、通过逻辑回归的方法，利用上下文短语语义向量的平均值和文档主题信息预测目标单词wd+1，获得目标单词wd+1的预测概率。Step 5. Using logistic regression, predict the target word wd+1 from the average of the context phrase semantic vectors and the document topic information, and obtain the predicted probability of the target word wd+1.

有益效果Beneficial effect

本发明一种基于深度学习的文档主题向量抽取方法,对比现有技术,具有如下有益效果:A method for extracting document topic vectors based on deep learning in the present invention, compared with the prior art, has the following beneficial effects:

1.利用CNN抽取出具有局部的深层的语义信息;1. Use CNN to extract local deep semantic information;

2.利用LSTM模型将时序信息学习出来,使得向量的语义更加全面;2. Use the LSTM model to learn the timing information, making the semantics of the vector more comprehensive;

3.选用上下文短语和文档主题的隐含的共现关系,避免了一些基于句子的主题向量模型对于短文本的缺点;3. Select the implicit co-occurrence relationship between contextual phrases and document topics, avoiding the shortcomings of some sentence-based topic vector models for short texts;

4.利用注意力机制将CNN和LSTM模型有机地结合起来，学习了上下文的深层语义、时序信息和显著信息，更有效地构建了文档主题向量抽取的模型。4. The attention mechanism organically combines the CNN and LSTM models, learning the deep semantics, sequential information and salient information of the context, and building a more effective model for document topic vector extraction.

附图说明Description of drawings

图1为本发明一种基于深度学习的文档主题向量抽取方法的流程图。FIG. 1 is a flowchart of a method for extracting document topic vectors based on deep learning in the present invention.

具体实施方式Detailed ways

为了使本发明的目的、技术方案及优点更加清楚明白,以下根据附图及实施例对本发明所述的文摘方法进一步详细说明。In order to make the object, technical solution and advantages of the present invention clearer, the abstracting method described in the present invention will be further described in detail below according to the drawings and embodiments.

一种基于深度学习的文档主题向量抽取方法,其基本实施过程如下:A document topic vector extraction method based on deep learning, the basic implementation process is as follows:

步骤一、进行相关定义,具体如下:Step 1. Define the relevant definitions, as follows:

定义1:文档D,D=[w1,w2,...,wi,...,wn],wi表示文档D的第i个单词;Definition 1: Document D, D=[w 1 ,w 2 ,..., wi ,...,w n ], w i represents the i-th word of document D;

定义2:预测单词wd+1，表示需要学习的目标单词；Definition 2: the predicted word wd+1 denotes the target word to be learned;

定义3:窗口单词,由文本中几个连续出现的单词构成,窗口单词之间存在隐藏的内在关联;Definition 3: Window words are composed of several words that appear consecutively in the text, and there are hidden internal associations between window words;

定义4:上下文短语(wd-l,wd-l+1,...,wd),表示预测单词所在位置之前出现的窗口单词,上下文短语长度为l;Definition 4: The context phrase (w dl ,w d-l+1 ,...,w d ) indicates the window word that appears before the location of the predicted word, and the length of the context phrase is l;

定义5:文档主题映射矩阵,通过LDA算法学习得到,每一行代表一个文档的主题;Definition 5: Document topic mapping matrix, learned by LDA algorithm, each row represents the topic of a document;

定义6:Nd和docid,Nd表示语料中文档的个数,docid表示文档的位置;每一个文档对应唯一的一个docid,其中,1≤docid≤NdDefinition 6: N d and doc id , N d represents the number of documents in the corpus, and doc id represents the location of the document; each document corresponds to a unique doc id , where 1≤doc id≤N d ;

步骤二、利用CNN,学习得到上下文短语的语义向量Context。Step 2. Use CNN to learn the semantic vector Context of the context phrase.

具体实现过程如下:The specific implementation process is as follows:

步骤2.1利用word2vec等算法，训练文档D的词向量矩阵，词向量矩阵大小为n×m，n表示词向量矩阵的长，m表示词向量矩阵的宽；Step 2.1 Use an algorithm such as word2vec to train the word vector matrix of document D; the word vector matrix has size n×m, where n denotes the length of the word vector matrix and m denotes its width;
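As an illustration only (not part of the patent text), step 2.1 could be realized with an off-the-shelf word2vec implementation such as gensim; the toy corpus, the 128-dimension setting and the skip-gram choice below are assumptions.

```python
from gensim.models import Word2Vec

# Tiny placeholder corpus; in the patent's setting this would be the tokenized documents.
tokenized_docs = [
    ["deep", "learning", "extracts", "document", "topic", "vectors"],
    ["lstm", "and", "cnn", "learn", "contextual", "semantics"],
]

# vector_size corresponds to m, the width of the word-vector matrix
# (gensim >= 4.0 keyword assumed; older versions use size= instead).
w2v = Word2Vec(sentences=tokenized_docs, vector_size=128, window=5, min_count=1, sg=1)

word_matrix = w2v.wv.vectors      # n x m matrix of word vectors for the vocabulary
vec_lstm = w2v.wv["lstm"]         # look up a single word vector
```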

步骤2.2将上下文短语中每个单词对应的词向量从步骤2.1得到的词向量矩阵抽取出来,从而得到上下文短语wd-l,wd-l+1,...,wd的向量矩阵M;Step 2.2 extracts the word vector corresponding to each word in the context phrase from the word vector matrix obtained in step 2.1, so as to obtain the vector matrix M of the context phrase w dl , w d-l+1 ,...,w d ;

步骤2.3利用CNN计算上下文短语的语义向量Context。具体通过步骤2.2得到的向量矩阵M和K层大小为Cl×Cm的卷积核进行操作；Step 2.3 Use the CNN to calculate the semantic vector Context of the context phrase; specifically, the operation is performed between the vector matrix M obtained in step 2.2 and K convolution kernels of size Cl×Cm;

其中，K表示卷积核的个数，本具体实施方式中K等于128，Cl表示卷积核的长，且Cl=l，Cm表示卷积核的宽，且Cm=m；Wherein, K denotes the number of convolution kernels (K equals 128 in this embodiment), Cl denotes the length of the convolution kernel with Cl=l, and Cm denotes the width of the convolution kernel with Cm=m;

上下文短语的语义向量Context通过公式(1)计算:The semantic vector Context of the context phrase is calculated by formula (1):

Contextk = Σ(p=1..l) Σ(q=1..m) cpq·Mpq + b,  1≤k≤K (1)

Context=[Context1,Context2,...,ContextK]

其中,Contextk表示上下文短语的语义向量的第k维,l表示上下文短语长度,m表示词向量矩阵的宽,即词向量维度,d表示上下文短语中第一个单词的起始位置,cpq是卷积核第p行和第q列的权重参数,Mpq表示向量矩阵M的第p行和第q列数据,b是卷积核的偏置参数;Among them, Context k represents the k-th dimension of the semantic vector of the context phrase, l represents the length of the context phrase, m represents the width of the word vector matrix, that is, the word vector dimension, d represents the starting position of the first word in the context phrase, c pq is the weight parameter of the pth row and qth column of the convolution kernel, M pq represents the pth row and qth column data of the vector matrix M, and b is the bias parameter of the convolution kernel;
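A minimal numpy sketch of step 2.3 as reconstructed in formula (1): each of the K kernels has the same size l×m as the context-phrase matrix M, so each output dimension is an element-wise product summed over all positions plus a bias. The absence of a nonlinearity and the per-kernel bias are assumptions, since the patent does not state them explicitly.

```python
import numpy as np

def context_vector(M, kernels, biases):
    """Step 2.3 sketch: M is the l x m matrix of the context phrase,
    kernels is K x l x m (one l x m kernel per output dimension),
    biases has length K. Returns the K-dimensional Context vector of formula (1)."""
    K = kernels.shape[0]
    ctx = np.empty(K)
    for k in range(K):
        # element-wise product of kernel k with M, summed over all positions, plus bias
        ctx[k] = np.sum(kernels[k] * M) + biases[k]
    return ctx

# toy usage: l=3 context words, m=4-dimensional word vectors, K=5 kernels
l, m, K = 3, 4, 5
M = np.random.randn(l, m)
Context = context_vector(M, np.random.randn(K, l, m), np.zeros(K))
```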

步骤三、利用LSTM模型学习上下文短语的语义,获得隐含层向量hd-l,hd-l+1,...,hdStep 3: Use the LSTM model to learn the semantics of contextual phrases, and obtain hidden layer vectors h dl , h d-l+1 ,...,h d .

具体实现过程如下:The specific implementation process is as follows:

步骤3.1将t赋值为d-l，即t=d-l，t表示第t时刻；Step 3.1 Assign d-l to t, i.e., t=d-l, where t denotes the t-th moment;

步骤3.2将xt赋值为wt的词向量，xt表示第t时刻输入的词向量，wt表示第t时刻输入的单词；Step 3.2 Assign the word vector of wt to xt, where xt denotes the word vector input at time t and wt denotes the word input at time t;

其中,wt的词向量通过步骤2.1输出的词向量矩阵映射得到,即抽取wt在向量矩阵M对应位置的词向量;Among them, the word vector of w t is obtained by mapping the word vector matrix output in step 2.1, that is, extracting the word vector of w t in the corresponding position of the vector matrix M;

步骤3.3将xt作为LSTM模型的输入,获得t时刻的隐含层向量htStep 3.3 takes x t as the input of the LSTM model, and obtains the hidden layer vector h t at time t ;

具体实现过程如下:The specific implementation process is as follows:

步骤3.3.1计算t时刻的遗忘门ft,用于控制遗忘信息,通过公式(2)计算;Step 3.3.1 Calculate the forgetting gate f t at time t , which is used to control the forgotten information, and is calculated by formula (2);

ft=σ(Wfxt+Ufht-1+bf) (2)

其中,Wf表示参数矩阵,xt表示第t时刻输入的词向量,Uf表示参数矩阵,ht-1表示t-1时刻的隐含层向量,bf表示偏置向量参数,当t=d-l时,ht-1=hd-l-1,且hd-l-1为零向量,σ表示Sigmoid函数,是LSTM模型的激活函数;Among them, W f represents the parameter matrix, x t represents the word vector input at time t, U f represents the parameter matrix, h t-1 represents the hidden layer vector at time t-1, b f represents the bias vector parameter, when t = dl, h t-1 = h dl-1 , and h dl-1 is a zero vector, σ represents the Sigmoid function, which is the activation function of the LSTM model;

步骤3.3.2计算t时刻的输入门it,用于控制当前时刻需要添加的新信息,通过公式(3)计算;Step 3.3.2 Calculate the input gate i t at time t, which is used to control the new information that needs to be added at the current time, and is calculated by formula (3);

it=σ(Wixt+Uiht-1+bi) (3)

其中,Wi表示参数矩阵,xt表示第t时刻输入的词向量,Ui表示参数矩阵,ht-1表示t-1时刻的隐含层向量,bi表示偏置向量参数,σ表示Sigmoid函数,是LSTM模型的激活函数;Among them, W i represents the parameter matrix, x t represents the word vector input at time t, U i represents the parameter matrix, h t-1 represents the hidden layer vector at time t-1, bi represents the bias vector parameter, and σ represents The Sigmoid function is the activation function of the LSTM model;

步骤3.3.3计算t时刻更新的信息c̃t，通过公式(4)计算；Step 3.3.3 Calculate the information c̃t updated at time t by formula (4);

c̃t=tanh(Wcxt+Ucht-1+bc) (4)

其中，Wc表示参数矩阵，xt表示第t时刻输入的词向量，Uc表示参数矩阵，ht-1表示t-1时刻的隐含层向量，bc表示偏置向量参数，tanh表示双曲正切函数，是LSTM模型的激活函数；Wherein, Wc denotes a parameter matrix, xt denotes the word vector input at time t, Uc denotes a parameter matrix, ht-1 denotes the hidden layer vector at time t-1, bc denotes a bias vector parameter, and tanh denotes the hyperbolic tangent function, the activation function of the LSTM model;

步骤3.3.4计算t时刻的信息ct，将上一时刻的信息和当前时刻更新的信息相加得到，通过公式(5)计算；Step 3.3.4 Calculate the information ct at time t, obtained by adding the information of the previous moment and the information updated at the current moment, by formula (5);

ct=ft∘ct-1+it∘c̃t (5)

其中，ct表示t时刻的信息，ft表示t时刻的遗忘门，ct-1表示t-1时刻的信息，it表示t时刻的输入门，c̃t表示t时刻更新的信息，∘表示向量的逐元素相乘；Wherein, ct denotes the information at time t, ft the forget gate at time t, ct-1 the information at time t-1, it the input gate at time t, c̃t the information updated at time t, and ∘ denotes the element-wise (Hadamard) product of vectors;

步骤3.3.5计算t时刻的输出门ot,用于控制输入信息,通过公式(6)计算:Step 3.3.5 Calculate the output gate o t at time t , which is used to control the input information, calculated by formula (6):

ot=σ(Woxt+Uoht-1+bo) (6)

其中，Wo表示参数矩阵，xt表示第t时刻输入的词向量，Uo表示参数矩阵，ht-1表示t-1时刻的隐含层向量，bo表示偏置向量参数，σ表示Sigmoid函数，是LSTM模型的激活函数；其中，步骤3.3.1-3.3.3和步骤3.3.5中的参数矩阵Wf,Uf,Wi,Ui,Wc,Uc,Wo,Uo的矩阵元素大小不同，偏置向量参数bf,bi,bc,bo中的元素大小不同；Wherein, Wo denotes a parameter matrix, xt denotes the word vector input at time t, Uo denotes a parameter matrix, ht-1 denotes the hidden layer vector at time t-1, bo denotes a bias vector parameter, and σ denotes the Sigmoid function, the activation function of the LSTM model; the parameter matrices Wf, Uf, Wi, Ui, Wc, Uc, Wo, Uo in steps 3.3.1-3.3.3 and 3.3.5 have different matrix elements, and the elements of the bias vector parameters bf, bi, bc, bo also differ;

步骤3.3.6计算t时刻的隐含层向量ht，通过公式(7)计算：Step 3.3.6 Calculate the hidden layer vector ht at time t by formula (7):

ht=ot∘ct (7)

其中,ot表示t时刻的输出门,ct表示t时刻的信息;Among them, o t represents the output gate at time t, and c t represents the information at time t;

步骤3.4判断t是否等于d,若不等于则t加1,跳步骤3.2;若等于,则输出隐含层向量hd-l,hd-l+1,...,hd,跳入步骤四;Step 3.4 Determine whether t is equal to d, if not, add 1 to t, and skip to step 3.2; if it is equal, output hidden layer vectors h dl , h d-l+1 ,...,h d , and jump to step 4 ;
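A minimal numpy sketch of one LSTM time step implementing formulas (2) through (7) as written above; the parameter names Wc, Uc, bc for the candidate information and the equal input/hidden dimensions are assumptions. Note that formula (7) multiplies ot directly with ct, without the tanh found in many textbook LSTM formulations, and the sketch follows the patent's formula.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, P):
    """One time step of step three. P holds parameter matrices/vectors
    W*, U*, b* for the forget, input, candidate and output gates (assumed names)."""
    f_t = sigmoid(P["Wf"] @ x_t + P["Uf"] @ h_prev + P["bf"])        # formula (2)
    i_t = sigmoid(P["Wi"] @ x_t + P["Ui"] @ h_prev + P["bi"])        # formula (3)
    c_tilde = np.tanh(P["Wc"] @ x_t + P["Uc"] @ h_prev + P["bc"])    # formula (4)
    c_t = f_t * c_prev + i_t * c_tilde                               # formula (5)
    o_t = sigmoid(P["Wo"] @ x_t + P["Uo"] @ h_prev + P["bo"])        # formula (6)
    h_t = o_t * c_t                                                  # formula (7) as written in the patent
    return h_t, c_t

# toy usage over a context phrase of word vectors, hidden size = word-vector size m
m = 4
P = {k: np.random.randn(m, m) * 0.1 for k in ("Wf", "Uf", "Wi", "Ui", "Wc", "Uc", "Wo", "Uo")}
P.update({k: np.zeros(m) for k in ("bf", "bi", "bc", "bo")})
h, c = np.zeros(m), np.zeros(m)
hidden_states = []
for x in np.random.randn(5, m):       # 5 context words
    h, c = lstm_step(x, h, c, P)
    hidden_states.append(h)
```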

步骤四、利用注意力机制，将CNN和LSTM模型结合，获得上下文短语语义向量的平均值，具体实现过程如下：Step 4. Use the attention mechanism to combine the CNN and LSTM models to obtain the average of the context phrase semantic vectors; the specific implementation process is as follows:

步骤4.1利用步骤二得到的上下文短语语义向量,通过注意力机制得到每个单词在上下文短语的语义向量上的重要性因子α,具体通过公式(8)计算:Step 4.1 Use the semantic vector of the context phrase obtained in step 2 to obtain the importance factor α of each word on the semantic vector of the context phrase through the attention mechanism, specifically calculated by formula (8):

αt = e^(ContextT·xt) / Σ(i=d-l..d) e^(ContextT·xi),  d-l≤t≤d

α=[αd-l,αd-l+1,...,αd] (8)

其中,αt表示t时刻单词在上下文短语的语义向量上的重要性因子,Context表示步骤二中获得的上下文短语的语义向量,xt表示第t时刻输入的词向量,xi表示第i时刻输入的词向量;T表示向量的转置;e表示以e,即自然常数为底的指数函数;Among them, α t represents the importance factor of the word on the semantic vector of the context phrase at time t, Context represents the semantic vector of the context phrase obtained in step 2, x t represents the word vector input at time t, and x i represents time i The input word vector; T represents the transposition of the vector; e represents an exponential function based on e, which is a natural constant;

步骤4.2计算基于注意力机制带权重的隐含层向量h′,通过公式(9)计算;Step 4.2 calculates the weighted hidden layer vector h' based on the attention mechanism, which is calculated by formula (9);

h′t=αt*ht,  d-l≤t≤d

h′=[h′d-l,h′d-l+1,...,h′d] (9)

其中,h′t表示t时刻权重隐含层向量h′t,αt表示t时刻每个单词在上下文短语的语义向量上的重要性因子,ht表示t时刻隐含层向量;Among them, h′ t represents the weight hidden layer vector h′ t at time t , α t represents the importance factor of each word on the semantic vector of the context phrase at time t, and h t represents the hidden layer vector at time t;

步骤4.3利用mean-pooling操作，计算上下文短语语义向量的平均值h̄，通过公式(10)计算：Step 4.3 Use the mean-pooling operation to calculate the average h̄ of the context phrase semantic vectors by formula (10):

h̄ = (1/(l+1))·Σ(t=d-l..d) h′t (10)

其中，h′t表示t时刻权重隐含层向量；Wherein, h′t denotes the weighted hidden layer vector at time t;
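A minimal numpy sketch of step four as reconstructed in formulas (8)-(10): a softmax of Context·xt gives the importance factors, which weight the LSTM hidden vectors before mean pooling. The dot-product form of the attention score and the division by the number of context words are assumptions consistent with the symbol descriptions.

```python
import numpy as np

def attention_pool(Context_vec, X, H):
    """Step four sketch: X holds the context-phrase word vectors (one per row),
    H the corresponding LSTM hidden vectors; Context_vec is the CNN output of step two.
    Returns the importance factors alpha and the mean-pooled vector of formula (10)."""
    scores = X @ Context_vec                         # Context^T x_t for every t
    alpha = np.exp(scores) / np.sum(np.exp(scores))  # formula (8): softmax importance factors
    H_weighted = alpha[:, None] * H                  # formula (9): weighted hidden vectors h'_t
    h_bar = H_weighted.mean(axis=0)                  # formula (10): mean pooling
    return alpha, h_bar

# toy usage: 5 context words, 4-dimensional word/hidden/Context vectors (assumed equal sizes)
X = np.random.randn(5, 4)
H = np.random.randn(5, 4)
Context_vec = np.random.randn(4)
alpha, h_bar = attention_pool(Context_vec, X, H)
```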

步骤五、通过逻辑回归的方法,利用上下文短语语义向量的平均值和文档主题信息预测目标单词wd+1,获得目标单词wd+1的预测概率。具体实现过程如下:Step 5: Predict the target word w d+1 by using the mean value of the semantic vector of the context phrase and the topic information of the document by means of logistic regression, and obtain the predicted probability of the target word w d+1 . The specific implementation process is as follows:

步骤5.1利用LDA算法学习文档主题映射矩阵，然后根据文档主题映射矩阵和docid将每一个文档映射成一个长度和步骤2.1中词向量矩阵宽度相等的一维向量Dz；Step 5.1 Use the LDA algorithm to learn the document topic mapping matrix, and then, according to the document topic mapping matrix and docid, map each document into a one-dimensional vector Dz whose length equals the width of the word vector matrix in step 2.1;
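As an illustration of step 5.1 (not part of the patent text), the document-topic mapping could be learned with gensim's LDA implementation; setting the number of topics equal to the word-vector width m so that Dz has the required length is an assumption, as are the toy corpus and the pass count.

```python
from gensim import corpora
from gensim.models import LdaModel
import numpy as np

tokenized_docs = [["topic", "vector", "extraction"], ["lstm", "cnn", "attention", "topic"]]
dictionary = corpora.Dictionary(tokenized_docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

m = 128                                    # assumed: number of topics = word-vector width m
lda = LdaModel(bow_corpus, num_topics=m, id2word=dictionary, passes=5)

def topic_vector(doc_id):
    """Dense m-dimensional topic distribution Dz for the document at position doc_id."""
    dz = np.zeros(m)
    for topic, prob in lda.get_document_topics(bow_corpus[doc_id], minimum_probability=0.0):
        dz[topic] = prob
    return dz
```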

步骤5.2将步骤5.1输出的向量Dz和步骤四输出的上下文短语语义向量的平均值拼接起来，得到拼接向量Vd；Step 5.2 Concatenate the vector Dz output in step 5.1 with the average of the context phrase semantic vectors output in step 4 to obtain the concatenated vector Vd;

步骤5.3利用步骤5.2输出的Vd来预测目标单词wd+1。具体通过逻辑回归的方法进行分类，目标函数如公式(11)：Step 5.3 Use Vd output in step 5.2 to predict the target word wd+1, performing classification by logistic regression; the objective function is given by formula (11):

P(y=wd+1|Vd) = exp(θd+1T·Vd) / Σ(i=1..|V|) exp(θiT·Vd) (11)

其中，θd+1是目标单词wd+1所在位置对应的参数，θi对应词表中单词wi对应的参数，|V|表示词表的大小，Vd是步骤5.2得到的拼接向量，exp表示以e为底的指数函数，Σ表示求和；P表示概率，y表示因变量，T表示矩阵转置。Wherein, θd+1 is the parameter corresponding to the position of the target word wd+1, θi is the parameter corresponding to word wi in the vocabulary, |V| denotes the vocabulary size, Vd is the concatenated vector obtained in step 5.2, exp denotes the exponential function with base e, Σ denotes summation; P denotes probability, y denotes the dependent variable, and T denotes matrix transposition.

步骤5.4利用交叉熵的方法,通过公式(12)计算目标函数(11)的损失函数:Step 5.4 uses the method of cross entropy to calculate the loss function of the objective function (11) by formula (12):

L=-log(P(y=wd+1|Vd)) (12)

其中，wd+1表示目标单词，Vd是步骤5.2得到的拼接向量，log()表示以10为底的对数函数；Wherein, wd+1 denotes the target word, Vd is the concatenated vector obtained in step 5.2, and log() denotes the base-10 logarithmic function;

损失函数(12)通过Sampled Softmax算法和小批量随机梯度下降参数更新方法进行更新求解,得到文档主题向量。The loss function (12) is updated and solved by the Sampled Softmax algorithm and the mini-batch stochastic gradient descent parameter update method to obtain the document topic vector.
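A minimal numpy sketch of steps 5.3-5.4 as reconstructed in formulas (11)-(12): a softmax over per-word parameters θi scores the concatenated vector Vd, and the loss is the negative log probability of the target word (base-10 log, as the patent states). The parameter matrix Theta and its initialization are assumptions.

```python
import numpy as np

def predict_and_loss(V_d, Theta, target_idx):
    """Step 5.3/5.4 sketch: Theta is a |V| x dim parameter matrix (one row theta_i per
    vocabulary word), V_d is the concatenated vector. Returns the probability of the
    target word (formula (11)) and the cross-entropy loss (formula (12))."""
    logits = Theta @ V_d
    probs = np.exp(logits - logits.max())            # numerically stable softmax
    probs = probs / probs.sum()
    p_target = probs[target_idx]                     # P(y = w_{d+1} | V_d)
    loss = -np.log10(p_target)                       # base-10 log, as stated in the patent
    return p_target, loss

# toy usage: vocabulary of 20000 words, 256-dimensional V_d (h_bar and Dz concatenated)
vocab = 20000
V_d = np.random.randn(256)
Theta = np.random.randn(vocab, 256) * 0.01
p, L = predict_and_loss(V_d, Theta, target_idx=42)
```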

至此，从步骤一到步骤五，完成了具有深层语义和显著性的文档主题向量的学习。At this point, through steps one to five, a document topic vector with deep semantics and salience has been learned.

实施例Example

本实施例叙述了本发明的具体实施过程,如图1所示。This embodiment describes the specific implementation process of the present invention, as shown in FIG. 1 .

从图1可以看出,本发明一种基于深度学习的文档主题向量抽取方法的流程如下:As can be seen from Figure 1, the process of a method for extracting document topic vectors based on deep learning in the present invention is as follows:

步骤A预处理；首先去除掉语料中的无意义符号，如特殊字符等，然后对文本进行分词。分词就是将连续的文字序列按照既定的词法规则分割成单独的词语的过程，从而将句子分解为若干个连续的有意义的单词串用于后续分析。分词操作利用PTB分词器进行分词处理。在分词之后，对原始的文本构建词表，在本实施例中，词表选取的是训练文本中词频从高到低的前20000个单词，也就是词表V的大小为20000。在词表选取之后，按照词表的索引构建出原始语料的词表索引数据，这个文本词表索引数据作为模型的输入。Step A: preprocessing. First, meaningless symbols such as special characters are removed from the corpus, and the text is then tokenized. Tokenization is the process of splitting a continuous character sequence into individual words according to established lexical rules, so that a sentence is decomposed into several consecutive meaningful word strings for subsequent analysis. Tokenization is performed with the PTB tokenizer. After tokenization, a vocabulary is built from the original text; in this embodiment the vocabulary consists of the 20,000 most frequent words of the training text, i.e., the size of the vocabulary V is 20,000. After the vocabulary is selected, the vocabulary index data of the original corpus is constructed according to the vocabulary indices, and this indexed text data serves as the input of the model.
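A minimal sketch of step A under simplifying assumptions: whitespace splitting stands in for the PTB tokenizer, the cleaning regex is illustrative, and only the 20,000 most frequent words are indexed, with everything else mapped to an out-of-vocabulary index.

```python
import re
from collections import Counter

def preprocess(raw_docs, vocab_size=20000):
    """Step A sketch: strip special characters, tokenize, keep the vocab_size most
    frequent words, and convert the corpus to vocabulary indices."""
    tokenized = [re.sub(r"[^\w\s]", " ", doc).lower().split() for doc in raw_docs]
    counts = Counter(tok for doc in tokenized for tok in doc)
    vocab = {w: i for i, (w, _) in enumerate(counts.most_common(vocab_size))}
    unk = len(vocab)                                   # out-of-vocabulary index
    indexed = [[vocab.get(tok, unk) for tok in doc] for doc in tokenized]
    return vocab, indexed

vocab, indexed_corpus = preprocess(["A deep-learning based topic model!", "CNN and LSTM with attention."])
```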

步骤B利用word2vec算法学习词向量。将文档中单词输入到word2vec算法中,得到词向量,其目标函数如公式(13):Step B uses the word2vec algorithm to learn word vectors. Input the words in the document into the word2vec algorithm to get the word vector, and its objective function is as formula (13):

其中，k为窗口单词，i为当前单词，Corp为语料库中单词总数，利用梯度下降方法学习得到128维的词向量；Wherein, k is a window word, i is the current word, and Corp is the total number of words in the corpus; 128-dimensional word vectors are learned by gradient descent;

步骤C利用CNN抽取上下文短语语义向量,利用RNN学习上下文短语隐含层向量;Step C uses CNN to extract contextual phrase semantic vectors, and uses RNN to learn contextual phrase hidden layer vectors;

其中,利用CNN抽取上下文短语语义向量,利用RNN学习上下文短语隐含层向量是并列计算的,具体到本实施例:Wherein, using CNN to extract contextual phrase semantic vectors, using RNN to learn contextual phrase hidden layer vectors are calculated in parallel, specific to this embodiment:

利用CNN抽取上下文短语语义向量；首先利用高斯分布进行随机初始化一个K层的大小为Cl×Cm的卷积核，对于给定的上下文短语wd-l,wd-l+1,...,wd，通过步骤B学到的词向量将这些上下文短语映射成大小为l×m的矩阵，其中，l是上下文短语的长度，m为词向量的维度，将该矩阵在随机初始化的卷积核上进行卷积操作，具体操作方式如公式(1)所示，这样就得到了一个向量Context，该向量就是上下文短语的语义向量；Use the CNN to extract the context phrase semantic vector. First, a K-layer convolution kernel of size Cl×Cm is randomly initialized from a Gaussian distribution. For a given context phrase wd-l,wd-l+1,...,wd, the word vectors learned in step B map the context phrase into a matrix of size l×m, where l is the length of the context phrase and m is the dimension of the word vectors. The convolution operation is then performed between this matrix and the randomly initialized convolution kernels, as shown in formula (1), yielding a vector Context, which is the semantic vector of the context phrase;

利用RNN学习上下文短语隐含层向量；将上下文短语wd-l,wd-l+1,...,wd对应的词向量按顺序输入到LSTM模型中，将0时刻的隐含层向量h0的每一个维度设置为0，然后利用公式(2)-(7)依次计算遗忘门、输入门、输出门和最终的上下文短语隐含层向量，维度大小设置为128；Use the RNN to learn the context phrase hidden layer vectors: the word vectors corresponding to the context phrase wd-l,wd-l+1,...,wd are input into the LSTM model in order, every dimension of the hidden layer vector h0 at time 0 is set to 0, and formulas (2)-(7) are then used to compute in turn the forget gate, the input gate, the output gate and the resulting context phrase hidden layer vectors; the dimension size is set to 128;

步骤D利用注意力机制计算带权重的语义向量、计算文档主题分布;Step D uses the attention mechanism to calculate the weighted semantic vector and calculate the topic distribution of the document;

其中,利用注意力机制计算带权重的语义向量、计算文档主题分布是并列计算的,具体到本实施例:Among them, the use of the attention mechanism to calculate the weighted semantic vector and the calculation of the topic distribution of the document are calculated in parallel, specific to this embodiment:

利用注意力机制计算带权重的语义向量；根据步骤B得到的词向量和步骤C得到的上下文短语的语义向量，对于上下文短语中的每一个单词进行注意力机制操作，求得注意力因子αt，αt是一个在0到1之间的实数，其数字越大，那么其对应位置的词向量信息就会越多地保留在最后的mean-pooling层中，因此其大小表示了当前词语在表征整个短语意义时的重要性，也就是说越重要的词语将会得到更多的注意；Use the attention mechanism to compute the weighted semantic vectors: based on the word vectors obtained in step B and the context phrase semantic vector obtained in step C, the attention mechanism is applied to each word in the context phrase to obtain the attention factor αt. αt is a real number between 0 and 1; the larger its value, the more of the word vector information at the corresponding position is retained in the final mean-pooling layer, so its magnitude indicates how important the current word is for representing the meaning of the whole phrase; in other words, more important words receive more attention;

计算文档主题分布;具体是利用LDA算法计算,首先将文档D输入到LDA算法中,得到每一个文档D的主题分布,该主题分布直接作为最终结果,记为DzCalculate the document topic distribution; specifically, use the LDA algorithm to calculate, first input the document D into the LDA algorithm, and obtain the topic distribution of each document D, and the topic distribution is directly used as the final result, which is recorded as D z ;

步骤E预测目标单词，学习文档主题向量；将带权重的语义向量和Dz直接拼接起来，然后使得目标单词出现的概率最大，由Sampled Softmax算法和小批量随机梯度下降参数更新方法即可求得文档主题向量。Step E: predict the target word and learn the document topic vector. The weighted semantic vector and Dz are concatenated directly, and the probability of the target word is then maximized; the document topic vector is obtained with the Sampled Softmax algorithm and the mini-batch stochastic gradient descent parameter update method.
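As an illustration of the training objective in step E (not part of the patent text), tf.nn.sampled_softmax_loss is one common implementation of Sampled Softmax; the TensorFlow 2 API, the hyperparameters, and updating only the softmax parameters here (a full model would also include the CNN, LSTM and embedding parameters) are assumptions.

```python
import tensorflow as tf

vocab_size, dim, num_sampled = 20000, 256, 64

# softmax parameters theta (one row per vocabulary word) and biases
softmax_w = tf.Variable(tf.random.normal([vocab_size, dim], stddev=0.01))
softmax_b = tf.Variable(tf.zeros([vocab_size]))

optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)  # mini-batch stochastic gradient descent

def train_step(V_d, target_ids):
    """V_d: [batch, dim] concatenated vectors; target_ids: [batch, 1] indices of w_{d+1}."""
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(
            tf.nn.sampled_softmax_loss(
                weights=softmax_w, biases=softmax_b,
                labels=target_ids, inputs=V_d,
                num_sampled=num_sampled, num_classes=vocab_size))
    grads = tape.gradient(loss, [softmax_w, softmax_b])
    optimizer.apply_gradients(zip(grads, [softmax_w, softmax_b]))
    return loss

# toy batch of 32 concatenated vectors and random target word indices
loss = train_step(tf.random.normal([32, dim]),
                  tf.random.uniform([32, 1], maxval=vocab_size, dtype=tf.int64))
```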

Claims (5)

1. A document topic vector extraction method based on deep learning, characterized by comprising the following steps:
Step one, making the following definitions:
Definition 1: document D, D=[w1,w2,...,wi,...,wn], where wi denotes the i-th word of document D;
Definition 2: the predicted word wd+1, denoting the target word to be learned;
Definition 3: window words, which are formed by words appearing consecutively in the text, with hidden internal associations existing between the window words;
Definition 4: the context phrase wd-l,wd-l+1,...,wd, i.e., the window words appearing before the position of the predicted word, the length of the context phrase being l;
Definition 5: the document topic mapping matrix, obtained by learning with the LDA algorithm, each row representing the topic of one document;
Definition 6: Nd and docid, where Nd denotes the number of documents in the corpus and docid denotes the position of a document; each document corresponds to a unique docid, with 1≤docid≤Nd;
Step two, learning the semantic vector of the context phrase by using a convolutional neural network (CNN);
Step three, learning the semantics of the context phrase by using the long short-term memory network model LSTM to obtain the hidden layer vectors hd-l,hd-l+1,...,hd; the specific implementation process is as follows:
Step 3.1: assign d-l to t, i.e., t=d-l, where t denotes the t-th moment;
Step 3.2: assign the word vector of wt to xt, where xt denotes the word vector input at time t and wt denotes the word input at time t;
wherein the word vector of wt is obtained by mapping through the word vector matrix output in step 2.1, i.e., by extracting the word vector of wt at the corresponding position of the vector matrix M;
Step 3.3: take xt as the input of the LSTM model to obtain the hidden layer vector ht at time t;
Step 3.4: judge whether t is equal to d; if not, add 1 to t and return to step 3.2; if so, output the hidden layer vectors hd-l,hd-l+1,...,hd and go to step four;
Step four, organically combining the CNN and LSTM models through an attention mechanism to obtain the average of the context phrase semantic vectors; the specific implementation method comprises the following steps:
Step 4.1: using the context phrase semantic vector obtained in step two, obtain the importance factor α of each word on the semantic vector of the context phrase through the attention mechanism, calculated by the following formula:
αt = e^(ContextT·xt) / Σ(i=d-l..d) e^(ContextT·xi),  d-l≤t≤d
α=[αd-l,αd-l+1,...,αd]
wherein αt denotes the importance factor of the word at time t on the semantic vector of the context phrase, Context denotes the semantic vector of the context phrase obtained in step two, xt denotes the word vector input at time t, xi denotes the word vector input at time i, T denotes the transposition of a vector, and e denotes the exponential function with the natural constant e as its base;
Step 4.2: calculate the weighted hidden layer vector h′ based on the attention mechanism by the following formulas:
h′t=αt*ht,  d-l≤t≤d
h′=[h′d-l,h′d-l+1,...,h′d]
wherein h′t denotes the weighted hidden layer vector at time t, αt denotes the importance factor of each word at time t on the semantic vector of the context phrase, and ht denotes the hidden layer vector at time t;
Step 4.3: calculate the average h̄ of the context phrase semantic vectors using the mean-pooling operation, by the following formula (10):
h̄ = (1/(l+1))·Σ(t=d-l..d) h′t (10)
wherein h′t denotes the weighted hidden layer vector at time t;
Step five, using logistic regression, predicting the target word wd+1 from the average of the context phrase semantic vectors and the document topic information, and obtaining the predicted probability of the target word wd+1.
2. The deep learning-based document topic vector extraction method according to claim 1, characterized in that step two is implemented by the following steps:
Step 2.1: train the word vector matrix of document D, the size of the word vector matrix being n×m, where n denotes the length of the word vector matrix and m denotes its width;
Step 2.2: extract the word vector corresponding to each word in the context phrase from the word vector matrix obtained in step 2.1 to obtain the vector matrix M of the context phrase wd-l,wd-l+1,...,wd;
Step 2.3: calculate the semantic vector Context of the context phrase by using the CNN, operating on the vector matrix M obtained in step 2.2 with K convolution kernels of size Cl×Cm;
wherein K denotes the number of convolution kernels, Cl denotes the length of the convolution kernel with Cl=l, and Cm denotes the width of the convolution kernel with Cm=m;
the semantic vector Context of the context phrase is calculated by formula (1):
Contextk = Σ(p=1..l) Σ(q=1..m) cpq·Mpq + b,  1≤k≤K (1)
Context=[Context1,Context2,...,ContextK]
wherein Contextk denotes the k-th dimension of the semantic vector of the context phrase, l denotes the length of the context phrase, m denotes the width of the word vector matrix, i.e., the word vector dimension, d denotes the starting position of the first word in the context phrase, cpq is the weight parameter at row p and column q of the convolution kernel, Mpq denotes the data at row p and column q of the vector matrix M, and b is the bias parameter of the convolution kernel.
3. The deep learning-based document topic vector extraction method according to claim 1, characterized in that step 3.3 is implemented as follows:
Step 3.3.1: calculate the forget gate ft at time t, used to control the forgotten information, by formula (2):
ft=σ(Wfxt+Ufht-1+bf) (2)
wherein Wf denotes a parameter matrix, xt denotes the word vector input at time t, Uf denotes a parameter matrix, ht-1 denotes the hidden layer vector at time t-1, bf denotes a bias vector parameter; when t=d-l, ht-1=hd-l-1 and hd-l-1 is a zero vector; σ denotes the Sigmoid function, the activation function of the LSTM model;
Step 3.3.2: calculate the input gate it at time t, used to control the new information to be added at the current moment, by formula (3):
it=σ(Wixt+Uiht-1+bi) (3)
wherein Wi denotes a parameter matrix, xt denotes the word vector input at time t, Ui denotes a parameter matrix, ht-1 denotes the hidden layer vector at time t-1, bi denotes a bias vector parameter, and σ denotes the Sigmoid function, the activation function of the LSTM model;
Step 3.3.3: calculate the information c̃t updated at time t by formula (4):
c̃t=tanh(Wcxt+Ucht-1+bc) (4)
wherein Wc denotes a parameter matrix, xt denotes the word vector input at time t, Uc denotes a parameter matrix, ht-1 denotes the hidden layer vector at time t-1, bc denotes a bias vector parameter, and tanh denotes the hyperbolic tangent function, the activation function of the LSTM model;
Step 3.3.4: calculate the information ct at time t, obtained by adding the information of the previous moment and the information updated at the current moment, by formula (5):
ct=ft∘ct-1+it∘c̃t (5)
wherein ct denotes the information at time t, ft denotes the forget gate at time t, ct-1 denotes the information at time t-1, it denotes the input gate at time t, c̃t denotes the information updated at time t, and ∘ denotes the element-wise product of vectors;
Step 3.3.5: calculate the output gate ot at time t, used to control the input information, by formula (6):
ot=σ(Woxt+Uoht-1+bo) (6)
wherein Wo denotes a parameter matrix, xt denotes the word vector input at time t, Uo denotes a parameter matrix, ht-1 denotes the hidden layer vector at time t-1, bo denotes a bias vector parameter, and σ denotes the Sigmoid function, the activation function of the LSTM model; the parameter matrices Wf, Uf, Wi, Ui, Wc, Uc, Wo, Uo have different matrix elements, and the elements of the bias vector parameters bf, bi, bc, bo also differ;
Step 3.3.6: calculate the hidden layer vector ht at time t by formula (7):
ht=ot∘ct (7)
wherein ot denotes the output gate at time t and ct denotes the information at time t.
4. The deep learning-based document topic vector extraction method according to claim 1, characterized in that step five is implemented as follows:
Step 5.1: learn the document topic mapping matrix, and then, according to the document topic mapping matrix and docid, map each document into a one-dimensional vector Dz whose length equals the width of the word vector matrix in step 2.1;
Step 5.2: concatenate the vector Dz output in step 5.1 with the average of the context phrase semantic vectors output in step four to obtain the concatenated vector Vd;
Step 5.3: use Vd output in step 5.2 to predict the target word wd+1;
Step 5.4: calculate the loss function of the objective function (11) by the cross-entropy method, through formula (12):
L=-log(P(y=wd+1|Vd)) (12)
wherein wd+1 denotes the target word, Vd is the concatenated vector of step 5.2, and log() denotes the base-10 logarithmic function;
the loss function (12) is updated and solved by the Sampled Softmax algorithm and the mini-batch stochastic gradient descent parameter update method to obtain the document topic vector.
5. The deep learning-based document topic vector extraction method according to claim 4, characterized in that in step 5.3 classification is performed by logistic regression, and the objective function is given by formula (11):
P(y=wd+1|Vd) = exp(θd+1T·Vd) / Σ(i=1..|V|) exp(θiT·Vd) (11)
wherein θd+1 is the parameter corresponding to the position of the target word wd+1, θi is the parameter corresponding to the word wi in the vocabulary, |V| denotes the size of the vocabulary, Vd is the concatenated vector obtained in step 5.2, exp denotes the exponential function with base e, Σ denotes summation; P denotes probability, y denotes the dependent variable, and T denotes matrix transposition.
CN201810748564.1A 2018-07-10 2018-07-10 A deep learning-based document topic vector extraction method Active CN108984526B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810748564.1A CN108984526B (en) 2018-07-10 2018-07-10 A deep learning-based document topic vector extraction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810748564.1A CN108984526B (en) 2018-07-10 2018-07-10 A deep learning-based document topic vector extraction method

Publications (2)

Publication Number Publication Date
CN108984526A true CN108984526A (en) 2018-12-11
CN108984526B CN108984526B (en) 2021-05-07

Family

ID=64536620

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810748564.1A Active CN108984526B (en) 2018-07-10 2018-07-10 A deep learning-based document topic vector extraction method

Country Status (1)

Country Link
CN (1) CN108984526B (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105069143A (en) * 2015-08-19 2015-11-18 百度在线网络技术(北京)有限公司 Method and device for extracting keywords from document
CN106547735A (en) * 2016-10-25 2017-03-29 复旦大学 The structure and using method of the dynamic word or word vector based on the context-aware of deep learning
CN106909537A (en) * 2017-02-07 2017-06-30 中山大学 A kind of polysemy analysis method based on topic model and vector space
CN106919557A (en) * 2017-02-22 2017-07-04 中山大学 A kind of document vector generation method of combination topic model
CN107092596A (en) * 2017-04-24 2017-08-25 重庆邮电大学 Text emotion analysis method based on attention CNNs and CCR
CN107423282A (en) * 2017-05-24 2017-12-01 南京大学 Semantic Coherence Sexual Themes and the concurrent extracting method of term vector in text based on composite character
CN107562792A (en) * 2017-07-31 2018-01-09 同济大学 A kind of question and answer matching process based on deep learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
GUANGXU XUN 等: "Topic Discovery for Short Texts Using Word Embeddings", 《2016 IEEE 16TH INTERNATIONAL CONFERENCE ON DATA MINING》 *
胡朝举 等: "基于词向量技术和混合神经网络的情感分析", 《计算机应用研究》 *

Cited By (41)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109871532A (en) * 2019-01-04 2019-06-11 平安科技(深圳)有限公司 Text subject extracting method, device and storage medium
CN111414483B (en) * 2019-01-04 2023-03-28 阿里巴巴集团控股有限公司 Document processing device and method
CN111414483A (en) * 2019-01-04 2020-07-14 阿里巴巴集团控股有限公司 Document processing device and method
WO2020140633A1 (en) * 2019-01-04 2020-07-09 平安科技(深圳)有限公司 Text topic extraction method, apparatus, electronic device, and storage medium
CN109960802A (en) * 2019-03-19 2019-07-02 四川大学 Information processing method and device for narrative text of aviation safety report
CN109933804A (en) * 2019-03-27 2019-06-25 北京信息科技大学 A Keyword Extraction Method by Fusing Topic Information and Bidirectional LSTM
CN110334358A (en) * 2019-04-28 2019-10-15 厦门大学 A Context-aware Phrase Representation Learning Method
CN110083710A (en) * 2019-04-30 2019-08-02 北京工业大学 It is a kind of that generation method is defined based on Recognition with Recurrent Neural Network and the word of latent variable structure
CN110083710B (en) * 2019-04-30 2021-04-02 北京工业大学 A Word Definition Generation Method Based on Recurrent Neural Network and Latent Variable Structure
CN110532395B (en) * 2019-05-13 2021-09-28 南京大学 Semantic embedding-based word vector improvement model establishing method
CN110532395A (en) * 2019-05-13 2019-12-03 南京大学 A kind of method for building up of the term vector improved model based on semantic embedding
CN110825848A (en) * 2019-06-10 2020-02-21 北京理工大学 Text classification method based on phrase vectors
CN110825848B (en) * 2019-06-10 2022-08-09 北京理工大学 Text classification method based on phrase vectors
CN110263343B (en) * 2019-06-24 2021-06-15 北京理工大学 Method and system for keyword extraction based on phrase vector
CN110263343A (en) * 2019-06-24 2019-09-20 北京理工大学 The keyword abstraction method and system of phrase-based vector
CN110457674A (en) * 2019-06-25 2019-11-15 西安电子科技大学 A Topic-Guided Text Prediction Method
CN110378409A (en) * 2019-07-15 2019-10-25 昆明理工大学 It is a kind of based on element association attention mechanism the Chinese get over news documents abstraction generating method
CN110472047B (en) * 2019-07-15 2022-12-13 昆明理工大学 A Chinese-Vietnamese News Opinion Sentence Extraction Method Based on Multi-Feature Fusion
CN110472047A (en) * 2019-07-15 2019-11-19 昆明理工大学 A kind of Chinese of multiple features fusion gets over news viewpoint sentence abstracting method
CN110781256A (en) * 2019-08-30 2020-02-11 腾讯大地通途(北京)科技有限公司 Method and device for determining POI (Point of interest) matched with Wi-Fi (Wireless Fidelity) based on transmitted position data
CN110781256B (en) * 2019-08-30 2024-02-23 腾讯大地通途(北京)科技有限公司 Method and device for determining POI matched with Wi-Fi based on sending position data
CN110766073B (en) * 2019-10-22 2023-10-27 湖南科技大学 Mobile application classification method for strengthening topic attention mechanism
CN110766073A (en) * 2019-10-22 2020-02-07 湖南科技大学 Mobile application classification method for strengthening topic attention mechanism
CN111125434B (en) * 2019-11-26 2023-06-27 北京理工大学 A method and system for relation extraction based on ensemble learning
CN111125434A (en) * 2019-11-26 2020-05-08 北京理工大学 A method and system for relation extraction based on ensemble learning
WO2021155705A1 (en) * 2020-02-06 2021-08-12 支付宝(杭州)信息技术有限公司 Text prediction model training method and apparatus
CN111696624A (en) * 2020-06-08 2020-09-22 天津大学 DNA binding protein identification and function annotation deep learning method based on self-attention mechanism
CN111696624B (en) * 2020-06-08 2022-07-12 天津大学 Deep learning method for DNA-binding protein identification and functional annotation based on self-attention mechanism
CN111753540B (en) * 2020-06-24 2023-04-07 云南电网有限责任公司信息中心 Method and system for collecting text data to perform Natural Language Processing (NLP)
CN111753540A (en) * 2020-06-24 2020-10-09 云南电网有限责任公司信息中心 Method and system for collecting text data to perform Natural Language Processing (NLP)
CN112597311B (en) * 2020-12-28 2023-07-11 东方红卫星移动通信有限公司 Terminal information classification method and system based on low-orbit satellite communication
CN112597311A (en) * 2020-12-28 2021-04-02 东方红卫星移动通信有限公司 Terminal information classification method and system based on low-earth-orbit satellite communication
CN112632966A (en) * 2020-12-30 2021-04-09 绿盟科技集团股份有限公司 Alarm information marking method, device, medium and equipment
CN112685538B (en) * 2020-12-30 2022-10-14 北京理工大学 Text vector retrieval method combined with external knowledge
CN112632966B (en) * 2020-12-30 2023-07-21 绿盟科技集团股份有限公司 Alarm information marking method, device, medium and equipment
CN112685538A (en) * 2020-12-30 2021-04-20 北京理工大学 Text vector retrieval method combined with external knowledge
CN112699662B (en) * 2020-12-31 2022-08-16 太原理工大学 False information early detection method based on text structure algorithm
CN112699662A (en) * 2020-12-31 2021-04-23 太原理工大学 False information early detection method based on text structure algorithm
CN112966551A (en) * 2021-01-29 2021-06-15 湖南科技学院 Method and device for acquiring video frame description information and electronic equipment
CN115763167A (en) * 2022-11-22 2023-03-07 黄华集团有限公司 Solid cabinet breaker and control method thereof
CN115763167B (en) * 2022-11-22 2023-09-22 黄华集团有限公司 Solid cabinet circuit breaker and control method thereof

Also Published As

Publication number Publication date
CN108984526B (en) 2021-05-07

Similar Documents

Publication Publication Date Title
CN108984526B (en) A deep learning-based document topic vector extraction method
CN111967266B (en) Chinese named entity recognition system, model construction method, application and related equipment
CN110866117B (en) Short text classification method based on semantic enhancement and multi-level label embedding
CN109977416B (en) Multi-level natural language anti-spam text method and system
CN111368996B (en) Retraining projection network capable of transmitting natural language representation
WO2022022163A1 (en) Text classification model training method, device, apparatus, and storage medium
Wang et al. Application of convolutional neural network in natural language processing
CN111738003B (en) Named entity recognition model training method, named entity recognition method and medium
CN111046179B (en) A text classification method for open network questions in specific domains
CN109325231B (en) A method for generating word vectors by a multi-task model
CN104834747B (en) Short text classification method based on convolutional neural networks
CN111930942B (en) Text classification method, language model training method, device and equipment
CN111291556B (en) Chinese entity relation extraction method based on character and word feature fusion of entity meaning item
CN110765775A (en) A Domain Adaptation Method for Named Entity Recognition Fusing Semantics and Label Differences
CN109977199B (en) A reading comprehension method based on attention pooling mechanism
CN112101041A (en) Entity relationship extraction method, device, equipment and medium based on semantic similarity
CN113392209B (en) Text clustering method based on artificial intelligence, related equipment and storage medium
Zhang et al. Ynu-hpcc at semeval-2018 task 1: Bilstm with attention based sentiment analysis for affect in tweets
CN108038492A (en) A kind of perceptual term vector and sensibility classification method based on deep learning
Chen et al. Customizable text generation via conditional text generative adversarial network
CN111159405B (en) Irony detection method based on background knowledge
CN113988079A (en) A dynamic enhanced multi-hop text reading recognition processing method for low data
CN111145914A (en) Method and device for determining lung cancer clinical disease library text entity
CN113051886A (en) Test question duplicate checking method and device, storage medium and equipment
CN111061873B (en) A Multi-Channel Text Classification Method Based on Attention Mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant