CN108984526A - A deep learning-based document topic vector extraction method - Google Patents

A deep learning-based document topic vector extraction method Download PDF

Info

Publication number
CN108984526A
CN108984526A
Authority
CN
China
Prior art keywords
vector
word
representing
time
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810748564.1A
Other languages
Chinese (zh)
Other versions
CN108984526B (en)
Inventor
高扬
黄河燕
陆池
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN201810748564.1A priority Critical patent/CN108984526B/en
Publication of CN108984526A publication Critical patent/CN108984526A/en
Application granted granted Critical
Publication of CN108984526B publication Critical patent/CN108984526B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/258Heading extraction; Automatic titling; Numbering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

本发明涉及一种基于深度学习的文档主题向量抽取方法，属于自然语言处理技术领域。本发明方法利用卷积神经网络抽取出具有局部的深层的语义信息，利用LSTM模型将时序信息学习出来，使得向量的语义更加全面，选用上下文短语和文档主题的隐含的共现关系，避免了一些基于句子的主题向量模型对于短文本的缺点，利用注意力机制将CNN和LSTM模型有机地结合起来，学习了上下文的深层语义、时序信息和显著信息，更有效地构建了文档主题向量抽取的模型。

The invention relates to a method for extracting document topic vectors based on deep learning, and belongs to the technical field of natural language processing. The method uses a convolutional neural network to extract local, deep semantic information and an LSTM model to learn sequential information, making the semantics of the vector more comprehensive; it exploits the implicit co-occurrence relationship between context phrases and document topics, avoiding the weakness that some sentence-based topic vector models have on short texts; and it uses an attention mechanism to organically combine the CNN and LSTM models, learning the deep semantics, sequential information and salient information of the context, so that the model for document topic vector extraction is built more effectively.

Description

一种基于深度学习的文档主题向量抽取方法A Method of Document Topic Vector Extraction Based on Deep Learning

技术领域technical field

本发明涉及一种基于深度学习的文档主题向量抽取方法,属于自然语言处理技术领域。The invention relates to a method for extracting document topic vectors based on deep learning, and belongs to the technical field of natural language processing.

背景技术Background technique

在如今的大数据时代，如何发现海量互联网文本数据的主题是一个研究重点。对文本数据的主题进行分析，文档主题向量本质上是表示文档的深层语义，是主题和语义的内在结合。抽取出文档主题向量可以广泛地应用于自然语言处理任务中，包括社交网络和新媒体的舆情分析、新闻热点的及时获取等等。因此，如何高效地抽取出文档主题向量是一个重要研究课题。In today's big data era, how to discover the topics of massive Internet text data is a key research focus. When analyzing the topics of text data, the document topic vector essentially represents the deep semantics of the document and is an intrinsic combination of topic and semantics. Extracted document topic vectors can be widely applied to natural language processing tasks, including public opinion analysis for social networks and new media, timely detection of trending news, and so on. Therefore, how to extract document topic vectors efficiently is an important research topic.

对于文本数据而言，其主题并不一定直接体现在具体的文字内容上，这就使得挖掘文本隐含的主题变得困难，需要根据文本的单词、句子、段落等关系来提取出文档所包含的主题意义，并结合文档的篇章关系从而提取出文档的主题。近些年随着统计自然语言处理方法和语料库的丰富，基于“词语-主题”“文档-主题”的文本主题建模方法也相继被提出，其基本思想在于假设每个词语和文档的主题是服从一个统计概率分布，通过对语料数据的训练，计算出其文档主题的概率分布，然后再根据这个文档主题进行聚类。For text data, the topic is not necessarily reflected directly in the literal content, which makes mining the implicit topics of a text difficult: the topical meaning carried by the document has to be extracted from the relationships among its words, sentences and paragraphs, and combined with the discourse structure of the document to derive its topic. In recent years, with the enrichment of statistical natural language processing methods and corpora, text topic modeling methods based on "word-topic" and "document-topic" relations have been proposed one after another. Their basic idea is to assume that the topic of each word and document obeys a statistical probability distribution; by training on the corpus data, the probability distribution of the document topics is computed, and the documents are then clustered according to these topics.

要正确分析出每个文档的主题，传统方法是对文本的每个词都进行主题分析，但是这种方法存在一个很大的问题：真正决定文本主题的词语其实只占该文本词语的少部分，因此传统方法会对与主题无关的词语进行大量的分析，这一方面无关词语导致实现起来计算量大，另一方面也存在着对于文本主题提取不精确，不能结合文本内在关联度关系挖掘文本深层语义的问题。To correctly analyze the topic of each document, the traditional approach is to perform topic analysis on every word of the text. This approach has a major problem: the words that truly determine the topic of a text account for only a small portion of its words, so the traditional approach spends a large amount of analysis on topic-irrelevant words. On the one hand, these irrelevant words make the computation expensive; on the other hand, topic extraction becomes imprecise, and the deep semantics of the text cannot be mined in combination with its internal relatedness.

随着硬件性能的提升以及数据规模的不断扩大，深度学习亦被广泛应用于各个领域之中，在其原有基础上大幅度提升了实验结果。深度学习以其优雅的模型、灵活的架构等特点，近些年结合单词Embedding和文档Embedding的方法中，得到了广泛的运用。在所有的深度学习方法中，CNN(Convolutional Neural Network，卷积神经网络)和LSTM模型(Long Short-Term Memory，长短期记忆网络模型)是最主流的两个。在自然语言处理任务中，基于CNN和LSTM模型的文本分析方法能够很好地发现文本的潜在语义特征，在语义分析计算上给予诸如自动文摘、情感分析、机器翻译等自然语言处理任务极大的帮助。With the improvement of hardware performance and the continuous growth of data scale, deep learning has been widely applied in many fields and has substantially improved experimental results over previous baselines. With its elegant models and flexible architectures, deep learning has in recent years been widely used in methods that combine word embeddings and document embeddings. Among deep learning methods, CNN (Convolutional Neural Network) and the LSTM model (Long Short-Term Memory network) are the two most mainstream. In natural language processing tasks, text analysis methods based on CNN and LSTM models can effectively discover the latent semantic features of text and greatly assist semantic analysis in tasks such as automatic summarization, sentiment analysis and machine translation.

发明内容Contents of the invention

本发明的目的是为了克服现有技术的缺陷,解决如何结合文本内在关联度关系挖掘文本深层语义的问题,提出一种基于深度学习的文档主题向量抽取方法。本发明把文档主题向量建模更多的聚焦在对文档主题特征向量的分析上,挖掘出文本特征和主题向量隐含的相关性,从而学习文档主题向量。The purpose of the present invention is to overcome the defects of the prior art, solve the problem of how to mine the deep semantics of the text in combination with the internal relevance relationship of the text, and propose a document topic vector extraction method based on deep learning. The present invention focuses more on the analysis of document topic feature vectors in the modeling of document topic vectors, digs out the implicit correlation between text features and topic vectors, and thereby learns document topic vectors.

本发明的核心思想为：利用CNN提取上下文短语的语义，将提取出来的语义输入到LSTM模型中，利用注意力机制提取文本的不同位置和不同意义词语的重要性，从而保留了重要信息，也完成了CNN和LSTM模型的有机结合，挖掘出上下文之间的内在关联，学习了具有深层语义和显著的文档主题向量。The core idea of the present invention is to use a CNN to extract the semantics of context phrases, feed the extracted semantics into the LSTM model, and use an attention mechanism to capture the importance of words at different positions and with different meanings, thereby retaining the important information. This also achieves an organic combination of the CNN and LSTM models, mines the intrinsic associations within the context, and learns document topic vectors with deep semantics and salience.

本发明方法是通过下述技术方案实现的。The method of the present invention is realized through the following technical solutions.

一种基于深度学习的文档主题向量抽取方法,包括以下步骤:A method for extracting document topic vectors based on deep learning, comprising the following steps:

步骤一、进行相关定义,具体如下:Step 1. Define the relevant definitions, as follows:

定义1:文档D,D=[w1,w2,...,wi,...,wn],wi表示文档D的第i个单词;Definition 1: Document D, D=[w 1 ,w 2 ,..., wi ,...,w n ], w i represents the i-th word of document D;

定义2:预测单词wd+1,表示需要学习的目标单词;Definition 2: The predicted word w d+1 represents the target word that needs to be learned;

定义3:窗口单词,由文本中几个连续出现的单词构成,窗口单词之间存在隐藏的内在关联;Definition 3: Window words are composed of several words that appear consecutively in the text, and there are hidden internal associations between window words;

定义4:上下文短语,表示预测单词所在位置之前出现的窗口单词,窗口长度为l,上下文短语记为wd-l,wd-l+1,...,wdDefinition 4: context phrase, which means the window word that appears before the position of the predicted word, the window length is l, and the context phrase is recorded as w dl , w d-l+1 ,...,w d ;

定义5:文档主题映射矩阵,通过LDA算法(Latent Dirichlet Allocation)学习得到,每一行代表一个文档的主题;Definition 5: Document topic mapping matrix, learned by LDA algorithm (Latent Dirichlet Allocation), each row represents the topic of a document;

定义6:Nd和docid,Nd表示语料中文档的个数,docid表示文档的位置;每一个文档对应唯一的一个docid,其中,1≤docid≤NdDefinition 6: N d and doc id , N d represents the number of documents in the corpus, and doc id represents the location of the document; each document corresponds to a unique doc id , where 1≤doc id≤N d ;

步骤二、利用CNN,学习得到上下文短语的语义向量。Step 2. Use CNN to learn the semantic vector of the context phrase.

步骤三、利用LSTM模型学习上下文短语的语义,获得隐含层向量hd-l,hd-l+1,...,hdStep 3: Use the LSTM model to learn the semantics of contextual phrases, and obtain hidden layer vectors h dl , h d-l+1 ,...,h d .

步骤四、通过注意力机制，将CNN和LSTM模型有机结合，获得上下文短语语义向量的平均值。Step 4. Through the attention mechanism, organically combine the CNN and LSTM models to obtain the average of the context phrase semantic vectors.

步骤五、通过逻辑回归的方法，利用上下文短语语义向量的平均值和文档主题信息预测目标单词wd+1，获得目标单词wd+1的预测概率。Step 5. Using logistic regression, predict the target word wd+1 from the average of the context phrase semantic vectors and the document topic information, and obtain the predicted probability of the target word wd+1.

有益效果Beneficial effect

本发明一种基于深度学习的文档主题向量抽取方法,对比现有技术,具有如下有益效果:A method for extracting document topic vectors based on deep learning in the present invention, compared with the prior art, has the following beneficial effects:

1.利用CNN抽取出具有局部的深层的语义信息;1. Use CNN to extract local deep semantic information;

2.利用LSTM模型将时序信息学习出来,使得向量的语义更加全面;2. Use the LSTM model to learn the timing information, making the semantics of the vector more comprehensive;

3.选用上下文短语和文档主题的隐含的共现关系,避免了一些基于句子的主题向量模型对于短文本的缺点;3. Select the implicit co-occurrence relationship between contextual phrases and document topics, avoiding the shortcomings of some sentence-based topic vector models for short texts;

4.利用注意力机制将CNN和LSTM模型有机地结合起来，学习了上下文的深层语义、时序信息和显著信息，更有效地构建了文档主题向量抽取的模型。4. The attention mechanism organically combines the CNN and LSTM models, learning the deep semantics, sequential information and salient information of the context, and building a more effective model for document topic vector extraction.

附图说明Description of drawings

图1为本发明一种基于深度学习的文档主题向量抽取方法的流程图。FIG. 1 is a flowchart of a method for extracting document topic vectors based on deep learning in the present invention.

具体实施方式Detailed ways

为了使本发明的目的、技术方案及优点更加清楚明白,以下根据附图及实施例对本发明所述的文摘方法进一步详细说明。In order to make the object, technical solution and advantages of the present invention clearer, the abstracting method described in the present invention will be further described in detail below according to the drawings and embodiments.

一种基于深度学习的文档主题向量抽取方法,其基本实施过程如下:A document topic vector extraction method based on deep learning, the basic implementation process is as follows:

步骤一、进行相关定义,具体如下:Step 1. Define the relevant definitions, as follows:

定义1:文档D,D=[w1,w2,...,wi,...,wn],wi表示文档D的第i个单词;Definition 1: Document D, D=[w 1 ,w 2 ,..., wi ,...,w n ], w i represents the i-th word of document D;

定义2:预测单词wd+1，表示需要学习的目标单词；Definition 2: the predicted word wd+1 denotes the target word to be learned;

定义3:窗口单词,由文本中几个连续出现的单词构成,窗口单词之间存在隐藏的内在关联;Definition 3: Window words are composed of several words that appear consecutively in the text, and there are hidden internal associations between window words;

定义4:上下文短语(wd-l,wd-l+1,...,wd),表示预测单词所在位置之前出现的窗口单词,上下文短语长度为l;Definition 4: The context phrase (w dl ,w d-l+1 ,...,w d ) indicates the window word that appears before the location of the predicted word, and the length of the context phrase is l;

定义5:文档主题映射矩阵,通过LDA算法学习得到,每一行代表一个文档的主题;Definition 5: Document topic mapping matrix, learned by LDA algorithm, each row represents the topic of a document;

定义6:Nd和docid,Nd表示语料中文档的个数,docid表示文档的位置;每一个文档对应唯一的一个docid,其中,1≤docid≤NdDefinition 6: N d and doc id , N d represents the number of documents in the corpus, and doc id represents the location of the document; each document corresponds to a unique doc id , where 1≤doc id≤N d ;

步骤二、利用CNN,学习得到上下文短语的语义向量Context。Step 2. Use CNN to learn the semantic vector Context of the context phrase.

具体实现过程如下:The specific implementation process is as follows:

步骤2.1利用word2vec等算法，训练文档D的词向量矩阵，词向量矩阵大小为n×m，n表示词向量矩阵的长，m表示词向量矩阵的宽；Step 2.1 Use an algorithm such as word2vec to train the word vector matrix of document D; the word vector matrix has size n×m, where n denotes the length of the word vector matrix and m denotes its width;
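As an illustration only (not part of the patent text), step 2.1 could be realized with an off-the-shelf word2vec implementation such as gensim; the toy corpus, the 128-dimension setting and the skip-gram choice below are assumptions.

```python
from gensim.models import Word2Vec

# Tiny placeholder corpus; in the patent's setting this would be the tokenized documents.
tokenized_docs = [
    ["deep", "learning", "extracts", "document", "topic", "vectors"],
    ["lstm", "and", "cnn", "learn", "contextual", "semantics"],
]

# vector_size corresponds to m, the width of the word-vector matrix
# (gensim >= 4.0 keyword assumed; older versions use size= instead).
w2v = Word2Vec(sentences=tokenized_docs, vector_size=128, window=5, min_count=1, sg=1)

word_matrix = w2v.wv.vectors      # n x m matrix of word vectors for the vocabulary
vec_lstm = w2v.wv["lstm"]         # look up a single word vector
```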

步骤2.2将上下文短语中每个单词对应的词向量从步骤2.1得到的词向量矩阵抽取出来,从而得到上下文短语wd-l,wd-l+1,...,wd的向量矩阵M;Step 2.2 extracts the word vector corresponding to each word in the context phrase from the word vector matrix obtained in step 2.1, so as to obtain the vector matrix M of the context phrase w dl , w d-l+1 ,...,w d ;

步骤2.3利用CNN计算上下文短语的语义向量Context。具体通过步骤2.2得到的向量矩阵M和K层大小为Cl×Cm的卷积核进行操作；Step 2.3 Use the CNN to calculate the semantic vector Context of the context phrase; specifically, the operation is performed between the vector matrix M obtained in step 2.2 and K convolution kernels of size Cl×Cm;

其中，K表示卷积核的个数，本具体实施方式中K等于128，Cl表示卷积核的长，且Cl=l，Cm表示卷积核的宽，且Cm=m；Wherein, K denotes the number of convolution kernels (K equals 128 in this embodiment), Cl denotes the length of the convolution kernel with Cl=l, and Cm denotes the width of the convolution kernel with Cm=m;

上下文短语的语义向量Context通过公式(1)计算:The semantic vector Context of the context phrase is calculated by formula (1):

Contextk = Σ(p=1..l) Σ(q=1..m) cpq·Mpq + b,  1≤k≤K (1)

Context=[Context1,Context2,...,ContextK]

其中,Contextk表示上下文短语的语义向量的第k维,l表示上下文短语长度,m表示词向量矩阵的宽,即词向量维度,d表示上下文短语中第一个单词的起始位置,cpq是卷积核第p行和第q列的权重参数,Mpq表示向量矩阵M的第p行和第q列数据,b是卷积核的偏置参数;Among them, Context k represents the k-th dimension of the semantic vector of the context phrase, l represents the length of the context phrase, m represents the width of the word vector matrix, that is, the word vector dimension, d represents the starting position of the first word in the context phrase, c pq is the weight parameter of the pth row and qth column of the convolution kernel, M pq represents the pth row and qth column data of the vector matrix M, and b is the bias parameter of the convolution kernel;
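A minimal numpy sketch of step 2.3 as reconstructed in formula (1): each of the K kernels has the same size l×m as the context-phrase matrix M, so each output dimension is an element-wise product summed over all positions plus a bias. The absence of a nonlinearity and the per-kernel bias are assumptions, since the patent does not state them explicitly.

```python
import numpy as np

def context_vector(M, kernels, biases):
    """Step 2.3 sketch: M is the l x m matrix of the context phrase,
    kernels is K x l x m (one l x m kernel per output dimension),
    biases has length K. Returns the K-dimensional Context vector of formula (1)."""
    K = kernels.shape[0]
    ctx = np.empty(K)
    for k in range(K):
        # element-wise product of kernel k with M, summed over all positions, plus bias
        ctx[k] = np.sum(kernels[k] * M) + biases[k]
    return ctx

# toy usage: l=3 context words, m=4-dimensional word vectors, K=5 kernels
l, m, K = 3, 4, 5
M = np.random.randn(l, m)
Context = context_vector(M, np.random.randn(K, l, m), np.zeros(K))
```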

步骤三、利用LSTM模型学习上下文短语的语义,获得隐含层向量hd-l,hd-l+1,...,hdStep 3: Use the LSTM model to learn the semantics of contextual phrases, and obtain hidden layer vectors h dl , h d-l+1 ,...,h d .

具体实现过程如下:The specific implementation process is as follows:

步骤3.1将t赋值为d-l，即t=d-l，t表示第t时刻；Step 3.1 Assign d-l to t, i.e., t=d-l, where t denotes the t-th moment;

步骤3.2将xt赋值为wt的词向量，xt表示第t时刻输入的词向量，wt表示第t时刻输入的单词；Step 3.2 Assign the word vector of wt to xt, where xt denotes the word vector input at time t and wt denotes the word input at time t;

其中,wt的词向量通过步骤2.1输出的词向量矩阵映射得到,即抽取wt在向量矩阵M对应位置的词向量;Among them, the word vector of w t is obtained by mapping the word vector matrix output in step 2.1, that is, extracting the word vector of w t in the corresponding position of the vector matrix M;

步骤3.3将xt作为LSTM模型的输入,获得t时刻的隐含层向量htStep 3.3 takes x t as the input of the LSTM model, and obtains the hidden layer vector h t at time t ;

具体实现过程如下:The specific implementation process is as follows:

步骤3.3.1计算t时刻的遗忘门ft,用于控制遗忘信息,通过公式(2)计算;Step 3.3.1 Calculate the forgetting gate f t at time t , which is used to control the forgotten information, and is calculated by formula (2);

ft=σ(Wfxt+Ufht-1+bf) (2)

其中,Wf表示参数矩阵,xt表示第t时刻输入的词向量,Uf表示参数矩阵,ht-1表示t-1时刻的隐含层向量,bf表示偏置向量参数,当t=d-l时,ht-1=hd-l-1,且hd-l-1为零向量,σ表示Sigmoid函数,是LSTM模型的激活函数;Among them, W f represents the parameter matrix, x t represents the word vector input at time t, U f represents the parameter matrix, h t-1 represents the hidden layer vector at time t-1, b f represents the bias vector parameter, when t = dl, h t-1 = h dl-1 , and h dl-1 is a zero vector, σ represents the Sigmoid function, which is the activation function of the LSTM model;

步骤3.3.2计算t时刻的输入门it,用于控制当前时刻需要添加的新信息,通过公式(3)计算;Step 3.3.2 Calculate the input gate i t at time t, which is used to control the new information that needs to be added at the current time, and is calculated by formula (3);

it=σ(Wixt+Uiht-1+bi) (3)

其中,Wi表示参数矩阵,xt表示第t时刻输入的词向量,Ui表示参数矩阵,ht-1表示t-1时刻的隐含层向量,bi表示偏置向量参数,σ表示Sigmoid函数,是LSTM模型的激活函数;Among them, W i represents the parameter matrix, x t represents the word vector input at time t, U i represents the parameter matrix, h t-1 represents the hidden layer vector at time t-1, bi represents the bias vector parameter, and σ represents The Sigmoid function is the activation function of the LSTM model;

步骤3.3.3计算t时刻更新的信息c̃t，通过公式(4)计算；Step 3.3.3 Calculate the information c̃t updated at time t by formula (4);

c̃t=tanh(Wcxt+Ucht-1+bc) (4)

其中，Wc表示参数矩阵，xt表示第t时刻输入的词向量，Uc表示参数矩阵，ht-1表示t-1时刻的隐含层向量，bc表示偏置向量参数，tanh表示双曲正切函数，是LSTM模型的激活函数；Wherein, Wc denotes a parameter matrix, xt denotes the word vector input at time t, Uc denotes a parameter matrix, ht-1 denotes the hidden layer vector at time t-1, bc denotes a bias vector parameter, and tanh denotes the hyperbolic tangent function, the activation function of the LSTM model;

步骤3.3.4计算t时刻的信息ct，将上一时刻的信息和当前时刻更新的信息相加得到，通过公式(5)计算；Step 3.3.4 Calculate the information ct at time t, obtained by adding the information of the previous moment and the information updated at the current moment, by formula (5);

ct=ft∘ct-1+it∘c̃t (5)

其中，ct表示t时刻的信息，ft表示t时刻的遗忘门，ct-1表示t-1时刻的信息，it表示t时刻的输入门，c̃t表示t时刻更新的信息，∘表示向量的逐元素相乘；Wherein, ct denotes the information at time t, ft the forget gate at time t, ct-1 the information at time t-1, it the input gate at time t, c̃t the information updated at time t, and ∘ denotes the element-wise (Hadamard) product of vectors;

步骤3.3.5计算t时刻的输出门ot,用于控制输入信息,通过公式(6)计算:Step 3.3.5 Calculate the output gate o t at time t , which is used to control the input information, calculated by formula (6):

ot=σ(Woxt+Uoht-1+bo) (6)

其中，Wo表示参数矩阵，xt表示第t时刻输入的词向量，Uo表示参数矩阵，ht-1表示t-1时刻的隐含层向量，bo表示偏置向量参数，σ表示Sigmoid函数，是LSTM模型的激活函数；其中，步骤3.3.1-3.3.3和步骤3.3.5中的参数矩阵Wf,Uf,Wi,Ui,Wc,Uc,Wo,Uo的矩阵元素大小不同，偏置向量参数bf,bi,bc,bo中的元素大小不同；Wherein, Wo denotes a parameter matrix, xt denotes the word vector input at time t, Uo denotes a parameter matrix, ht-1 denotes the hidden layer vector at time t-1, bo denotes a bias vector parameter, and σ denotes the Sigmoid function, the activation function of the LSTM model; the parameter matrices Wf, Uf, Wi, Ui, Wc, Uc, Wo, Uo in steps 3.3.1-3.3.3 and 3.3.5 have different matrix elements, and the elements of the bias vector parameters bf, bi, bc, bo also differ;

步骤3.3.6计算t时刻的隐含层向量ht，通过公式(7)计算：Step 3.3.6 Calculate the hidden layer vector ht at time t by formula (7):

ht=ot∘ct (7)

其中,ot表示t时刻的输出门,ct表示t时刻的信息;Among them, o t represents the output gate at time t, and c t represents the information at time t;

步骤3.4判断t是否等于d,若不等于则t加1,跳步骤3.2;若等于,则输出隐含层向量hd-l,hd-l+1,...,hd,跳入步骤四;Step 3.4 Determine whether t is equal to d, if not, add 1 to t, and skip to step 3.2; if it is equal, output hidden layer vectors h dl , h d-l+1 ,...,h d , and jump to step 4 ;
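A minimal numpy sketch of one LSTM time step implementing formulas (2) through (7) as written above; the parameter names Wc, Uc, bc for the candidate information and the equal input/hidden dimensions are assumptions. Note that formula (7) multiplies ot directly with ct, without the tanh found in many textbook LSTM formulations, and the sketch follows the patent's formula.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, P):
    """One time step of step three. P holds parameter matrices/vectors
    W*, U*, b* for the forget, input, candidate and output gates (assumed names)."""
    f_t = sigmoid(P["Wf"] @ x_t + P["Uf"] @ h_prev + P["bf"])        # formula (2)
    i_t = sigmoid(P["Wi"] @ x_t + P["Ui"] @ h_prev + P["bi"])        # formula (3)
    c_tilde = np.tanh(P["Wc"] @ x_t + P["Uc"] @ h_prev + P["bc"])    # formula (4)
    c_t = f_t * c_prev + i_t * c_tilde                               # formula (5)
    o_t = sigmoid(P["Wo"] @ x_t + P["Uo"] @ h_prev + P["bo"])        # formula (6)
    h_t = o_t * c_t                                                  # formula (7) as written in the patent
    return h_t, c_t

# toy usage over a context phrase of word vectors, hidden size = word-vector size m
m = 4
P = {k: np.random.randn(m, m) * 0.1 for k in ("Wf", "Uf", "Wi", "Ui", "Wc", "Uc", "Wo", "Uo")}
P.update({k: np.zeros(m) for k in ("bf", "bi", "bc", "bo")})
h, c = np.zeros(m), np.zeros(m)
hidden_states = []
for x in np.random.randn(5, m):       # 5 context words
    h, c = lstm_step(x, h, c, P)
    hidden_states.append(h)
```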

步骤四、利用注意力机制，将CNN和LSTM模型结合，获得上下文短语语义向量的平均值，具体实现过程如下：Step 4. Use the attention mechanism to combine the CNN and LSTM models to obtain the average of the context phrase semantic vectors; the specific implementation process is as follows:

步骤4.1利用步骤二得到的上下文短语语义向量,通过注意力机制得到每个单词在上下文短语的语义向量上的重要性因子α,具体通过公式(8)计算:Step 4.1 Use the semantic vector of the context phrase obtained in step 2 to obtain the importance factor α of each word on the semantic vector of the context phrase through the attention mechanism, specifically calculated by formula (8):

αt = e^(ContextT·xt) / Σ(i=d-l..d) e^(ContextT·xi),  d-l≤t≤d

α=[αd-l,αd-l+1,...,αd] (8)

其中,αt表示t时刻单词在上下文短语的语义向量上的重要性因子,Context表示步骤二中获得的上下文短语的语义向量,xt表示第t时刻输入的词向量,xi表示第i时刻输入的词向量;T表示向量的转置;e表示以e,即自然常数为底的指数函数;Among them, α t represents the importance factor of the word on the semantic vector of the context phrase at time t, Context represents the semantic vector of the context phrase obtained in step 2, x t represents the word vector input at time t, and x i represents time i The input word vector; T represents the transposition of the vector; e represents an exponential function based on e, which is a natural constant;

步骤4.2计算基于注意力机制带权重的隐含层向量h′,通过公式(9)计算;Step 4.2 calculates the weighted hidden layer vector h' based on the attention mechanism, which is calculated by formula (9);

h′t=αt*ht,  d-l≤t≤d

h′=[h′d-l,h′d-l+1,...,h′d] (9)

其中,h′t表示t时刻权重隐含层向量h′t,αt表示t时刻每个单词在上下文短语的语义向量上的重要性因子,ht表示t时刻隐含层向量;Among them, h′ t represents the weight hidden layer vector h′ t at time t , α t represents the importance factor of each word on the semantic vector of the context phrase at time t, and h t represents the hidden layer vector at time t;

步骤4.3利用mean-pooling操作，计算上下文短语语义向量的平均值h̄，通过公式(10)计算：Step 4.3 Use the mean-pooling operation to calculate the average h̄ of the context phrase semantic vectors by formula (10):

h̄ = (1/(l+1))·Σ(t=d-l..d) h′t (10)

其中，h′t表示t时刻权重隐含层向量；Wherein, h′t denotes the weighted hidden layer vector at time t;
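A minimal numpy sketch of step four as reconstructed in formulas (8)-(10): a softmax of Context·xt gives the importance factors, which weight the LSTM hidden vectors before mean pooling. The dot-product form of the attention score and the division by the number of context words are assumptions consistent with the symbol descriptions.

```python
import numpy as np

def attention_pool(Context_vec, X, H):
    """Step four sketch: X holds the context-phrase word vectors (one per row),
    H the corresponding LSTM hidden vectors; Context_vec is the CNN output of step two.
    Returns the importance factors alpha and the mean-pooled vector of formula (10)."""
    scores = X @ Context_vec                         # Context^T x_t for every t
    alpha = np.exp(scores) / np.sum(np.exp(scores))  # formula (8): softmax importance factors
    H_weighted = alpha[:, None] * H                  # formula (9): weighted hidden vectors h'_t
    h_bar = H_weighted.mean(axis=0)                  # formula (10): mean pooling
    return alpha, h_bar

# toy usage: 5 context words, 4-dimensional word/hidden/Context vectors (assumed equal sizes)
X = np.random.randn(5, 4)
H = np.random.randn(5, 4)
Context_vec = np.random.randn(4)
alpha, h_bar = attention_pool(Context_vec, X, H)
```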

步骤五、通过逻辑回归的方法,利用上下文短语语义向量的平均值和文档主题信息预测目标单词wd+1,获得目标单词wd+1的预测概率。具体实现过程如下:Step 5: Predict the target word w d+1 by using the mean value of the semantic vector of the context phrase and the topic information of the document by means of logistic regression, and obtain the predicted probability of the target word w d+1 . The specific implementation process is as follows:

步骤5.1利用LDA算法学习文档主题映射矩阵，然后根据文档主题映射矩阵和docid将每一个文档映射成一个长度和步骤2.1中词向量矩阵宽度相等的一维向量Dz；Step 5.1 Use the LDA algorithm to learn the document topic mapping matrix, and then, according to the document topic mapping matrix and docid, map each document into a one-dimensional vector Dz whose length equals the width of the word vector matrix in step 2.1;
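As an illustration of step 5.1 (not part of the patent text), the document-topic mapping could be learned with gensim's LDA implementation; setting the number of topics equal to the word-vector width m so that Dz has the required length is an assumption, as are the toy corpus and the pass count.

```python
from gensim import corpora
from gensim.models import LdaModel
import numpy as np

tokenized_docs = [["topic", "vector", "extraction"], ["lstm", "cnn", "attention", "topic"]]
dictionary = corpora.Dictionary(tokenized_docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

m = 128                                    # assumed: number of topics = word-vector width m
lda = LdaModel(bow_corpus, num_topics=m, id2word=dictionary, passes=5)

def topic_vector(doc_id):
    """Dense m-dimensional topic distribution Dz for the document at position doc_id."""
    dz = np.zeros(m)
    for topic, prob in lda.get_document_topics(bow_corpus[doc_id], minimum_probability=0.0):
        dz[topic] = prob
    return dz
```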

步骤5.2将步骤5.1输出的向量Dz和步骤四输出的上下文短语语义向量的平均值拼接起来，得到拼接向量Vd；Step 5.2 Concatenate the vector Dz output in step 5.1 with the average of the context phrase semantic vectors output in step 4 to obtain the concatenated vector Vd;

步骤5.3利用步骤5.2输出的Vd来预测目标单词wd+1。具体通过逻辑回归的方法进行分类，目标函数如公式(11)：Step 5.3 Use Vd output in step 5.2 to predict the target word wd+1, performing classification by logistic regression; the objective function is given by formula (11):

P(y=wd+1|Vd) = exp(θd+1T·Vd) / Σ(i=1..|V|) exp(θiT·Vd) (11)

其中，θd+1是目标单词wd+1所在位置对应的参数，θi对应词表中单词wi对应的参数，|V|表示词表的大小，Vd是步骤5.2得到的拼接向量，exp表示以e为底的指数函数，Σ表示求和；P表示概率，y表示因变量，T表示矩阵转置。Wherein, θd+1 is the parameter corresponding to the position of the target word wd+1, θi is the parameter corresponding to word wi in the vocabulary, |V| denotes the vocabulary size, Vd is the concatenated vector obtained in step 5.2, exp denotes the exponential function with base e, Σ denotes summation; P denotes probability, y denotes the dependent variable, and T denotes matrix transposition.

步骤5.4利用交叉熵的方法,通过公式(12)计算目标函数(11)的损失函数:Step 5.4 uses the method of cross entropy to calculate the loss function of the objective function (11) by formula (12):

L=-log(P(y=wd+1|Vd)) (12)

其中，wd+1表示目标单词，Vd是步骤5.2得到的拼接向量，log()表示以10为底的对数函数；Wherein, wd+1 denotes the target word, Vd is the concatenated vector obtained in step 5.2, and log() denotes the base-10 logarithmic function;

损失函数(12)通过Sampled Softmax算法和小批量随机梯度下降参数更新方法进行更新求解,得到文档主题向量。The loss function (12) is updated and solved by the Sampled Softmax algorithm and the mini-batch stochastic gradient descent parameter update method to obtain the document topic vector.
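A minimal numpy sketch of steps 5.3-5.4 as reconstructed in formulas (11)-(12): a softmax over per-word parameters θi scores the concatenated vector Vd, and the loss is the negative log probability of the target word (base-10 log, as the patent states). The parameter matrix Theta and its initialization are assumptions.

```python
import numpy as np

def predict_and_loss(V_d, Theta, target_idx):
    """Step 5.3/5.4 sketch: Theta is a |V| x dim parameter matrix (one row theta_i per
    vocabulary word), V_d is the concatenated vector. Returns the probability of the
    target word (formula (11)) and the cross-entropy loss (formula (12))."""
    logits = Theta @ V_d
    probs = np.exp(logits - logits.max())            # numerically stable softmax
    probs = probs / probs.sum()
    p_target = probs[target_idx]                     # P(y = w_{d+1} | V_d)
    loss = -np.log10(p_target)                       # base-10 log, as stated in the patent
    return p_target, loss

# toy usage: vocabulary of 20000 words, 256-dimensional V_d (h_bar and Dz concatenated)
vocab = 20000
V_d = np.random.randn(256)
Theta = np.random.randn(vocab, 256) * 0.01
p, L = predict_and_loss(V_d, Theta, target_idx=42)
```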

至此，从步骤一到步骤五，完成了具有深层语义和显著性的文档主题向量的学习。At this point, through steps one to five, a document topic vector with deep semantics and salience has been learned.

实施例Example

本实施例叙述了本发明的具体实施过程,如图1所示。This embodiment describes the specific implementation process of the present invention, as shown in FIG. 1 .

从图1可以看出,本发明一种基于深度学习的文档主题向量抽取方法的流程如下:As can be seen from Figure 1, the process of a method for extracting document topic vectors based on deep learning in the present invention is as follows:

步骤A预处理；首先去除掉语料中的无意义符号，如特殊字符等，然后对文本进行分词。分词就是将连续的文字序列按照既定的词法规则分割成单独的词语的过程，从而将句子分解为若干个连续的有意义的单词串用于后续分析。分词操作利用PTB分词器进行分词处理。在分词之后，对原始的文本构建词表，在本实施例中，词表选取的是训练文本中词频从高到低的前20000个单词，也就是词表V的大小为20000。在词表选取之后，按照词表的索引构建出原始语料的词表索引数据，这个文本词表索引数据作为模型的输入。Step A: preprocessing. First, meaningless symbols such as special characters are removed from the corpus, and the text is then tokenized. Tokenization is the process of splitting a continuous character sequence into individual words according to established lexical rules, so that a sentence is decomposed into several consecutive meaningful word strings for subsequent analysis. Tokenization is performed with the PTB tokenizer. After tokenization, a vocabulary is built from the original text; in this embodiment the vocabulary consists of the 20,000 most frequent words of the training text, i.e., the size of the vocabulary V is 20,000. After the vocabulary is selected, the vocabulary index data of the original corpus is constructed according to the vocabulary indices, and this indexed text data serves as the input of the model.
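A minimal sketch of step A under simplifying assumptions: whitespace splitting stands in for the PTB tokenizer, the cleaning regex is illustrative, and only the 20,000 most frequent words are indexed, with everything else mapped to an out-of-vocabulary index.

```python
import re
from collections import Counter

def preprocess(raw_docs, vocab_size=20000):
    """Step A sketch: strip special characters, tokenize, keep the vocab_size most
    frequent words, and convert the corpus to vocabulary indices."""
    tokenized = [re.sub(r"[^\w\s]", " ", doc).lower().split() for doc in raw_docs]
    counts = Counter(tok for doc in tokenized for tok in doc)
    vocab = {w: i for i, (w, _) in enumerate(counts.most_common(vocab_size))}
    unk = len(vocab)                                   # out-of-vocabulary index
    indexed = [[vocab.get(tok, unk) for tok in doc] for doc in tokenized]
    return vocab, indexed

vocab, indexed_corpus = preprocess(["A deep-learning based topic model!", "CNN and LSTM with attention."])
```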

步骤B利用word2vec算法学习词向量。将文档中单词输入到word2vec算法中,得到词向量,其目标函数如公式(13):Step B uses the word2vec algorithm to learn word vectors. Input the words in the document into the word2vec algorithm to get the word vector, and its objective function is as formula (13):

其中，k为窗口单词，i为当前单词，Corp为语料库中单词总数，利用梯度下降方法学习得到128维的词向量；Wherein, k is a window word, i is the current word, and Corp is the total number of words in the corpus; 128-dimensional word vectors are learned by gradient descent;

步骤C利用CNN抽取上下文短语语义向量,利用RNN学习上下文短语隐含层向量;Step C uses CNN to extract contextual phrase semantic vectors, and uses RNN to learn contextual phrase hidden layer vectors;

其中,利用CNN抽取上下文短语语义向量,利用RNN学习上下文短语隐含层向量是并列计算的,具体到本实施例:Wherein, using CNN to extract contextual phrase semantic vectors, using RNN to learn contextual phrase hidden layer vectors are calculated in parallel, specific to this embodiment:

利用CNN抽取上下文短语语义向量；首先利用高斯分布进行随机初始化一个K层的大小为Cl×Cm的卷积核，对于给定的上下文短语wd-l,wd-l+1,...,wd，通过步骤B学到的词向量将这些上下文短语映射成大小为l×m的矩阵，其中，l是上下文短语的长度，m为词向量的维度，将该矩阵在随机初始化的卷积核上进行卷积操作，具体操作方式如公式(1)所示，这样就得到了一个向量Context，该向量就是上下文短语的语义向量；Use the CNN to extract the context phrase semantic vector. First, a K-layer convolution kernel of size Cl×Cm is randomly initialized from a Gaussian distribution. For a given context phrase wd-l,wd-l+1,...,wd, the word vectors learned in step B map the context phrase into a matrix of size l×m, where l is the length of the context phrase and m is the dimension of the word vectors. The convolution operation is then performed between this matrix and the randomly initialized convolution kernels, as shown in formula (1), yielding a vector Context, which is the semantic vector of the context phrase;

利用RNN学习上下文短语隐含层向量；将上下文短语wd-l,wd-l+1,...,wd对应的词向量按顺序输入到LSTM模型中，将0时刻的隐含层向量h0的每一个维度设置为0，然后利用公式(2)-(7)依次计算遗忘门、输入门、输出门和最终的上下文短语隐含层向量，维度大小设置为128；Use the RNN to learn the context phrase hidden layer vectors: the word vectors corresponding to the context phrase wd-l,wd-l+1,...,wd are input into the LSTM model in order, every dimension of the hidden layer vector h0 at time 0 is set to 0, and formulas (2)-(7) are then used to compute in turn the forget gate, the input gate, the output gate and the resulting context phrase hidden layer vectors; the dimension size is set to 128;

步骤D利用注意力机制计算带权重的语义向量、计算文档主题分布;Step D uses the attention mechanism to calculate the weighted semantic vector and calculate the topic distribution of the document;

其中,利用注意力机制计算带权重的语义向量、计算文档主题分布是并列计算的,具体到本实施例:Among them, the use of the attention mechanism to calculate the weighted semantic vector and the calculation of the topic distribution of the document are calculated in parallel, specific to this embodiment:

利用注意力机制计算带权重的语义向量；根据步骤B得到的词向量和步骤C得到的上下文短语的语义向量，对于上下文短语中的每一个单词进行注意力机制操作，求得注意力因子αt，αt是一个在0到1之间的实数，其数字越大，那么其对应位置的词向量信息就会越多地保留在最后的mean-pooling层中，因此其大小表示了当前词语在表征整个短语意义时的重要性，也就是说越重要的词语将会得到更多的注意；Use the attention mechanism to compute the weighted semantic vectors: based on the word vectors obtained in step B and the context phrase semantic vector obtained in step C, the attention mechanism is applied to each word in the context phrase to obtain the attention factor αt. αt is a real number between 0 and 1; the larger its value, the more of the word vector information at the corresponding position is retained in the final mean-pooling layer, so its magnitude indicates how important the current word is for representing the meaning of the whole phrase; in other words, more important words receive more attention;

计算文档主题分布;具体是利用LDA算法计算,首先将文档D输入到LDA算法中,得到每一个文档D的主题分布,该主题分布直接作为最终结果,记为DzCalculate the document topic distribution; specifically, use the LDA algorithm to calculate, first input the document D into the LDA algorithm, and obtain the topic distribution of each document D, and the topic distribution is directly used as the final result, which is recorded as D z ;

步骤E预测目标单词，学习文档主题向量；将带权重的语义向量和Dz直接拼接起来，然后使得目标单词出现的概率最大，由Sampled Softmax算法和小批量随机梯度下降参数更新方法即可求得文档主题向量。Step E: predict the target word and learn the document topic vector. The weighted semantic vector and Dz are concatenated directly, and the probability of the target word is then maximized; the document topic vector is obtained with the Sampled Softmax algorithm and the mini-batch stochastic gradient descent parameter update method.
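As an illustration of the training objective in step E (not part of the patent text), tf.nn.sampled_softmax_loss is one common implementation of Sampled Softmax; the TensorFlow 2 API, the hyperparameters, and updating only the softmax parameters here (a full model would also include the CNN, LSTM and embedding parameters) are assumptions.

```python
import tensorflow as tf

vocab_size, dim, num_sampled = 20000, 256, 64

# softmax parameters theta (one row per vocabulary word) and biases
softmax_w = tf.Variable(tf.random.normal([vocab_size, dim], stddev=0.01))
softmax_b = tf.Variable(tf.zeros([vocab_size]))

optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)  # mini-batch stochastic gradient descent

def train_step(V_d, target_ids):
    """V_d: [batch, dim] concatenated vectors; target_ids: [batch, 1] indices of w_{d+1}."""
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(
            tf.nn.sampled_softmax_loss(
                weights=softmax_w, biases=softmax_b,
                labels=target_ids, inputs=V_d,
                num_sampled=num_sampled, num_classes=vocab_size))
    grads = tape.gradient(loss, [softmax_w, softmax_b])
    optimizer.apply_gradients(zip(grads, [softmax_w, softmax_b]))
    return loss

# toy batch of 32 concatenated vectors and random target word indices
loss = train_step(tf.random.normal([32, dim]),
                  tf.random.uniform([32, 1], maxval=vocab_size, dtype=tf.int64))
```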

Claims (5)

1. A document topic vector extraction method based on deep learning, characterized by comprising the following steps:
Step one, making the following definitions:
Definition 1: document D, D=[w1,w2,...,wi,...,wn], where wi denotes the i-th word of document D;
Definition 2: the predicted word wd+1, denoting the target word to be learned;
Definition 3: window words, which are formed by words appearing consecutively in the text, with hidden internal associations existing between the window words;
Definition 4: the context phrase wd-l,wd-l+1,...,wd, i.e., the window words appearing before the position of the predicted word, the length of the context phrase being l;
Definition 5: the document topic mapping matrix, obtained by learning with the LDA algorithm, each row representing the topic of one document;
Definition 6: Nd and docid, where Nd denotes the number of documents in the corpus and docid denotes the position of a document; each document corresponds to a unique docid, with 1≤docid≤Nd;
Step two, learning the semantic vector of the context phrase by using a convolutional neural network (CNN);
Step three, learning the semantics of the context phrase by using the long short-term memory network model LSTM to obtain the hidden layer vectors hd-l,hd-l+1,...,hd; the specific implementation process is as follows:
Step 3.1: assign d-l to t, i.e., t=d-l, where t denotes the t-th moment;
Step 3.2: assign the word vector of wt to xt, where xt denotes the word vector input at time t and wt denotes the word input at time t;
wherein the word vector of wt is obtained by mapping through the word vector matrix output in step 2.1, i.e., by extracting the word vector of wt at the corresponding position of the vector matrix M;
Step 3.3: take xt as the input of the LSTM model to obtain the hidden layer vector ht at time t;
Step 3.4: judge whether t is equal to d; if not, add 1 to t and return to step 3.2; if so, output the hidden layer vectors hd-l,hd-l+1,...,hd and go to step four;
Step four, organically combining the CNN and LSTM models through an attention mechanism to obtain the average of the context phrase semantic vectors; the specific implementation method comprises the following steps:
Step 4.1: using the context phrase semantic vector obtained in step two, obtain the importance factor α of each word on the semantic vector of the context phrase through the attention mechanism, calculated by the following formula:
αt = e^(ContextT·xt) / Σ(i=d-l..d) e^(ContextT·xi),  d-l≤t≤d
α=[αd-l,αd-l+1,...,αd]
wherein αt denotes the importance factor of the word at time t on the semantic vector of the context phrase, Context denotes the semantic vector of the context phrase obtained in step two, xt denotes the word vector input at time t, xi denotes the word vector input at time i, T denotes the transposition of a vector, and e denotes the exponential function with the natural constant e as its base;
Step 4.2: calculate the weighted hidden layer vector h′ based on the attention mechanism by the following formulas:
h′t=αt*ht,  d-l≤t≤d
h′=[h′d-l,h′d-l+1,...,h′d]
wherein h′t denotes the weighted hidden layer vector at time t, αt denotes the importance factor of each word at time t on the semantic vector of the context phrase, and ht denotes the hidden layer vector at time t;
Step 4.3: calculate the average h̄ of the context phrase semantic vectors using the mean-pooling operation, by the following formula (10):
h̄ = (1/(l+1))·Σ(t=d-l..d) h′t (10)
wherein h′t denotes the weighted hidden layer vector at time t;
Step five, using logistic regression, predicting the target word wd+1 from the average of the context phrase semantic vectors and the document topic information, and obtaining the predicted probability of the target word wd+1.
2. The deep learning-based document topic vector extraction method according to claim 1, characterized in that step two is implemented by the following steps:
Step 2.1: train the word vector matrix of document D, the size of the word vector matrix being n×m, where n denotes the length of the word vector matrix and m denotes its width;
Step 2.2: extract the word vector corresponding to each word in the context phrase from the word vector matrix obtained in step 2.1 to obtain the vector matrix M of the context phrase wd-l,wd-l+1,...,wd;
Step 2.3: calculate the semantic vector Context of the context phrase by using the CNN, operating on the vector matrix M obtained in step 2.2 with K convolution kernels of size Cl×Cm;
wherein K denotes the number of convolution kernels, Cl denotes the length of the convolution kernel with Cl=l, and Cm denotes the width of the convolution kernel with Cm=m;
the semantic vector Context of the context phrase is calculated by formula (1):
Contextk = Σ(p=1..l) Σ(q=1..m) cpq·Mpq + b,  1≤k≤K (1)
Context=[Context1,Context2,...,ContextK]
wherein Contextk denotes the k-th dimension of the semantic vector of the context phrase, l denotes the length of the context phrase, m denotes the width of the word vector matrix, i.e., the word vector dimension, d denotes the starting position of the first word in the context phrase, cpq is the weight parameter at row p and column q of the convolution kernel, Mpq denotes the data at row p and column q of the vector matrix M, and b is the bias parameter of the convolution kernel.
3. The deep learning-based document topic vector extraction method according to claim 1, characterized in that step 3.3 is implemented as follows:
Step 3.3.1: calculate the forget gate ft at time t, used to control the forgotten information, by formula (2):
ft=σ(Wfxt+Ufht-1+bf) (2)
wherein Wf denotes a parameter matrix, xt denotes the word vector input at time t, Uf denotes a parameter matrix, ht-1 denotes the hidden layer vector at time t-1, bf denotes a bias vector parameter; when t=d-l, ht-1=hd-l-1 and hd-l-1 is a zero vector; σ denotes the Sigmoid function, the activation function of the LSTM model;
Step 3.3.2: calculate the input gate it at time t, used to control the new information to be added at the current moment, by formula (3):
it=σ(Wixt+Uiht-1+bi) (3)
wherein Wi denotes a parameter matrix, xt denotes the word vector input at time t, Ui denotes a parameter matrix, ht-1 denotes the hidden layer vector at time t-1, bi denotes a bias vector parameter, and σ denotes the Sigmoid function, the activation function of the LSTM model;
Step 3.3.3: calculate the information c̃t updated at time t by formula (4):
c̃t=tanh(Wcxt+Ucht-1+bc) (4)
wherein Wc denotes a parameter matrix, xt denotes the word vector input at time t, Uc denotes a parameter matrix, ht-1 denotes the hidden layer vector at time t-1, bc denotes a bias vector parameter, and tanh denotes the hyperbolic tangent function, the activation function of the LSTM model;
Step 3.3.4: calculate the information ct at time t, obtained by adding the information of the previous moment and the information updated at the current moment, by formula (5):
ct=ft∘ct-1+it∘c̃t (5)
wherein ct denotes the information at time t, ft denotes the forget gate at time t, ct-1 denotes the information at time t-1, it denotes the input gate at time t, c̃t denotes the information updated at time t, and ∘ denotes the element-wise product of vectors;
Step 3.3.5: calculate the output gate ot at time t, used to control the input information, by formula (6):
ot=σ(Woxt+Uoht-1+bo) (6)
wherein Wo denotes a parameter matrix, xt denotes the word vector input at time t, Uo denotes a parameter matrix, ht-1 denotes the hidden layer vector at time t-1, bo denotes a bias vector parameter, and σ denotes the Sigmoid function, the activation function of the LSTM model; the parameter matrices Wf, Uf, Wi, Ui, Wc, Uc, Wo, Uo have different matrix elements, and the elements of the bias vector parameters bf, bi, bc, bo also differ;
Step 3.3.6: calculate the hidden layer vector ht at time t by formula (7):
ht=ot∘ct (7)
wherein ot denotes the output gate at time t and ct denotes the information at time t.
4. The deep learning-based document topic vector extraction method according to claim 1, characterized in that step five is implemented as follows:
Step 5.1: learn the document topic mapping matrix, and then, according to the document topic mapping matrix and docid, map each document into a one-dimensional vector Dz whose length equals the width of the word vector matrix in step 2.1;
Step 5.2: concatenate the vector Dz output in step 5.1 with the average of the context phrase semantic vectors output in step four to obtain the concatenated vector Vd;
Step 5.3: use Vd output in step 5.2 to predict the target word wd+1;
Step 5.4: calculate the loss function of the objective function (11) by the cross-entropy method, through formula (12):
L=-log(P(y=wd+1|Vd)) (12)
wherein wd+1 denotes the target word, Vd is the concatenated vector of step 5.2, and log() denotes the base-10 logarithmic function;
the loss function (12) is updated and solved by the Sampled Softmax algorithm and the mini-batch stochastic gradient descent parameter update method to obtain the document topic vector.
5. The deep learning-based document topic vector extraction method according to claim 4, characterized in that in step 5.3 classification is performed by logistic regression, and the objective function is given by formula (11):
P(y=wd+1|Vd) = exp(θd+1T·Vd) / Σ(i=1..|V|) exp(θiT·Vd) (11)
wherein θd+1 is the parameter corresponding to the position of the target word wd+1, θi is the parameter corresponding to the word wi in the vocabulary, |V| denotes the size of the vocabulary, Vd is the concatenated vector obtained in step 5.2, exp denotes the exponential function with base e, Σ denotes summation; P denotes probability, y denotes the dependent variable, and T denotes matrix transposition.
CN201810748564.1A 2018-07-10 2018-07-10 A deep learning-based document topic vector extraction method Active CN108984526B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810748564.1A CN108984526B (en) 2018-07-10 2018-07-10 A deep learning-based document topic vector extraction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810748564.1A CN108984526B (en) 2018-07-10 2018-07-10 A deep learning-based document topic vector extraction method

Publications (2)

Publication Number Publication Date
CN108984526A true CN108984526A (en) 2018-12-11
CN108984526B CN108984526B (en) 2021-05-07

Family

ID=64536620

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810748564.1A Active CN108984526B (en) 2018-07-10 2018-07-10 A deep learning-based document topic vector extraction method

Country Status (1)

Country Link
CN (1) CN108984526B (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105069143A (en) * 2015-08-19 2015-11-18 百度在线网络技术(北京)有限公司 Method and device for extracting keywords from document
CN106547735A (en) * 2016-10-25 2017-03-29 复旦大学 The structure and using method of the dynamic word or word vector based on the context-aware of deep learning
CN106909537A (en) * 2017-02-07 2017-06-30 中山大学 A kind of polysemy analysis method based on topic model and vector space
CN106919557A (en) * 2017-02-22 2017-07-04 中山大学 A kind of document vector generation method of combination topic model
CN107092596A (en) * 2017-04-24 2017-08-25 重庆邮电大学 Text emotion analysis method based on attention CNNs and CCR
CN107423282A (en) * 2017-05-24 2017-12-01 南京大学 Semantic Coherence Sexual Themes and the concurrent extracting method of term vector in text based on composite character
CN107562792A (en) * 2017-07-31 2018-01-09 同济大学 A kind of question and answer matching process based on deep learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
GUANGXU XUN 等: "Topic Discovery for Short Texts Using Word Embeddings", 《2016 IEEE 16TH INTERNATIONAL CONFERENCE ON DATA MINING》 *
胡朝举 等: "基于词向量技术和混合神经网络的情感分析", 《计算机应用研究》 *

Cited By (41)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109871532A (en) * 2019-01-04 2019-06-11 平安科技(深圳)有限公司 Text subject extracting method, device and storage medium
CN111414483B (en) * 2019-01-04 2023-03-28 阿里巴巴集团控股有限公司 Document processing device and method
CN111414483A (en) * 2019-01-04 2020-07-14 阿里巴巴集团控股有限公司 Document processing device and method
WO2020140633A1 (en) * 2019-01-04 2020-07-09 平安科技(深圳)有限公司 Text topic extraction method, apparatus, electronic device, and storage medium
CN109960802A (en) * 2019-03-19 2019-07-02 四川大学 Information processing method and device for narrative text of aviation safety report
CN109933804A (en) * 2019-03-27 2019-06-25 北京信息科技大学 A Keyword Extraction Method by Fusing Topic Information and Bidirectional LSTM
CN110334358A (en) * 2019-04-28 2019-10-15 厦门大学 A Context-aware Phrase Representation Learning Method
CN110083710A (en) * 2019-04-30 2019-08-02 北京工业大学 It is a kind of that generation method is defined based on Recognition with Recurrent Neural Network and the word of latent variable structure
CN110083710B (en) * 2019-04-30 2021-04-02 北京工业大学 A Word Definition Generation Method Based on Recurrent Neural Network and Latent Variable Structure
CN110532395B (en) * 2019-05-13 2021-09-28 南京大学 Semantic embedding-based word vector improvement model establishing method
CN110532395A (en) * 2019-05-13 2019-12-03 南京大学 A kind of method for building up of the term vector improved model based on semantic embedding
CN110825848A (en) * 2019-06-10 2020-02-21 北京理工大学 Text classification method based on phrase vectors
CN110825848B (en) * 2019-06-10 2022-08-09 北京理工大学 Text classification method based on phrase vectors
CN110263343B (en) * 2019-06-24 2021-06-15 北京理工大学 Method and system for keyword extraction based on phrase vector
CN110263343A (en) * 2019-06-24 2019-09-20 北京理工大学 The keyword abstraction method and system of phrase-based vector
CN110457674A (en) * 2019-06-25 2019-11-15 西安电子科技大学 A Topic-Guided Text Prediction Method
CN110378409A (en) * 2019-07-15 2019-10-25 昆明理工大学 It is a kind of based on element association attention mechanism the Chinese get over news documents abstraction generating method
CN110472047B (en) * 2019-07-15 2022-12-13 昆明理工大学 A Chinese-Vietnamese News Opinion Sentence Extraction Method Based on Multi-Feature Fusion
CN110472047A (en) * 2019-07-15 2019-11-19 昆明理工大学 A kind of Chinese of multiple features fusion gets over news viewpoint sentence abstracting method
CN110781256A (en) * 2019-08-30 2020-02-11 腾讯大地通途(北京)科技有限公司 Method and device for determining POI (Point of interest) matched with Wi-Fi (Wireless Fidelity) based on transmitted position data
CN110781256B (en) * 2019-08-30 2024-02-23 腾讯大地通途(北京)科技有限公司 Method and device for determining POI matched with Wi-Fi based on sending position data
CN110766073B (en) * 2019-10-22 2023-10-27 湖南科技大学 Mobile application classification method for strengthening topic attention mechanism
CN110766073A (en) * 2019-10-22 2020-02-07 湖南科技大学 Mobile application classification method for strengthening topic attention mechanism
CN111125434B (en) * 2019-11-26 2023-06-27 北京理工大学 A method and system for relation extraction based on ensemble learning
CN111125434A (en) * 2019-11-26 2020-05-08 北京理工大学 A method and system for relation extraction based on ensemble learning
WO2021155705A1 (en) * 2020-02-06 2021-08-12 支付宝(杭州)信息技术有限公司 Text prediction model training method and apparatus
CN111696624A (en) * 2020-06-08 2020-09-22 天津大学 DNA binding protein identification and function annotation deep learning method based on self-attention mechanism
CN111696624B (en) * 2020-06-08 2022-07-12 天津大学 Deep learning method for DNA-binding protein identification and functional annotation based on self-attention mechanism
CN111753540B (en) * 2020-06-24 2023-04-07 云南电网有限责任公司信息中心 Method and system for collecting text data to perform Natural Language Processing (NLP)
CN111753540A (en) * 2020-06-24 2020-10-09 云南电网有限责任公司信息中心 Method and system for collecting text data to perform Natural Language Processing (NLP)
CN112597311B (en) * 2020-12-28 2023-07-11 东方红卫星移动通信有限公司 Terminal information classification method and system based on low-orbit satellite communication
CN112597311A (en) * 2020-12-28 2021-04-02 东方红卫星移动通信有限公司 Terminal information classification method and system based on low-earth-orbit satellite communication
CN112632966A (en) * 2020-12-30 2021-04-09 绿盟科技集团股份有限公司 Alarm information marking method, device, medium and equipment
CN112685538B (en) * 2020-12-30 2022-10-14 北京理工大学 Text vector retrieval method combined with external knowledge
CN112632966B (en) * 2020-12-30 2023-07-21 绿盟科技集团股份有限公司 Alarm information marking method, device, medium and equipment
CN112685538A (en) * 2020-12-30 2021-04-20 北京理工大学 Text vector retrieval method combined with external knowledge
CN112699662B (en) * 2020-12-31 2022-08-16 太原理工大学 False information early detection method based on text structure algorithm
CN112699662A (en) * 2020-12-31 2021-04-23 太原理工大学 False information early detection method based on text structure algorithm
CN112966551A (en) * 2021-01-29 2021-06-15 湖南科技学院 Method and device for acquiring video frame description information and electronic equipment
CN115763167A (en) * 2022-11-22 2023-03-07 黄华集团有限公司 Solid cabinet breaker and control method thereof
CN115763167B (en) * 2022-11-22 2023-09-22 黄华集团有限公司 Solid cabinet circuit breaker and control method thereof

Also Published As

Publication number Publication date
CN108984526B (en) 2021-05-07

Similar Documents

Publication Publication Date Title
CN108984526B (en) A deep learning-based document topic vector extraction method
CN111967266B (en) Chinese named entity recognition system, model construction method, application and related equipment
CN110866117B (en) Short text classification method based on semantic enhancement and multi-level label embedding
CN109977416B (en) Multi-level natural language anti-spam text method and system
CN111368996B (en) Retraining projection network capable of transmitting natural language representation
WO2022022163A1 (en) Text classification model training method, device, apparatus, and storage medium
Wang et al. Application of convolutional neural network in natural language processing
CN111738003B (en) Named entity recognition model training method, named entity recognition method and medium
CN111046179B (en) A text classification method for open network questions in specific domains
CN109325231B (en) A method for generating word vectors by a multi-task model
CN104834747B (en) Short text classification method based on convolutional neural networks
CN111930942B (en) Text classification method, language model training method, device and equipment
CN111291556B (en) Chinese entity relation extraction method based on character and word feature fusion of entity meaning item
CN110765775A (en) A Domain Adaptation Method for Named Entity Recognition Fusing Semantics and Label Differences
CN109977199B (en) A reading comprehension method based on attention pooling mechanism
CN112101041A (en) Entity relationship extraction method, device, equipment and medium based on semantic similarity
CN113392209B (en) Text clustering method based on artificial intelligence, related equipment and storage medium
Zhang et al. Ynu-hpcc at semeval-2018 task 1: Bilstm with attention based sentiment analysis for affect in tweets
CN108038492A (en) A kind of perceptual term vector and sensibility classification method based on deep learning
Chen et al. Customizable text generation via conditional text generative adversarial network
CN111159405B (en) Irony detection method based on background knowledge
CN113988079A (en) A dynamic enhanced multi-hop text reading recognition processing method for low data
CN111145914A (en) Method and device for determining lung cancer clinical disease library text entity
CN113051886A (en) Test question duplicate checking method and device, storage medium and equipment
CN111061873B (en) A Multi-Channel Text Classification Method Based on Attention Mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant