CN108984526A - Document theme vector extraction method based on deep learning - Google Patents
Document theme vector extraction method based on deep learning
- Publication number
- CN108984526A (Application CN201810748564.1A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications

- G06F40/30: Handling natural language data; Semantic analysis
- G06F40/258: Natural language analysis; Heading extraction; Automatic titling; Numbering
- G06F40/284: Natural language analysis; Recognition of textual entities; Lexical analysis, e.g. tokenisation or collocates
- G06N3/045: Neural networks; Architecture, e.g. interconnection topology; Combinations of networks
Abstract
The present invention relates to a document theme vector extraction method based on deep learning, and belongs to the technical field of natural language processing. The method extracts deep, local semantic information with a convolutional neural network and learns temporal information with an LSTM model, so that the vector semantics are more comprehensive. It exploits the implicit co-occurrence relation between context phrases and document themes, avoiding the weakness that some sentence-based theme vector models show on short texts. The CNN and LSTM models are organically combined through an attention mechanism, learning the deep semantics, temporal information and salient information of the context, so that a document theme vector extraction model is constructed more effectively.
Description
Technical Field
The invention relates to a document theme vector extraction method based on deep learning, and belongs to the technical field of natural language processing.
Background
In today's big data era, how to discover the topics of massive internet text data is a research focus. In analyzing the themes of text data, the document theme vector is essentially a deep semantic representation of the document, an inherent combination of theme and semantics. Extracted document theme vectors can be widely applied to natural language processing tasks, including public opinion analysis of social networks and new media, timely acquisition of news hotspots, and the like. Therefore, how to extract document theme vectors efficiently is an important research topic.
For text data, the theme is not necessarily embodied directly in the specific text content, which makes it difficult to mine the theme implied by the text: the theme of a document must be extracted from the relationships among its words, sentences and paragraphs, combined with the discourse structure of the document. With the enrichment of statistical natural language processing methods and corpora in recent years, text topic modeling methods based on word-topic and document-topic distributions have been proposed in succession. Their basic idea is to assume that the topic of each word and document obeys a statistical probability distribution, to estimate the probability distribution of document topics by training on document data, and then to cluster documents by topic.
To analyze the topic of each document correctly, the conventional method performs topic analysis on every word of the text, but this has a major problem: the words that really determine the text theme account for only a small fraction of the words of the text. The traditional method therefore spends a large amount of analysis on words irrelevant to the theme, which on the one hand incurs a large computational cost, and on the other hand fails to extract the text theme accurately or to mine the deep semantics of the text by combining its internal association relations.
With the improvement of hardware performance and the continuous growth of data scale, deep learning has been widely applied in various fields, and experimental results have improved greatly over previous baselines. Owing to its elegant models and flexible architectures, deep learning has in recent years been widely applied to methods combining word embedding and document embedding. Among deep learning methods, the CNN (Convolutional Neural Network) and the LSTM (Long Short-Term Memory network) are the two most popular. In natural language processing, text analysis methods based on the CNN and LSTM models can discover latent semantic features of text well, which greatly helps the semantic analysis of natural language processing tasks such as automatic summarization, sentiment analysis and machine translation.
Disclosure of Invention
The invention aims to overcome the defects of the prior art, to solve the problem of mining the deep semantics of a text by combining its internal association relations, and to provide a document theme vector extraction method based on deep learning. The invention focuses on modeling and analyzing document theme vectors, mining the association implied between text features and theme vectors, and thereby learning document theme vectors.
The core idea of the invention is as follows: extract the semantics of context phrases with the CNN, input the extracted semantics into an LSTM model, and use an attention mechanism to weight the importance of words at different positions and with different meanings in the text, thereby retaining important information. This completes the organic combination of the CNN and LSTM models, mines the internal associations between contexts, and learns document theme vectors with deep semantics and saliency.
The method of the invention is realized by the following technical scheme.
A document theme vector extraction method based on deep learning comprises the following steps:
Step one: make the relevant definitions, as follows:

Definition 1: document D, D = [w_1, w_2, ..., w_i, ..., w_n], where w_i represents the i-th word of document D;

Definition 2: predicted word w_{d+1}, representing the target word to be learned;

Definition 3: window words, formed by several consecutive words of the text; hidden internal associations exist between window words;

Definition 4: context phrase, the window words appearing before the position of the predicted word; with window length l, the context phrase is denoted w_{d-l}, w_{d-l+1}, ..., w_d;

Definition 5: document theme mapping matrix, obtained by learning with the LDA (Latent Dirichlet Allocation) algorithm; each row represents the theme of one document;

Definition 6: N_d and doc_id, where N_d represents the number of documents in the corpus and doc_id represents the position of a document; each document corresponds to a unique doc_id, with 1 ≤ doc_id ≤ N_d.

Step two: learn the semantic vector of the context phrase using the CNN.

Step three: learn the semantics of the context phrase with an LSTM model to obtain the hidden layer vectors h_{d-l}, h_{d-l+1}, ..., h_d.

Step four: organically combine the CNN model and the LSTM model through an attention mechanism to obtain the mean h̄ of the semantic vectors of the context phrase.

Step five: using the mean h̄ of the context phrase semantic vectors and the document topic information, predict the target word w_{d+1} by the method of logistic regression, obtaining the prediction probability of the target word w_{d+1}.
Advantageous effects
Compared with the prior art, the document theme vector extraction method based on deep learning has the following beneficial effects:
1. local deep semantic information is extracted with the CNN;
2. temporal information is learned with the LSTM model, making the vector semantics more comprehensive;
3. the implicit co-occurrence relation between context phrases and the document theme is exploited, avoiding the weakness that some sentence-based theme vector models show on short texts;
4. the CNN model and the LSTM model are organically combined through an attention mechanism, learning the deep semantics, temporal information and salient information of the context, so that the document theme vector extraction model is constructed more effectively.
Drawings
FIG. 1 is a flowchart of a document theme vector extraction method based on deep learning according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the method of the present invention is further described in detail below with reference to the accompanying drawings and embodiments.
A document theme vector extraction method based on deep learning is implemented in the following basic steps:
Step one: make the relevant definitions, as follows:

Definition 1: document D, D = [w_1, w_2, ..., w_i, ..., w_n], where w_i represents the i-th word of document D;

Definition 2: predicted word w_{d+1}, representing the target word to be learned;

Definition 3: window words, formed by several consecutive words of the text; hidden internal associations exist between window words;

Definition 4: context phrase w_{d-l}, w_{d-l+1}, ..., w_d, the window words appearing before the position of the predicted word, with context phrase length l;

Definition 5: document theme mapping matrix, obtained by learning with the LDA algorithm; each row represents the theme of one document;

Definition 6: N_d and doc_id, where N_d represents the number of documents in the corpus and doc_id represents the position of a document; each document corresponds to a unique doc_id, with 1 ≤ doc_id ≤ N_d.
Step two: learn the semantic vector Context of the context phrase using the CNN.
The specific implementation process is as follows:
Step 2.1: train the word vector matrix of document D with the word2vec algorithm or a similar algorithm, the matrix having size n × m, where n represents the length of the word vector matrix and m represents its width;

Step 2.2: extract the word vector corresponding to each word of the context phrase from the word vector matrix obtained in step 2.1, thereby obtaining the vector matrix M of the context phrase w_{d-l}, w_{d-l+1}, ..., w_d;

Step 2.3: calculate the semantic vector Context of the context phrase using the CNN. Specifically, the vector matrix M obtained in step 2.2 is operated on with K convolution kernels of size C_l × C_m;

where K represents the number of convolution kernels (K = 128 in this embodiment), C_l represents the length of the convolution kernel, with C_l = l, and C_m represents the width of the convolution kernel, with C_m = m;
The semantic vector Context of the context phrase is calculated by equation (1):

Context_k = Σ_{p=1..l} Σ_{q=1..m} c_pq · M_pq + b,  1 ≤ k ≤ K    (1)

Context = [Context_1, Context_2, ..., Context_K]

where Context_k represents the k-th dimension of the semantic vector of the context phrase (computed with the parameters of the k-th kernel), l represents the context phrase length, m represents the width of the word vector matrix, i.e. the word vector dimension, d represents the starting position of the first word of the context phrase, c_pq is the weight parameter in row p, column q of the convolution kernel, M_pq represents the entry in row p, column q of the vector matrix M, and b is the bias parameter of the convolution kernel;
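As an illustration of step two, the following minimal NumPy sketch computes equation (1) for one context phrase; the kernel tensor layout, the helper name context_vector and the toy dimensions are assumptions for illustration, not part of the patent. Because C_l = l and C_m = m, each kernel covers the whole matrix M, so the convolution reduces to one weighted sum per kernel.

```python
import numpy as np

def context_vector(M, kernels, biases):
    # M       : (l, m) matrix of stacked context-phrase word vectors
    # kernels : (K, l, m) array, one full-size kernel c_pq per output dimension
    # biases  : (K,) bias b of each kernel
    # Each output dimension k is sum_{p,q} c_pq * M_pq + b, i.e. equation (1).
    return np.einsum('kpq,pq->k', kernels, M) + biases

# toy usage: l = 3 context words, m = 4-dim word vectors, K = 128 kernels
rng = np.random.default_rng(0)
M = rng.normal(size=(3, 4))
kernels = rng.normal(size=(128, 3, 4))
biases = np.zeros(128)
Context = context_vector(M, kernels, biases)  # shape (128,)
```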
Step three: learn the semantics of the context phrase with the LSTM model to obtain the hidden layer vectors h_{d-l}, h_{d-l+1}, ..., h_d.
The specific implementation process is as follows:
Step 3.1: assign d-l to t, i.e. t = d-l, where t denotes the t-th time step;

Step 3.2: assign the word vector of w_t to x_t, where x_t represents the word vector input at time t and w_t represents the word input at time t;

where the word vector of w_t is obtained by lookup in the word vector matrix output by step 2.1, i.e. by extracting the word vector at the position of the vector matrix M corresponding to w_t;

Step 3.3: take x_t as the input of the LSTM model to obtain the hidden layer vector h_t at time t;
The specific implementation process is as follows:
Step 3.3.1: calculate the forget gate f_t at time t, which controls what information is forgotten, by equation (2):

f_t = σ(W_f x_t + U_f h_{t-1} + b_f)    (2)

where W_f and U_f represent parameter matrices, x_t represents the word vector input at time t, h_{t-1} represents the hidden layer vector at time t-1, and b_f denotes the bias vector parameter; when t = d-l, h_{t-1} = h_{d-l-1}, and h_{d-l-1} is a zero vector; σ represents the Sigmoid function, the activation function of the LSTM model;
Step 3.3.2: calculate the input gate i_t at time t, which controls the new information to be added at the current time, by equation (3):

i_t = σ(W_i x_t + U_i h_{t-1} + b_i)    (3)

where W_i and U_i represent parameter matrices, x_t represents the word vector input at time t, h_{t-1} represents the hidden layer vector at time t-1, b_i represents the bias vector parameter, and σ represents the Sigmoid function, the activation function of the LSTM model;
Step 3.3.3: calculate the updated information c̃_t at time t by equation (4):

c̃_t = tanh(W_c x_t + U_c h_{t-1} + b_c)    (4)

where W_c and U_c represent parameter matrices, x_t represents the word vector input at time t, h_{t-1} represents the hidden layer vector at time t-1, b_c represents the bias vector parameter, and tanh represents the hyperbolic tangent function, an activation function of the LSTM model;
Step 3.3.4: calculate the information at time t by adding the retained information of the previous time and the updated information of the current time, by equation (5):

c_t = f_t ∘ c_{t-1} + i_t ∘ c̃_t    (5)

where c_t represents the information at time t, f_t represents the forget gate at time t, c_{t-1} represents the information at time t-1, i_t represents the input gate at time t, c̃_t represents the updated information at time t, and ∘ represents the element-wise product of vectors;
Step 3.3.5: calculate the output gate o_t at time t, which controls the output information, by equation (6):

o_t = σ(W_o x_t + U_o h_{t-1} + b_o)    (6)

where W_o and U_o represent parameter matrices, x_t represents the word vector input at time t, h_{t-1} represents the hidden layer vector at time t-1, b_o represents the bias vector parameter, and σ represents the Sigmoid function, the activation function of the LSTM model; the parameter matrices W_f, U_f, W_i, U_i, W_o, U_o of steps 3.3.1-3.3.5 have different matrix elements, and the bias vector parameters b_f, b_i, b_o likewise differ;

Step 3.3.6: calculate the hidden layer vector h_t at time t by equation (7):
h_t = o_t ∘ c_t    (7)

where o_t represents the output gate at time t and c_t represents the information at time t;
Step 3.4: judge whether t equals d; if not, add 1 to t and jump to step 3.2; if so, output the hidden layer vectors h_{d-l}, h_{d-l+1}, ..., h_d and jump to step four.
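As an illustration of step three, the following NumPy sketch runs equations (2)-(7) over the context phrase; the parameter names W_c, U_c, b_c for equation (4) and the dict layout are assumptions for illustration. Note that equation (7) reads h_t = o_t ∘ c_t, without the tanh over the cell state found in textbook LSTM variants; the sketch keeps the patent's form.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_hidden_states(X, p):
    # X : (l+1, m) word vectors x_{d-l..d} of the context phrase, in order
    # p : dict of parameters; each W_* is (h, m), U_* is (h, h), b_* is (h,)
    h = np.zeros(p['b_f'].shape[0])  # h_{d-l-1} is the zero vector
    c = np.zeros_like(h)
    hs = []
    for x in X:
        f = sigmoid(p['W_f'] @ x + p['U_f'] @ h + p['b_f'])        # (2) forget gate
        i = sigmoid(p['W_i'] @ x + p['U_i'] @ h + p['b_i'])        # (3) input gate
        c_tilde = np.tanh(p['W_c'] @ x + p['U_c'] @ h + p['b_c'])  # (4) updated info
        c = f * c + i * c_tilde                                    # (5) cell info
        o = sigmoid(p['W_o'] @ x + p['U_o'] @ h + p['b_o'])        # (6) output gate
        h = o * c                                                  # (7) hidden vector
        hs.append(h)
    return np.stack(hs)  # (l+1, h_dim): h_{d-l}, ..., h_d
```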
Step four: combine the CNN model and the LSTM model through the attention mechanism to obtain the mean h̄ of the semantic vectors of the context phrase. The specific implementation process is as follows:

Step 4.1: using the context phrase semantic vector obtained in step two, obtain through the attention mechanism the importance factor α of each word with respect to the semantic vector of the context phrase, calculated by equation (8):
α_t = e^(Context^T x_t) / Σ_{i=d-l..d} e^(Context^T x_i),  d-l ≤ t ≤ d

α = [α_{d-l}, α_{d-l+1}, ..., α_d]    (8)

where α_t represents the importance factor of the word at time t with respect to the semantic vector of the context phrase, Context represents the semantic vector of the context phrase obtained in step two, x_t represents the word vector input at time t and x_i the word vector input at time i; T represents vector transposition; e represents the exponential function with the natural constant e as base;
Step 4.2: calculate the weighted hidden layer vectors h' based on the attention mechanism by equation (9):

h'_t = α_t · h_t,  d-l ≤ t ≤ d

h' = [h'_{d-l}, h'_{d-l+1}, ..., h'_d]    (9)

where h'_t represents the weighted hidden layer vector at time t, α_t represents the importance factor of the word at time t with respect to the semantic vector of the context phrase, and h_t represents the hidden layer vector at time t;
Step 4.3: calculate the mean h̄ of the semantic vectors of the context phrase with a mean-pooling operation, by equation (10):

h̄ = (1 / (l+1)) Σ_{t=d-l..d} h'_t    (10)

where h'_t represents the weighted hidden layer vector at time t;
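A minimal sketch of equations (8)-(10) follows, assuming the word vector dimension equals the Context dimension (both are 128 in the embodiment) so that Context^T x_t is well defined; the function name and the max-shift for numerical stability are illustrative additions.

```python
import numpy as np

def attention_pooled_mean(Context, X, H):
    # Context : (K,) context-phrase semantic vector from step two
    # X       : (l+1, K) word vectors x_{d-l..d} (assumed same dim as Context)
    # H       : (l+1, h) hidden layer vectors h_{d-l..d} from step three
    scores = X @ Context                   # Context^T x_t for each t
    alpha = np.exp(scores - scores.max())  # (8): softmax weights (shifted for stability)
    alpha /= alpha.sum()
    H_weighted = alpha[:, None] * H        # (9): h'_t = alpha_t * h_t
    return H_weighted.mean(axis=0)         # (10): mean pooling over t = d-l .. d
```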
Step five: by the method of logistic regression, use the mean h̄ of the context phrase semantic vectors and the document topic information to predict the target word w_{d+1}, obtaining the prediction probability of the target word w_{d+1}. The specific implementation process is as follows:
Step 5.1: learn the document theme mapping matrix with the LDA algorithm, then map each document, according to the document theme mapping matrix and its doc_id, into a one-dimensional vector D_z whose length equals the width of the word vector matrix of step 2.1;

Step 5.2: concatenate the vector D_z output by step 5.1 and the mean h̄ of the context phrase semantic vectors output by step four to obtain the concatenated vector V_d;

Step 5.3: use the V_d output by step 5.2 to predict the target word w_{d+1}. Classification is performed by the logistic regression method, with the objective function given by equation (11):

P(y = w_{d+1} | V_d) = exp(θ_{d+1}^T V_d) / Σ_{i=1..|V|} exp(θ_i^T V_d)    (11)

where θ_{d+1} is the parameter corresponding to the position of the target word w_{d+1}, θ_i is the parameter corresponding to word w_i of the vocabulary, |V| represents the size of the vocabulary, V_d is the concatenated vector obtained in step 5.2, exp represents the exponential function with base e, Σ represents summation, P represents probability, y represents the dependent variable, and T represents matrix transposition.
Step 5.4, calculating a loss function of the target function (11) by using a cross entropy method through a formula (12):
L=-log(P(y=wd+1|Vd)) (12)
wherein, wd+1Representing a target word, VdIs the concatenation vector of step 4.2, log () represents a base-10 logarithmic function;
Solve loss function (12) with the Sampled Softmax algorithm and mini-batch stochastic gradient descent parameter updating to obtain the document theme vector.
Through steps one to five, the extraction of document theme vectors with deep semantics and saliency is completed.
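For concreteness, the prediction and loss of step five can be sketched as follows for one training example; the full-softmax form is shown for clarity, whereas the patent samples the denominator with Sampled Softmax. The base-10 logarithm follows the text (the base only rescales the loss); the function name and shapes are illustrative assumptions.

```python
import numpy as np

def target_word_loss(V_d, Theta, target_idx):
    # V_d        : (dim,) concatenation of D_z and the pooled mean from step four
    # Theta      : (|V|, dim) one parameter vector theta_i per vocabulary word
    # target_idx : vocabulary index of the target word w_{d+1}
    logits = Theta @ V_d
    logits -= logits.max()                         # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()  # (11): softmax over the vocabulary
    loss = -np.log10(probs[target_idx])            # (12): the text specifies base-10 log
    return probs[target_idx], loss
```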
Examples
This example describes the practice of the present invention, as shown in FIG. 1.
As can be seen from fig. 1, the process of the document theme vector extraction method based on deep learning of the present invention is as follows:
Step A: preprocessing. Meaningless symbols such as special characters are first removed from the corpus, and the text is then segmented. Word segmentation divides a continuous text sequence into individual words according to given lexical rules, so that a sentence is decomposed into a number of consecutive, meaningful word strings for subsequent analysis; here, segmentation is performed with a PTB tokenizer. After segmentation, a vocabulary is constructed for the original text; in this embodiment the vocabulary takes the 20,000 most frequent words of the training text, i.e. the vocabulary size |V| is 20,000. After the vocabulary is selected, word index data for the original corpus is constructed from the vocabulary indices, and this text word index data serves as the input of the model.
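The vocabulary construction of step A can be sketched as follows; the reserved out-of-vocabulary index and the helper name build_vocab are illustrative assumptions, since the patent does not say how out-of-vocabulary words are handled.

```python
from collections import Counter

def build_vocab(tokenized_docs, vocab_size=20000):
    # tokenized_docs: list of token lists, e.g. output of a PTB-style tokenizer
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    vocab = {'<unk>': 0}  # index 0 reserved for out-of-vocabulary words (assumption)
    for word, _ in counts.most_common(vocab_size - 1):
        vocab[word] = len(vocab)
    return vocab

tokenized_docs = [['the', 'cat', 'sat'], ['the', 'dog', 'ran']]
vocab = build_vocab(tokenized_docs)
index_data = [[vocab.get(tok, 0) for tok in doc] for doc in tokenized_docs]  # model input
```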
Step B: learn word vectors with the word2vec algorithm. The words of the documents are input to the word2vec algorithm to obtain word vectors, with the objective function given by equation (13):

L = Σ_{i ∈ Corp} log p(w_i | w_{i-k}, ..., w_{i+k})    (13)

where k is the window size, i indexes the current word, and Corp is the set of words in the corpus; 128-dimensional word vectors are learned by gradient descent;
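In practice, step B can be carried out with an off-the-shelf word2vec implementation; the sketch below uses gensim, and the window size and the skip-gram/CBOW choice are assumptions, since the text fixes only the 128-dimensional vector size.

```python
from gensim.models import Word2Vec

tokenized_docs = [['the', 'cat', 'sat'], ['the', 'dog', 'ran']]  # from step A
model = Word2Vec(
    sentences=tokenized_docs,
    vector_size=128,  # 128-dimensional word vectors, as in the embodiment
    window=5,         # window size k: the exact value is not given in the text
    min_count=1,
    sg=0,             # CBOW, matching the context-to-target form of equation (13)
)
word_vec = model.wv['cat']  # 128-dim vector of a vocabulary word
```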
Step C: extract the context phrase semantic vector with the CNN, and learn the context phrase hidden layer vectors with the RNN (here, the LSTM); the two computations run in parallel. Specifically, in this embodiment:
Extract the context phrase semantic vector with the CNN: first, K convolution kernels of size C_l × C_m are randomly initialized from a Gaussian distribution; for a given context phrase w_{d-l}, w_{d-l+1}, ..., w_d, the context phrase is mapped into a matrix of size l × m through the word vectors learned in step B, where l is the context phrase length and m the word vector dimension; this matrix is then convolved with the randomly initialized kernels, in the specific manner of equation (1), yielding the vector Context, the semantic vector of the context phrase;
Learn the context phrase hidden layer vectors with the RNN: input the word vectors corresponding to the context phrase w_{d-l}, w_{d-l+1}, ..., w_d into the LSTM model in sequence, set each dimension of the hidden layer vector h_0 at time 0 to 0, then compute the forget gates, input gates, output gates and finally the context phrase hidden layer vectors in sequence by equations (2)-(7), with the dimension set to 128;
Step D: calculate the weighted semantic vectors with the attention mechanism, and calculate the document theme distribution; the two computations run in parallel. Specifically, in this embodiment:

Compute the weighted vectors with the attention mechanism: according to the word vectors obtained in step B and the context phrase semantic vector obtained in step C, an attention operation is performed on each word of the context phrase to obtain the attention factor α_t. α_t is a real number between 0 and 1; the larger it is, the more word vector information of the corresponding position is kept in the final mean-pooling layer, so its magnitude indicates how important the current word is for representing the meaning of the whole phrase, that is, the more important words receive more attention;
Calculate the document theme distribution: specifically, document D is input to the LDA algorithm, the theme distribution of each document is obtained by the LDA computation, and this distribution is used directly as the final result, denoted D_z;
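The LDA step can likewise be sketched with gensim; the number of topics is an assumption here (per step 5.1 it should equal the word vector width, 128 in the embodiment).

```python
from gensim import corpora
from gensim.models import LdaModel

tokenized_docs = [['cat', 'sat', 'mat'], ['dog', 'ran', 'park']]
dictionary = corpora.Dictionary(tokenized_docs)
bow = [dictionary.doc2bow(doc) for doc in tokenized_docs]

lda = LdaModel(corpus=bow, id2word=dictionary, num_topics=128)  # 128 topics assumed
# D_z: the topic distribution of one document, used directly as its theme vector
D_z = lda.get_document_topics(bow[0], minimum_probability=0.0)
```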
Step E: predict the target word and learn the document theme vector; directly concatenate the weighted semantic vector with D_z, then maximize the probability of the target word, and obtain the document theme vector through the Sampled Softmax algorithm and mini-batch stochastic gradient descent parameter updating.
Claims (5)
1. A document theme vector extraction method based on deep learning is characterized by comprising the following steps:
step one, performing relevant definition, specifically as follows:
Definition 1: document D, D = [w_1, w_2, ..., w_i, ..., w_n], where w_i represents the i-th word of document D;

Definition 2: predicted word w_{d+1}, representing the target word to be learned;

Definition 3: window words, formed by consecutive words of the text; hidden internal associations exist between window words;

Definition 4: context phrase w_{d-l}, w_{d-l+1}, ..., w_d, the window words appearing before the position of the predicted word, with context phrase length l;

Definition 5: document theme mapping matrix, obtained by learning with the LDA algorithm; each row represents the theme of one document;

Definition 6: N_d and doc_id, where N_d represents the number of documents in the corpus and doc_id represents the position of a document; each document corresponds to a unique doc_id, with 1 ≤ doc_id ≤ N_d;
Step two: learn the semantic vector of the context phrase using a convolutional neural network (CNN);

Step three: learn the semantics of the context phrase with the long short-term memory network model (LSTM) to obtain the hidden layer vectors h_{d-l}, h_{d-l+1}, ..., h_d; the specific implementation process is as follows:

Step 3.1: assign d-l to t, i.e. t = d-l, where t denotes the t-th time step;

Step 3.2: assign the word vector of w_t to x_t, where x_t represents the word vector input at time t and w_t represents the word input at time t;

where the word vector of w_t is obtained by lookup in the word vector matrix output by step 2.1, i.e. by extracting the word vector at the position of the vector matrix M corresponding to w_t;

Step 3.3: take x_t as the input of the LSTM model to obtain the hidden layer vector h_t at time t;

Step 3.4: judge whether t equals d; if not, add 1 to t and jump to step 3.2; if so, output the hidden layer vectors h_{d-l}, h_{d-l+1}, ..., h_d and jump to step four;
Step four: organically combine the CNN model and the LSTM model through the attention mechanism to obtain the mean h̄ of the semantic vectors of the context phrase; the specific implementation method is as follows:

Step 4.1: using the context phrase semantic vector obtained in step two, obtain through the attention mechanism the importance factor α of each word with respect to the semantic vector of the context phrase, specifically calculated by the following formula:

α_t = e^(Context^T x_t) / Σ_{i=d-l..d} e^(Context^T x_i),  d-l ≤ t ≤ d

α = [α_{d-l}, α_{d-l+1}, ..., α_d]

where α_t represents the importance factor of the word at time t with respect to the semantic vector of the context phrase, Context represents the semantic vector of the context phrase obtained in step two, x_t represents the word vector input at time t and x_i the word vector input at time i; T represents vector transposition; e represents the exponential function with the natural constant e as base;
Step 4.2: calculate the weighted hidden layer vectors h' based on the attention mechanism by the following formula:

h'_t = α_t · h_t,  d-l ≤ t ≤ d

h' = [h'_{d-l}, h'_{d-l+1}, ..., h'_d]

where h'_t represents the weighted hidden layer vector at time t, α_t represents the importance factor of the word at time t with respect to the semantic vector of the context phrase, and h_t represents the hidden layer vector at time t;

Step 4.3: calculate the mean h̄ of the semantic vectors of the context phrase with a mean-pooling operation, by the following formula:

h̄ = (1 / (l+1)) Σ_{t=d-l..d} h'_t

where h'_t represents the weighted hidden layer vector at time t;
Step five: by the logistic regression method, use the mean h̄ of the semantic vectors of the context phrase and the document topic information to predict the target word w_{d+1}, obtaining the prediction probability of the target word w_{d+1}.
2. The method for extracting document theme vectors based on deep learning of claim 1, wherein the second step is implemented by the following steps:
Step 2.1: train the word vector matrix of document D, the matrix having size n × m, where n represents the length of the word vector matrix and m represents its width;

Step 2.2: extract the word vector corresponding to each word of the context phrase from the word vector matrix obtained in step 2.1 to obtain the vector matrix M of the context phrase w_{d-l}, w_{d-l+1}, ..., w_d;

Step 2.3: calculate the semantic vector Context of the context phrase using the CNN, wherein the vector matrix M obtained in step 2.2 is operated on with K convolution kernels of size C_l × C_m;

where K denotes the number of convolution kernels, C_l represents the length of the convolution kernel, with C_l = l, and C_m represents the width of the convolution kernel, with C_m = m;
The semantic vector Context of the context phrase is calculated by equation (1):

Context_k = Σ_{p=1..l} Σ_{q=1..m} c_pq · M_pq + b,  1 ≤ k ≤ K    (1)

Context = [Context_1, Context_2, ..., Context_K]

where Context_k represents the k-th dimension of the semantic vector of the context phrase, l represents the context phrase length, m represents the width of the word vector matrix, i.e. the word vector dimension, d represents the starting position of the first word of the context phrase, c_pq is the weight parameter in row p, column q of the convolution kernel, M_pq represents the entry in row p, column q of the vector matrix M, and b is the bias parameter of the convolution kernel.
3. The method for extracting document theme vector based on deep learning of claim 1, wherein the specific implementation method of the step 3.3 is as follows:
Step 3.3.1: calculate the forget gate f_t at time t, which controls what information is forgotten, by equation (2):

f_t = σ(W_f x_t + U_f h_{t-1} + b_f)    (2)

where W_f and U_f represent parameter matrices, x_t represents the word vector input at time t, h_{t-1} represents the hidden layer vector at time t-1, and b_f denotes the bias vector parameter; when t = d-l, h_{t-1} = h_{d-l-1}, and h_{d-l-1} is a zero vector; σ represents the Sigmoid function, the activation function of the LSTM model;
Step 3.3.2: calculate the input gate i_t at time t, which controls the new information to be added at the current time, by equation (3):

i_t = σ(W_i x_t + U_i h_{t-1} + b_i)    (3)

where W_i and U_i represent parameter matrices, x_t represents the word vector input at time t, h_{t-1} represents the hidden layer vector at time t-1, b_i represents the bias vector parameter, and σ represents the Sigmoid function, the activation function of the LSTM model;
Step 3.3.3: calculate the updated information c̃_t at time t by equation (4):

c̃_t = tanh(W_c x_t + U_c h_{t-1} + b_c)    (4)

where W_c and U_c represent parameter matrices, x_t represents the word vector input at time t, h_{t-1} represents the hidden layer vector at time t-1, b_c represents the bias vector parameter, and tanh represents the hyperbolic tangent function, an activation function of the LSTM model;
Step 3.3.4: calculate the information at time t by adding the retained information of the previous time and the updated information of the current time, by equation (5):

c_t = f_t ∘ c_{t-1} + i_t ∘ c̃_t    (5)

where c_t represents the information at time t, f_t represents the forget gate at time t, c_{t-1} represents the information at time t-1, i_t represents the input gate at time t, c̃_t represents the updated information at time t, and ∘ represents the element-wise product of vectors;
Step 3.3.5: calculate the output gate o_t at time t, which controls the output information, by equation (6):

o_t = σ(W_o x_t + U_o h_{t-1} + b_o)    (6)

where W_o and U_o represent parameter matrices, x_t represents the word vector input at time t, h_{t-1} represents the hidden layer vector at time t-1, b_o represents the bias vector parameter, and σ represents the Sigmoid function, the activation function of the LSTM model; the parameter matrices W_f, U_f, W_i, U_i, W_o, U_o have different matrix elements, and the bias vector parameters b_f, b_i, b_o likewise differ;
Step 3.3.6: calculate the hidden layer vector h_t at time t by equation (7):

h_t = o_t ∘ c_t    (7)

where o_t represents the output gate at time t and c_t represents the information at time t.
4. The method for extracting document theme vectors based on deep learning of claim 1, wherein the concrete implementation method of the fifth step is as follows:
Step 5.1: learn the document theme mapping matrix, then map each document, according to the document theme mapping matrix and its doc_id, into a one-dimensional vector D_z whose length equals the width of the word vector matrix of step 2.1;

Step 5.2: concatenate the vector D_z output by step 5.1 and the mean h̄ of the context phrase semantic vectors output by step four to obtain the concatenated vector V_d;

Step 5.3: use the V_d output by step 5.2 to predict the target word w_{d+1};
Step 5.4, calculating a loss function of the target function (11) by using a cross entropy method through a formula (12):
L=-log(P(y=wd+1|Vd)) (12)
wherein, wd+1Representing a target word, VdIs the concatenation vector of step 4.2, log () represents a base-10 logarithmic function;
and (3) updating and solving the loss function (12) by a Sampled Softmax algorithm and a small batch random gradient descent parameter updating method to obtain a document theme vector.
5. The method for extracting document theme vectors based on deep learning of claim 4, wherein in step 5.3 classification is performed by the logistic regression method, with the objective function given by equation (11):

P(y = w_{d+1} | V_d) = exp(θ_{d+1}^T V_d) / Σ_{i=1..|V|} exp(θ_i^T V_d)    (11)

where θ_{d+1} is the parameter corresponding to the position of the target word w_{d+1}, θ_i is the parameter corresponding to word w_i of the vocabulary, |V| represents the size of the vocabulary, V_d is the concatenated vector obtained in step 5.2, exp represents the exponential function with base e, Σ represents summation, P represents probability, y represents the dependent variable, and T represents matrix transposition.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810748564.1A CN108984526B (en) | 2018-07-10 | 2018-07-10 | Document theme vector extraction method based on deep learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810748564.1A CN108984526B (en) | 2018-07-10 | 2018-07-10 | Document theme vector extraction method based on deep learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108984526A true CN108984526A (en) | 2018-12-11 |
CN108984526B CN108984526B (en) | 2021-05-07 |
Family
ID=64536620
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810748564.1A Active CN108984526B (en) | 2018-07-10 | 2018-07-10 | Document theme vector extraction method based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108984526B (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105069143A (en) * | 2015-08-19 | 2015-11-18 | 百度在线网络技术(北京)有限公司 | Method and device for extracting keywords from document |
CN106547735A (en) * | 2016-10-25 | 2017-03-29 | 复旦大学 | The structure and using method of the dynamic word or word vector based on the context-aware of deep learning |
CN106909537A (en) * | 2017-02-07 | 2017-06-30 | 中山大学 | A kind of polysemy analysis method based on topic model and vector space |
CN106919557A (en) * | 2017-02-22 | 2017-07-04 | 中山大学 | A kind of document vector generation method of combination topic model |
CN107092596A (en) * | 2017-04-24 | 2017-08-25 | 重庆邮电大学 | Text emotion analysis method based on attention CNNs and CCR |
CN107423282A (en) * | 2017-05-24 | 2017-12-01 | 南京大学 | Semantic Coherence Sexual Themes and the concurrent extracting method of term vector in text based on composite character |
CN107562792A (en) * | 2017-07-31 | 2018-01-09 | 同济大学 | A kind of question and answer matching process based on deep learning |
Non-Patent Citations (2)
Title |
---|
GUANGXU XUN et al.: "Topic Discovery for Short Texts Using Word Embeddings", 2016 IEEE 16th International Conference on Data Mining *
HU Chaoju et al.: "Sentiment analysis based on word vector technology and hybrid neural network" (in Chinese), Application Research of Computers *
Cited By (41)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109871532A (en) * | 2019-01-04 | 2019-06-11 | 平安科技(深圳)有限公司 | Text subject extracting method, device and storage medium |
CN111414483A (en) * | 2019-01-04 | 2020-07-14 | 阿里巴巴集团控股有限公司 | Document processing device and method |
WO2020140633A1 (en) * | 2019-01-04 | 2020-07-09 | 平安科技(深圳)有限公司 | Text topic extraction method, apparatus, electronic device, and storage medium |
CN111414483B (en) * | 2019-01-04 | 2023-03-28 | 阿里巴巴集团控股有限公司 | Document processing device and method |
CN109960802A (en) * | 2019-03-19 | 2019-07-02 | 四川大学 | The information processing method and device of narrative text are reported for aviation safety |
CN109933804A (en) * | 2019-03-27 | 2019-06-25 | 北京信息科技大学 | Merge the keyword abstraction method of subject information and two-way LSTM |
CN110334358A (en) * | 2019-04-28 | 2019-10-15 | 厦门大学 | A kind of phrase table dendrography learning method of context-aware |
CN110083710A (en) * | 2019-04-30 | 2019-08-02 | 北京工业大学 | It is a kind of that generation method is defined based on Recognition with Recurrent Neural Network and the word of latent variable structure |
CN110083710B (en) * | 2019-04-30 | 2021-04-02 | 北京工业大学 | Word definition generation method based on cyclic neural network and latent variable structure |
CN110532395B (en) * | 2019-05-13 | 2021-09-28 | 南京大学 | Semantic embedding-based word vector improvement model establishing method |
CN110532395A (en) * | 2019-05-13 | 2019-12-03 | 南京大学 | A kind of method for building up of the term vector improved model based on semantic embedding |
CN110825848A (en) * | 2019-06-10 | 2020-02-21 | 北京理工大学 | Text classification method based on phrase vectors |
CN110825848B (en) * | 2019-06-10 | 2022-08-09 | 北京理工大学 | Text classification method based on phrase vectors |
CN110263343B (en) * | 2019-06-24 | 2021-06-15 | 北京理工大学 | Phrase vector-based keyword extraction method and system |
CN110263343A (en) * | 2019-06-24 | 2019-09-20 | 北京理工大学 | The keyword abstraction method and system of phrase-based vector |
CN110457674A (en) * | 2019-06-25 | 2019-11-15 | 西安电子科技大学 | A kind of text prediction method of theme guidance |
CN110378409A (en) * | 2019-07-15 | 2019-10-25 | 昆明理工大学 | It is a kind of based on element association attention mechanism the Chinese get over news documents abstraction generating method |
CN110472047B (en) * | 2019-07-15 | 2022-12-13 | 昆明理工大学 | Multi-feature fusion Chinese-Yue news viewpoint sentence extraction method |
CN110472047A (en) * | 2019-07-15 | 2019-11-19 | 昆明理工大学 | A kind of Chinese of multiple features fusion gets over news viewpoint sentence abstracting method |
CN110781256A (en) * | 2019-08-30 | 2020-02-11 | 腾讯大地通途(北京)科技有限公司 | Method and device for determining POI (Point of interest) matched with Wi-Fi (Wireless Fidelity) based on transmitted position data |
CN110781256B (en) * | 2019-08-30 | 2024-02-23 | 腾讯大地通途(北京)科技有限公司 | Method and device for determining POI matched with Wi-Fi based on sending position data |
CN110766073B (en) * | 2019-10-22 | 2023-10-27 | 湖南科技大学 | Mobile application classification method for strengthening topic attention mechanism |
CN110766073A (en) * | 2019-10-22 | 2020-02-07 | 湖南科技大学 | Mobile application classification method for strengthening topic attention mechanism |
CN111125434B (en) * | 2019-11-26 | 2023-06-27 | 北京理工大学 | Relation extraction method and system based on ensemble learning |
CN111125434A (en) * | 2019-11-26 | 2020-05-08 | 北京理工大学 | Relation extraction method and system based on ensemble learning |
WO2021155705A1 (en) * | 2020-02-06 | 2021-08-12 | 支付宝(杭州)信息技术有限公司 | Text prediction model training method and apparatus |
CN111696624A (en) * | 2020-06-08 | 2020-09-22 | 天津大学 | DNA binding protein identification and function annotation deep learning method based on self-attention mechanism |
CN111696624B (en) * | 2020-06-08 | 2022-07-12 | 天津大学 | DNA binding protein identification and function annotation deep learning method based on self-attention mechanism |
CN111753540B (en) * | 2020-06-24 | 2023-04-07 | 云南电网有限责任公司信息中心 | Method and system for collecting text data to perform Natural Language Processing (NLP) |
CN111753540A (en) * | 2020-06-24 | 2020-10-09 | 云南电网有限责任公司信息中心 | Method and system for collecting text data to perform Natural Language Processing (NLP) |
CN112597311B (en) * | 2020-12-28 | 2023-07-11 | 东方红卫星移动通信有限公司 | Terminal information classification method and system based on low-orbit satellite communication |
CN112597311A (en) * | 2020-12-28 | 2021-04-02 | 东方红卫星移动通信有限公司 | Terminal information classification method and system based on low-earth-orbit satellite communication |
CN112632966A (en) * | 2020-12-30 | 2021-04-09 | 绿盟科技集团股份有限公司 | Alarm information marking method, device, medium and equipment |
CN112685538B (en) * | 2020-12-30 | 2022-10-14 | 北京理工大学 | Text vector retrieval method combined with external knowledge |
CN112632966B (en) * | 2020-12-30 | 2023-07-21 | 绿盟科技集团股份有限公司 | Alarm information marking method, device, medium and equipment |
CN112685538A (en) * | 2020-12-30 | 2021-04-20 | 北京理工大学 | Text vector retrieval method combined with external knowledge |
CN112699662B (en) * | 2020-12-31 | 2022-08-16 | 太原理工大学 | False information early detection method based on text structure algorithm |
CN112699662A (en) * | 2020-12-31 | 2021-04-23 | 太原理工大学 | False information early detection method based on text structure algorithm |
CN112966551A (en) * | 2021-01-29 | 2021-06-15 | 湖南科技学院 | Method and device for acquiring video frame description information and electronic equipment |
CN115763167A (en) * | 2022-11-22 | 2023-03-07 | 黄华集团有限公司 | Solid cabinet breaker and control method thereof |
CN115763167B (en) * | 2022-11-22 | 2023-09-22 | 黄华集团有限公司 | Solid cabinet circuit breaker and control method thereof |
Also Published As
Publication number | Publication date |
---|---|
CN108984526B (en) | 2021-05-07 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||