CN108984526A - A method for extracting document topic vectors based on deep learning - Google Patents
A method for extracting document topic vectors based on deep learning
- Publication number: CN108984526A (application number CN201810748564.1A)
- Authority: CN (China)
- Prior art keywords: vector, indicate, moment, context, word
- Prior art date
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F40/30 Semantic analysis (G—Physics; G06F—Electric digital data processing; G06F40/00—Handling natural language data)
- G06F40/258 Heading extraction; Automatic titling; Numbering
- G06F40/279 Recognition of textual entities; G06F40/284 Lexical analysis, e.g. tokenisation or collocates
- G06N3/045 Combinations of networks (G06N3/00—Computing arrangements based on biological models; G06N3/02—Neural networks; G06N3/04—Architecture, e.g. interconnection topology)
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Evolutionary Computation (AREA)
- Data Mining & Analysis (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Biophysics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Machine Translation (AREA)
Abstract
The present invention relates to a method for extracting document topic vectors based on deep learning, and belongs to the field of natural language processing. The method uses a convolutional neural network (CNN) to extract deep, local semantic information and an LSTM model to learn temporal information, so that the semantics of the resulting vector is more comprehensive. It exploits the implicit co-occurrence relation between context phrases and document topics, avoiding the weakness that sentence-based topic vector models exhibit on short texts. The CNN and LSTM models are organically combined through an attention mechanism; the deep semantics, temporal information, and salient information of the context are learned, and a more effective document topic vector extraction model is constructed.
Description
Technical field
The present invention relates to a method for extracting document topic vectors based on deep learning, and belongs to the technical field of natural language processing.
Background technique
In today's era of big data, discovering the topics of massive amounts of Internet text data is a research focus. A document topic vector is essentially a representation of a document's deep semantics, a combination of its topical and semantic content. Extracted document topic vectors can be widely used in natural language processing tasks, including public opinion analysis on social networks and new media, timely detection of hot news, and so on. How to extract document topic vectors efficiently is therefore an important research subject.

For text data, the topic is not necessarily expressed directly in specific word content, which makes mining the implicit topic of a text difficult: the topical meaning of a document must be extracted from the relations among its words, sentences, and paragraphs, and combined with the discourse structure of the document to extract its topic. In recent years, with the development of statistical natural language processing methods and the growing abundance of corpora, text topic modeling methods based on "word-topic" and "document-topic" relations have been proposed in succession. Their basic idea is to assume that the topics of each word and each document obey a statistical probability distribution; by training on corpus data, the probability distribution of document topics is computed, and documents are then clustered according to these topic distributions.

To correctly analyze the topic of each document, the conventional approach is to perform topic analysis on every word of the text. This approach, however, has a serious drawback: the words that actually determine the topic of a text account for only a small fraction of its words, so conventional methods analyze a large number of topic-irrelevant words. On the one hand, these irrelevant words make the computation expensive; on the other hand, the extracted topics can be inaccurate, because the internal association structure of the text is not exploited to mine its deep semantics.

With the improvement of hardware performance and the continuous growth of data, deep learning has been widely applied in many fields, significantly improving experimental results over previous baselines. With its elegant models and flexible architectures, deep learning has in recent years been used extensively in combination with word embedding and document embedding methods. Among deep learning methods, CNN (Convolutional Neural Network) and LSTM (Long Short-Term Memory network) models are the two most mainstream. In natural language processing, text analysis methods based on CNN and LSTM models can discover the latent structural features of text well, and have greatly helped tasks such as semantic analysis, automatic summarization, sentiment analysis, and machine translation.
Summary of the invention
The purpose of the invention is to overcome the deficiencies of existing technologies and to solve the problem of mining the deep semantics of text by exploiting its internal association structure; to this end, a method for extracting document topic vectors based on deep learning is proposed. In modeling document topic vectors, the present invention focuses on analyzing the document's content, mining text features and the implicit correlations of the topic vector, so as to learn document topic vectors.

The core idea of the invention is as follows: the semantics of a context phrase is extracted with a CNN, and the extracted semantics is input to an LSTM model; an attention mechanism extracts the importance of different positions and different words in the text, so that important information is retained. This also completes the combination of the CNN and LSTM models, mines the internal associations within the context, and learns document topic vectors with deep semantics and salient information.
The method of the present invention is achieved through the following technical solution.

A method for extracting document topic vectors based on deep learning, comprising the following steps:

Step 1: establish the related definitions, as follows:

Definition 1: document D, D = [w_1, w_2, ..., w_i, ..., w_n], where w_i denotes the i-th word of document D;

Definition 2: predicted word w_{d+1}, the target word to be learned;

Definition 3: window words, consisting of several consecutively occurring words in the text; hidden internal associations exist among the window words;

Definition 4: context phrase, the window words occurring before the position of the predicted word, with window length l; the context phrase is denoted w_{d-l}, w_{d-l+1}, ..., w_d;

Definition 5: document-topic mapping matrix, learned by the LDA algorithm (Latent Dirichlet Allocation); each row represents the topic distribution of one document;

Definition 6: N_d and doc_id, where N_d denotes the number of documents in the corpus and doc_id denotes the position of a document; each document corresponds to a unique doc_id, with 1 ≤ doc_id ≤ N_d;
Step 2: learn the semantic vector of the context phrase using the CNN.

Step 3: learn the semantics of the context phrase with the LSTM model to obtain the hidden-layer vectors h_{d-l}, h_{d-l+1}, ..., h_d.

Step 4: organically combine the CNN and LSTM models through the attention mechanism, and obtain the average of the context phrase semantic vectors, denoted h̄.

Step 5: by the method of logistic regression, use the average h̄ of the context phrase semantic vectors together with the document topic information to predict the target word w_{d+1}, obtaining the prediction probability of w_{d+1}.
Beneficial effects
Compared with the prior art, the method for extracting document topic vectors based on deep learning of the present invention has the following beneficial effects:
1. A CNN is used to extract deep, local semantic information;
2. An LSTM model is used to learn temporal information, so that the semantics of the vector is more comprehensive;
3. The implicit co-occurrence relation between context phrases and document topics is selected, avoiding the weakness of some sentence-based topic vector models on short texts;
4. The CNN and LSTM models are organically combined through an attention mechanism; the deep semantics, temporal information, and salient information of the context are learned, and a more effective document topic vector extraction model is constructed.
Detailed description of the invention
Fig. 1 is the flow chart of the method for extracting document topic vectors based on deep learning of the present invention.
Specific embodiment
To make the objectives, technical solutions, and advantages of the present invention clearer, the extraction method of the present invention is further described below with reference to the accompanying drawing and an embodiment.

A method for extracting document topic vectors based on deep learning; its basic implementation process is as follows:

Step 1: establish the related definitions, as follows:

Definition 1: document D, D = [w_1, w_2, ..., w_i, ..., w_n], where w_i denotes the i-th word of document D;
Definition 2: predicted word w_{d+1}, the target word to be learned;
Definition 3: window words, consisting of several consecutively occurring words in the text; hidden internal associations exist among the window words;
Definition 4: context phrase (w_{d-l}, w_{d-l+1}, ..., w_d), the window words occurring before the position of the predicted word; the context phrase length is l;
Definition 5: document-topic mapping matrix, learned by the LDA algorithm; each row represents the topic distribution of one document;
Definition 6: N_d and doc_id, where N_d denotes the number of documents in the corpus and doc_id denotes the position of a document; each document corresponds to a unique doc_id, with 1 ≤ doc_id ≤ N_d;
Step 2: learn the semantic vector Context of the context phrase using the CNN.
The specific implementation process is as follows:

Step 2.1: train the word vector matrix of document D using an algorithm such as word2vec; the word vector matrix has size n × m, where n is the length (number of rows) of the matrix and m is its width;

Step 2.2: extract from the word vector matrix of step 2.1 the word vector corresponding to each word in the context phrase, obtaining the vector matrix M of the context phrase w_{d-l}, w_{d-l+1}, ..., w_d;

Step 2.3: compute the semantic vector Context of the context phrase using the CNN, specifically by convolving the vector matrix M obtained in step 2.2 with K convolution kernels of size C_l × C_m;

where K is the number of convolution kernels (K = 128 in this embodiment), C_l is the length of a kernel with C_l = l, and C_m is the width of a kernel with C_m = m.

The semantic vector Context of the context phrase is computed by formula (1):

Context_k = Σ_{p=1}^{C_l} Σ_{q=1}^{C_m} c_pq · M_pq + b,  1 ≤ k ≤ K
Context = [Context_1, Context_2, ..., Context_K]  (1)

where Context_k is the k-th dimension of the semantic vector of the context phrase, l is the context phrase length, m is the width of the word vector matrix (i.e., the word vector dimension), d is the starting position of the first word in the context phrase, c_pq is the weight parameter in row p and column q of the k-th convolution kernel, M_pq is the entry in row p and column q of the vector matrix M, and b is the bias parameter of the convolution kernel;
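Step 2.3 can be sketched with numpy as follows. Since the symbol definitions of formula (1) list only the kernel weights c_pq, the matrix entries M_pq, and the bias b, the sketch assumes a plain affine response per kernel (no activation); all weights and the toy matrix M are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

l, m, K = 4, 8, 128          # context length, word-vector dim, number of kernels
M = rng.normal(size=(l, m))  # vector matrix of the context phrase (step 2.2)

# K kernels of size C_l x C_m = l x m: each kernel fully overlaps M exactly once,
# so each kernel yields one scalar Context_k as in formula (1).
kernels = rng.normal(size=(K, l, m))
bias = rng.normal(size=K)

Context = np.array([(c * M).sum() + b for c, b in zip(kernels, bias)])
assert Context.shape == (K,)   # one K-dimensional semantic vector per context phrase
```

Because each kernel spans the whole l × m matrix, the "convolution" degenerates to a single inner product per kernel, which is consistent with Context having exactly K dimensions.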
Step 3: learn the semantics of the context phrase with the LSTM model to obtain the hidden-layer vectors h_{d-l}, h_{d-l+1}, ..., h_d.
The specific implementation process is as follows:

Step 3.1: assign d - l to t, i.e., t = d - l; t denotes the time step;

Step 3.2: assign the word vector of w_t to x_t, where x_t is the word vector input at time t and w_t is the word input at time t;

here the word vector of w_t is obtained by lookup in the word vector matrix output by step 2.1, i.e., by extracting the word vector at the position corresponding to w_t in the vector matrix M;

Step 3.3: take x_t as the input of the LSTM model and obtain the hidden-layer vector h_t of time t;
The specific implementation process is as follows:

Step 3.3.1: compute the forget gate f_t of time t, which controls what information is forgotten, by formula (2):

f_t = σ(W_f x_t + U_f h_{t-1} + b_f)  (2)

where W_f and U_f are parameter matrices, x_t is the word vector input at time t, h_{t-1} is the hidden-layer vector of time t-1, and b_f is a bias vector parameter; when t = d - l, h_{t-1} = h_{d-l-1}, and h_{d-l-1} is the zero vector; σ denotes the Sigmoid function, the activation function of the LSTM model;

Step 3.3.2: compute the input gate i_t of time t, which controls what new information is added at the current time, by formula (3):

i_t = σ(W_i x_t + U_i h_{t-1} + b_i)  (3)

where W_i and U_i are parameter matrices, x_t is the word vector input at time t, h_{t-1} is the hidden-layer vector of time t-1, and b_i is a bias vector parameter; σ denotes the Sigmoid function, the activation function of the LSTM model;

Step 3.3.3: compute the candidate information c̃_t updated at time t, by formula (4):

c̃_t = tanh(W_c x_t + U_c h_{t-1} + b_c)  (4)

where W_c and U_c are parameter matrices, x_t is the word vector input at time t, h_{t-1} is the hidden-layer vector of time t-1, and b_c is a bias vector parameter; tanh denotes the hyperbolic tangent function, an activation function of the LSTM model;

Step 3.3.4: compute the information of time t by adding the gated information of the previous time to the information updated at the current time, by formula (5):

c_t = f_t ∘ c_{t-1} + i_t ∘ c̃_t  (5)

where c_t is the information at time t, f_t is the forget gate at time t, c_{t-1} is the information at time t-1, i_t is the input gate at time t, c̃_t is the information updated at time t, and ∘ denotes the element-wise product of vectors;

Step 3.3.5: compute the output gate o_t of time t, which controls the output information, by formula (6):

o_t = σ(W_o x_t + U_o h_{t-1} + b_o)  (6)

where W_o and U_o are parameter matrices, x_t is the word vector input at time t, h_{t-1} is the hidden-layer vector of time t-1, and b_o is a bias vector parameter; σ denotes the Sigmoid function, the activation function of the LSTM model; the parameter matrices W_f, U_f, W_i, U_i, W_c, U_c, W_o, U_o in steps 3.3.1 through 3.3.5 differ in their element values, as do the bias vector parameters b_f, b_i, b_c, b_o;

Step 3.3.6: compute the hidden-layer vector h_t of time t by formula (7):

h_t = o_t ∘ c_t  (7)

where o_t is the output gate at time t and c_t is the information at time t;

Step 3.4: judge whether t equals d; if not, add 1 to t and jump back to step 3.2; if so, output the hidden-layer vectors h_{d-l}, h_{d-l+1}, ..., h_d and go to step 4;
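One LSTM step as formulas (2)-(7) define it can be sketched in numpy. Note that formula (7) of this patent computes h_t = o_t ∘ c_t, without the tanh of the textbook LSTM, and the sketch follows that; all weights, sizes, and inputs are random placeholders:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, P):
    """One step of the patent's LSTM, formulas (2)-(7)."""
    f = sigmoid(P["Wf"] @ x_t + P["Uf"] @ h_prev + P["bf"])        # forget gate (2)
    i = sigmoid(P["Wi"] @ x_t + P["Ui"] @ h_prev + P["bi"])        # input gate  (3)
    c_tilde = np.tanh(P["Wc"] @ x_t + P["Uc"] @ h_prev + P["bc"])  # candidate   (4)
    c = f * c_prev + i * c_tilde                                   # cell state  (5)
    o = sigmoid(P["Wo"] @ x_t + P["Uo"] @ h_prev + P["bo"])        # output gate (6)
    h = o * c                                                      # hidden (7): no tanh
    return h, c

rng = np.random.default_rng(1)
m, H = 8, 16   # word-vector dim, hidden size
P = {}
for g in "fico":   # gates f, i, c (candidate), o
    P["W" + g] = rng.normal(size=(H, m)) * 0.1
    P["U" + g] = rng.normal(size=(H, H)) * 0.1
    P["b" + g] = np.zeros(H)

# run over a context phrase of 4 word vectors; h_{d-l-1} is the zero vector
xs = rng.normal(size=(4, m))
h, c = np.zeros(H), np.zeros(H)
hidden = []
for x in xs:
    h, c = lstm_step(x, h, c, P)
    hidden.append(h)
assert len(hidden) == 4 and hidden[-1].shape == (H,)
```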
Step 4: combine the CNN and LSTM models through the attention mechanism, and obtain the average h̄ of the context phrase semantic vectors. The specific implementation process is as follows:

Step 4.1: using the context phrase semantic vector obtained in step 2, obtain through the attention mechanism the importance factor α of each word on the semantic vector of the context phrase, computed by formula (8):

α_t = exp(Context^T x_t) / Σ_{i=d-l}^{d} exp(Context^T x_i),  d-l ≤ t ≤ d
α = [α_{d-l}, α_{d-l+1}, ..., α_d]  (8)

where α_t is the importance factor of the word at time t on the semantic vector of the context phrase, Context is the semantic vector of the context phrase obtained in step 2, x_t is the word vector input at time t, x_i is the word vector input at time i, T denotes vector transposition, and exp denotes the exponential function with base e, the natural constant;

Step 4.2: compute the hidden-layer vectors h' weighted by the attention mechanism, by formula (9):

h'_t = α_t * h_t,  d-l ≤ t ≤ d
h' = [h'_{d-l}, h'_{d-l+1}, ..., h'_d]  (9)

where h'_t is the weighted hidden-layer vector at time t, α_t is the importance factor of the word at time t on the semantic vector of the context phrase, and h_t is the hidden-layer vector at time t;

Step 4.3: apply the mean-pooling operation to compute the average h̄ of the context phrase semantic vectors, by formula (10):

h̄ = (1 / (l+1)) Σ_{t=d-l}^{d} h'_t  (10)

where h'_t is the weighted hidden-layer vector at time t;
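Steps 4.1 through 4.3 can be sketched as follows, assuming the softmax-of-inner-products form implied by the symbols of formula (8) (Context^T x_t scores). The shared dimension 128 matches the embodiment, where both the word vectors and the kernel count K are 128; Context, the word vectors, and the hidden vectors are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(2)
n_pos, dim, H = 5, 128, 128   # positions in the phrase, shared dim (m = K = 128), hidden size

Context = rng.normal(size=dim)      # phrase semantic vector from step 2
X = rng.normal(size=(n_pos, dim))   # word vectors x_t of the phrase
Hs = rng.normal(size=(n_pos, H))    # LSTM hidden vectors h_t from step 3

scores = X @ Context                     # Context^T x_t for each position
alpha = np.exp(scores - scores.max())    # softmax of formula (8), shifted for stability
alpha /= alpha.sum()

H_weighted = alpha[:, None] * Hs         # h'_t = alpha_t * h_t, formula (9)
h_bar = H_weighted.mean(axis=0)          # mean-pooling, formula (10)

assert abs(alpha.sum() - 1.0) < 1e-9 and h_bar.shape == (H,)
```

The max-shift inside the softmax is a standard numerical safeguard and does not change the value of α.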
Step 5: by the method of logistic regression, predict the target word w_{d+1} using the average h̄ of the context phrase semantic vectors and the document topic information, obtaining the prediction probability of w_{d+1}. The specific implementation process is as follows:

Step 5.1: learn the document-topic mapping matrix using the LDA algorithm, then according to the document-topic mapping matrix and doc_id map each document to a vector D_z whose length equals the width of the word vector matrix in step 2.1;

Step 5.2: concatenate the vector D_z output by step 5.1 with the average h̄ of the context phrase semantic vectors output by step 4, obtaining the concatenated vector V_d;

Step 5.3: use the V_d output by step 5.2 to predict the target word w_{d+1}, specifically by classifying with the logistic regression method, with the objective function of formula (11):

P(y = w_{d+1} | V_d) = exp(θ_{d+1}^T V_d) / Σ_{i=1}^{|V|} exp(θ_i^T V_d)  (11)

where θ_{d+1} is the parameter corresponding to the position of the target word w_{d+1}, θ_i is the parameter corresponding to word w_i in the vocabulary, |V| is the size of the vocabulary, V_d is the concatenated vector obtained in step 5.2, exp denotes the exponential function with base e, Σ denotes summation, P denotes probability, y denotes the dependent variable, and T denotes matrix transposition.

Step 5.4: using the method of cross entropy, compute the loss function of objective function (11) by formula (12):

L = -log(P(y = w_{d+1} | V_d))  (12)

where w_{d+1} is the target word, V_d is the concatenated vector of step 5.2, and log(·) denotes the base-10 logarithm;

the loss function (12) is updated and solved by the Sampled Softmax algorithm and the mini-batch stochastic gradient descent parameter update method, yielding the document topic vector.
Thus, through steps 1 to 5, a document topic vector with deep semantics and salient information is obtained.
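Steps 5.2 through 5.4 can be sketched in numpy. The sketch uses a full softmax rather than Sampled Softmax, natural log rather than base-10 (as is usual for cross entropy), and random placeholders for θ, h̄, D_z, and the target index:

```python
import numpy as np

rng = np.random.default_rng(3)
H, Z, V = 128, 128, 1000           # dims of h_bar and D_z, vocabulary size

h_bar = rng.normal(size=H)         # attention-pooled vector, step 4 output
D_z = rng.normal(size=Z)           # LDA document vector, step 5.1
V_d = np.concatenate([h_bar, D_z]) # splice, step 5.2

theta = rng.normal(size=(V, H + Z)) * 0.01  # one parameter vector per vocab word

logits = theta @ V_d
probs = np.exp(logits - logits.max())
probs /= probs.sum()               # formula (11): softmax over the vocabulary

target = 42                        # made-up index of w_{d+1} in the vocabulary
loss = -np.log(probs[target])      # formula (12): cross-entropy loss
assert probs.shape == (V,) and loss > 0.0
```

Minimizing this loss over mini-batches is what the gradient-descent update of step 5.4 does; Sampled Softmax merely approximates the denominator with a sampled subset of the vocabulary.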
Embodiment
The present embodiment describes a specific implementation process of the invention, as shown in Fig. 1.

As can be seen from Fig. 1, the flow of the method for extracting document topic vectors based on deep learning of the present invention is as follows:

Step A: preprocessing. First, meaningless symbols in the corpus, such as special characters, are removed, and the text is then segmented into words. Word segmentation is the process of dividing a continuous word sequence into individual words according to set lexical rules, thereby decomposing sentences into several consecutive meaningful word strings for subsequent analysis. The segmentation is performed with the PTB tokenizer. After segmentation, a vocabulary is built from the original text; in this embodiment the vocabulary consists of the 20000 most frequent words of the training text, i.e., the size of the vocabulary V is 20000. After the vocabulary is selected, the vocabulary index data of the original corpus is constructed according to the vocabulary indices, and this vocabulary index data of the text serves as the input of the model.
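Step A can be sketched with the standard library as follows; a whitespace split stands in for the PTB tokenizer, and the toy corpus and vocabulary cap are made up:

```python
from collections import Counter

corpus = ["the cat sat on the mat", "the dog sat on the log"]
max_vocab = 5                       # stands in for the 20000 of the embodiment

tokens = [sent.split() for sent in corpus]   # toy stand-in for PTB tokenization
freq = Counter(w for sent in tokens for w in sent)

# keep the max_vocab most frequent words, then index the corpus by vocabulary position
vocab = [w for w, _ in freq.most_common(max_vocab)]
index = {w: i for i, w in enumerate(vocab)}
indexed = [[index[w] for w in sent if w in index] for sent in tokens]

assert len(vocab) == max_vocab
assert indexed[0][0] == index["the"]   # "the" is the most frequent word
```

Out-of-vocabulary words are simply dropped here; a real pipeline might map them to a dedicated unknown-word index instead.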
Step B: learn word vectors using the word2vec algorithm. The words of the documents are input to the word2vec algorithm to obtain word vectors, with the objective function of formula (13), the standard word2vec (skip-gram) log-likelihood:

L = Σ_{i=1}^{Corp} Σ_{-k ≤ j ≤ k, j ≠ 0} log P(w_{i+j} | w_i)  (13)

where k is the window size, i indexes the current word, and Corp is the number of words in the corpus; 128-dimensional word vectors are learned by the gradient descent method;
Step C: extract the context phrase semantic vector using the CNN, and learn the context phrase hidden-layer vectors using the LSTM (an RNN); the two computations run in parallel. Specifically, in the present embodiment:

Extracting the context phrase semantic vector with the CNN: first, K convolution kernels of size C_l × C_m are randomly initialized from a Gaussian distribution. For a given context phrase w_{d-l}, w_{d-l+1}, ..., w_d, the word vectors obtained in step B map the phrase to a matrix of size l × m, where l is the length of the context phrase and m is the dimension of the word vectors. This matrix is convolved with the randomly initialized kernels as in formula (1), yielding a vector Context, which is the semantic vector of the context phrase.

Learning the context phrase hidden-layer vectors with the LSTM: the word vectors corresponding to the context phrase w_{d-l}, w_{d-l+1}, ..., w_d are input to the LSTM model in sequence; each dimension of the initial hidden-layer vector h_0 is set to 0, and the forget gate, input gate, output gate, and finally the context phrase hidden-layer vectors are computed in turn by formulas (2)-(7); the hidden dimension is set to 128.
Step D: compute the attention-weighted semantic vectors using the attention mechanism, and compute the document topic distribution; the two computations run in parallel. Specifically, in the present embodiment:

Computing the weighted semantic vectors with the attention mechanism: from the word vectors of step B and the context phrase semantic vector of step C, the attention operation yields an attention factor α_t for each word of the context phrase. α_t is a real number between 0 and 1; the larger it is, the more of the word vector information at the corresponding position is retained in the final mean-pooling layer, so its magnitude indicates the importance of the current word in characterizing the meaning of the whole phrase; that is, more important words receive more attention.

Computing the document topic distribution: this is done with the LDA algorithm. Document D is input to the LDA algorithm, and the topic distribution of each document D is obtained; this topic distribution is taken directly as the final result and denoted D_z.
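The document-topic mapping of Definition 5 can be illustrated with a toy collapsed Gibbs sampler for LDA; the corpus, hyperparameters, and iteration count below are made up, and a real embodiment would use an LDA library rather than this sketch:

```python
import numpy as np

rng = np.random.default_rng(4)
docs = [[0, 1, 0, 2], [3, 4, 3, 4], [0, 2, 1, 0]]  # word ids per document
V, T, alpha, beta = 5, 2, 0.1, 0.01                # vocab size, topics, priors

# count tables for the collapsed Gibbs sampler
ndt = np.zeros((len(docs), T))   # document-topic counts
ntw = np.zeros((T, V))           # topic-word counts
nt = np.zeros(T)                 # topic totals
z = [[rng.integers(T) for _ in d] for d in docs]   # random initial assignments
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        t = z[d][i]
        ndt[d, t] += 1; ntw[t, w] += 1; nt[t] += 1

for _ in range(200):             # Gibbs sweeps
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]          # remove the current assignment, then resample
            ndt[d, t] -= 1; ntw[t, w] -= 1; nt[t] -= 1
            p = (ndt[d] + alpha) * (ntw[:, w] + beta) / (nt + V * beta)
            t = rng.choice(T, p=p / p.sum())
            z[d][i] = t
            ndt[d, t] += 1; ntw[t, w] += 1; nt[t] += 1

# document-topic mapping matrix: one topic distribution (a D_z) per row
theta = (ndt + alpha) / (ndt + alpha).sum(axis=1, keepdims=True)
assert theta.shape == (len(docs), T) and np.allclose(theta.sum(axis=1), 1.0)
```

Each row of `theta` plays the role of one row of the document-topic mapping matrix: the D_z vector looked up by doc_id in step 5.1.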
Step E: predict the target word and learn the document topic vector. The weighted semantic vector and D_z are concatenated directly; by maximizing the probability of the target word appearing, the document topic vector is obtained through the Sampled Softmax algorithm and the mini-batch stochastic gradient descent parameter update method.
Claims (5)
1. A method for extracting document topic vectors based on deep learning, characterized by comprising the following steps:

Step 1: establish the related definitions, as follows:

Definition 1: document D, D = [w_1, w_2, ..., w_i, ..., w_n], where w_i denotes the i-th word of document D;
Definition 2: predicted word w_{d+1}, the target word to be learned;
Definition 3: window words, consisting of consecutively occurring words in the text; hidden internal associations exist among the window words;
Definition 4: context phrase w_{d-l}, w_{d-l+1}, ..., w_d, the window words occurring before the position of the predicted word; the context phrase length is l;
Definition 5: document-topic mapping matrix, learned by the LDA algorithm; each row represents the topic distribution of one document;
Definition 6: N_d and doc_id, where N_d denotes the number of documents in the corpus and doc_id denotes the position of a document; each document corresponds to a unique doc_id, with 1 ≤ doc_id ≤ N_d;

Step 2: learn the semantic vector of the context phrase using a convolutional neural network (CNN);

Step 3: learn the semantics of the context phrase with a long short-term memory network (LSTM) model to obtain the hidden-layer vectors h_{d-l}, h_{d-l+1}, ..., h_d; the specific implementation process is as follows:

Step 3.1: assign d - l to t, i.e., t = d - l; t denotes the time step;
Step 3.2: assign the word vector of w_t to x_t, where x_t is the word vector input at time t and w_t is the word input at time t; the word vector of w_t is obtained by lookup in the word vector matrix output by step 2.1, i.e., by extracting the word vector at the position corresponding to w_t in the vector matrix M;
Step 3.3: take x_t as the input of the LSTM model and obtain the hidden-layer vector h_t of time t;
Step 3.4: judge whether t equals d; if not, add 1 to t and jump back to step 3.2; if so, output the hidden-layer vectors h_{d-l}, h_{d-l+1}, ..., h_d and go to step 4;

Step 4: organically combine the CNN and LSTM models through an attention mechanism, and obtain the average h̄ of the context phrase semantic vectors; the specific implementation method is as follows:

Step 4.1: using the context phrase semantic vector obtained in step 2, obtain through the attention mechanism the importance factor α of each word on the semantic vector of the context phrase, computed by the following formula:

α_t = exp(Context^T x_t) / Σ_{i=d-l}^{d} exp(Context^T x_i),  d-l ≤ t ≤ d
α = [α_{d-l}, α_{d-l+1}, ..., α_d]

where α_t is the importance factor of the word at time t on the semantic vector of the context phrase, Context is the semantic vector of the context phrase obtained in step 2, x_t is the word vector input at time t, x_i is the word vector input at time i, T denotes vector transposition, and exp denotes the exponential function with base e, the natural constant;

Step 4.2: compute the hidden-layer vectors h' weighted by the attention mechanism, by the following formula:

h'_t = α_t * h_t,  d-l ≤ t ≤ d
h' = [h'_{d-l}, h'_{d-l+1}, ..., h'_d]

where h'_t is the weighted hidden-layer vector at time t, α_t is the importance factor of the word at time t on the semantic vector of the context phrase, and h_t is the hidden-layer vector at time t;

Step 4.3: apply the mean-pooling operation to compute the average h̄ of the context phrase semantic vectors, by formula (10):

h̄ = (1 / (l+1)) Σ_{t=d-l}^{d} h'_t  (10)

where h'_t is the weighted hidden-layer vector at time t;

Step 5: by the method of logistic regression, predict the target word w_{d+1} using the average h̄ of the context phrase semantic vectors and the document topic information, obtaining the prediction probability of w_{d+1}.
2. The method for extracting document topic vectors based on deep learning according to claim 1, characterized in that the specific implementation method of step 2 is as follows:

Step 2.1: train the word vector matrix of document D; the word vector matrix has size n × m, where n is the length of the matrix and m is its width;
Step 2.2: extract from the word vector matrix of step 2.1 the word vector corresponding to each word in the context phrase, obtaining the vector matrix M of the context phrase w_{d-l}, w_{d-l+1}, ..., w_d;
Step 2.3: compute the semantic vector Context of the context phrase using the CNN, specifically by convolving the vector matrix M obtained in step 2.2 with K convolution kernels of size C_l × C_m;

where K is the number of convolution kernels, C_l is the length of a kernel with C_l = l, and C_m is the width of a kernel with C_m = m;

the semantic vector Context of the context phrase is computed by formula (1):

Context_k = Σ_{p=1}^{C_l} Σ_{q=1}^{C_m} c_pq · M_pq + b,  1 ≤ k ≤ K
Context = [Context_1, Context_2, ..., Context_K]  (1)

where Context_k is the k-th dimension of the semantic vector of the context phrase, l is the context phrase length, m is the width of the word vector matrix, i.e., the word vector dimension, d is the starting position of the first word in the context phrase, c_pq is the weight parameter in row p and column q of the k-th convolution kernel, M_pq is the entry in row p and column q of the vector matrix M, and b is the bias parameter of the convolution kernel.
3. a kind of document subject matter vector abstracting method based on deep learning as described in claim 1, which is characterized in that described
The concrete methods of realizing of step 3.3 is as follows:
The forgetting door f of step 3.3.1 calculating t momentt, information is forgotten for controlling, and is calculated by formula (2);
ft=σ (Wfxt+Ufht-1+bf) (2)
Wherein, WfExpression parameter matrix, xtIndicate the term vector of t moment input, UfExpression parameter matrix, ht-1When indicating t-1
The hidden layer vector at quarter, bfIndicate bias vector parameter, as t=d-l, ht-1=hd-l-1, and hd-l-1For null vector, σ is indicated
Sigmoid function is the activation primitive of LSTM model;
The input gate i of step 3.3.2 calculating t momentt, new information to be added is needed for controlling current time, is passed through formula (3)
It calculates;
it=σ (Wixt+Uiht-1+bi) (3)
Wherein, WiExpression parameter matrix, xtIndicate the term vector of t moment input, UiExpression parameter matrix, ht-1When indicating t-1
The hidden layer vector at quarter, biIndicate bias vector parameter, it is the activation primitive of LSTM model that σ, which indicates Sigmoid function,;
Step 3.3.3 calculates the information that t moment updatesIt is calculated by formula (4);
Wherein,Expression parameter matrix, xtIndicate the term vector of t moment input,Expression parameter matrix, ht-1Indicate t-1
The hidden layer vector at moment,Indicate bias vector parameter, it is the activation letter of LSTM model that tanh, which indicates hyperbolic tangent function,
Number;
Step 3.3.4 calculates the information of t moment, and the information of last moment is added to obtain with the information that current time updates, and leads to
Cross formula (5) calculating;
Wherein, ctIndicate the information of t moment, ftIndicate that t moment forgets door, ct-1Indicate the information at t-1 moment, itIndicate t moment
Input gate,Indicate the information that t moment updates,Indicate the multiplication cross of vector;
Step 3.3.5: compute the output gate o_t of moment t, which controls the output of information, via formula (6):
o_t = σ(W_o x_t + U_o h_{t−1} + b_o) (6)
where W_o and U_o denote parameter matrices, x_t denotes the word vector input at moment t, h_{t−1} denotes the hidden-layer vector at moment t−1, b_o denotes the bias vector parameter, and σ denotes the Sigmoid function, which is the activation function of the LSTM model. The parameter matrices W_f, U_f, W_i, U_i, W_c, U_c, W_o, U_o differ in their element values, and the bias vector parameters b_f, b_i, b_c, b_o likewise differ in their element values;
Step 3.3.6: compute the hidden-layer vector h_t of moment t via formula (7):
h_t = o_t ⊗ tanh(c_t) (7)
where o_t denotes the output gate of moment t and c_t denotes the information of moment t.
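The gate computations of steps 3.3.1–3.3.6 can be sketched as a single LSTM step in NumPy. This is a minimal illustration of formulas (2)–(7), not the patent's implementation; the dimensions (hidden size 4, word-vector size 3), the parameter-dictionary keys, and the random initialization are all assumptions for demonstration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step following formulas (2)-(7); p holds the parameter matrices/vectors."""
    f_t = sigmoid(p["Wf"] @ x_t + p["Uf"] @ h_prev + p["bf"])      # forget gate, formula (2)
    i_t = sigmoid(p["Wi"] @ x_t + p["Ui"] @ h_prev + p["bi"])      # input gate, formula (3)
    c_tilde = np.tanh(p["Wc"] @ x_t + p["Uc"] @ h_prev + p["bc"])  # updated information, formula (4)
    c_t = f_t * c_prev + i_t * c_tilde                             # cell information, formula (5)
    o_t = sigmoid(p["Wo"] @ x_t + p["Uo"] @ h_prev + p["bo"])      # output gate, formula (6)
    h_t = o_t * np.tanh(c_t)                                       # hidden-layer vector, formula (7)
    return h_t, c_t

# Illustrative usage: hidden size 4, input (word-vector) size 3; h and c start as zero vectors,
# matching the h_{d-l-1} = 0 convention stated for the first moment.
rng = np.random.default_rng(0)
p = {k: rng.standard_normal((4, 3)) for k in ("Wf", "Wi", "Wc", "Wo")}
p.update({k: rng.standard_normal((4, 4)) for k in ("Uf", "Ui", "Uc", "Uo")})
p.update({k: rng.standard_normal(4) for k in ("bf", "bi", "bc", "bo")})
h, c = np.zeros(4), np.zeros(4)
h, c = lstm_step(rng.standard_normal(3), h, c, p)
```

Because the output gate is a sigmoid and tanh is bounded, every component of h_t stays strictly inside (−1, 1), regardless of the parameter values.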
4. A deep-learning-based document topic vector extraction method as described in claim 1, characterized in that the concrete implementation of said step 5 is as follows:
Step 5.1: learn the document-topic mapping matrix, then, according to the document-topic mapping matrix and doc_id, map each document into a vector D_z whose length equals the width of the word-vector matrix in step 2.1;
Step 5.2: splice the vector D_z output by step 5.1 together with the average value of the context word semantic vectors output by step 4, obtaining the splicing vector V_d;
Step 5.3: use the V_d output by step 5.2 to predict the target word w_{d+1};
Step 5.4: using the cross-entropy method, compute the loss function of objective function (11) via formula (12):
L = −log(P(y = w_{d+1} | V_d)) (12)
where w_{d+1} denotes the target word, V_d is the splicing vector of step 5.2, and log(·) denotes the base-10 logarithm function;
update and solve loss function (12) by the Sampled Softmax algorithm and the mini-batch stochastic gradient descent parameter-update method, obtaining the document topic vector.
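Steps 5.1–5.2 can be sketched as follows. All sizes here are hypothetical (100 documents, word-vector width 8, 5 context words), and the random matrices stand in for the learned document-topic mapping matrix and the step-4 context semantic vectors.

```python
import numpy as np

k = 8                                              # assumed word-vector width from step 2.1
rng = np.random.default_rng(1)
doc_topic_matrix = rng.standard_normal((100, k))   # learned document-topic mapping matrix (100 docs assumed)
doc_id = 7
D_z = doc_topic_matrix[doc_id]                     # step 5.1: map document doc_id to its vector D_z

context_vecs = rng.standard_normal((5, k))         # stand-in for step 4's context word semantic vectors
mean_ctx = context_vecs.mean(axis=0)               # average of the context word semantic vectors

V_d = np.concatenate([D_z, mean_ctx])              # step 5.2: splice into V_d, length 2k
```

The splicing doubles the dimension: V_d carries both the document-level topic signal (D_z) and the local context signal (the averaged semantic vectors), which is what step 5.3 then feeds into the target-word prediction.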
5. A deep-learning-based document topic vector extraction method as claimed in claim 4, characterized in that in said step 5.3, classification is performed by the logistic regression method, with the objective function given by formula (11):
P(y = w_{d+1} | V_d) = exp(θ_{d+1}^T V_d) / Σ_{i=1}^{|V|} exp(θ_i^T V_d) (11)
where θ_{d+1} is the parameter corresponding to the position of the target word w_{d+1}, θ_i is the parameter corresponding to word w_i in the vocabulary, |V| denotes the size of the vocabulary, V_d is the splicing vector obtained in step 5.2, exp denotes the exponential function with base e, Σ denotes summation, P denotes probability, y denotes the dependent variable, and T denotes matrix transposition.
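The softmax objective (11) and cross-entropy loss (12) can be checked numerically with a small sketch. The vocabulary size, dimension, and parameters below are illustrative; the patent replaces the full softmax with the Sampled Softmax approximation during training, which this sketch does not do.

```python
import numpy as np

def predict_prob(theta, V_d, target_idx):
    """Formula (11): softmax over the vocabulary using logistic-regression parameters theta."""
    logits = theta @ V_d                  # theta_i^T V_d for every word i in the vocabulary
    logits -= logits.max()                # shift for numerical stability; leaves the softmax unchanged
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs[target_idx]

rng = np.random.default_rng(2)
vocab_size, dim = 50, 16                  # illustrative |V| and splicing-vector length
theta = rng.standard_normal((vocab_size, dim))   # one parameter vector per vocabulary word
V_d = rng.standard_normal(dim)                   # stand-in for the step-5.2 splicing vector
p = predict_prob(theta, V_d, target_idx=3)
loss = -np.log10(p)                       # formula (12), with the base-10 log as stated in claim 4
```

Since the softmax probability of any single word is strictly between 0 and 1, the loss is always positive, and minimizing it pushes θ_{d+1}^T V_d above the scores of the other vocabulary words.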
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810748564.1A CN108984526B (en) | 2018-07-10 | 2018-07-10 | Document theme vector extraction method based on deep learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108984526A true CN108984526A (en) | 2018-12-11 |
CN108984526B CN108984526B (en) | 2021-05-07 |
Family
ID=64536620
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810748564.1A Active CN108984526B (en) | 2018-07-10 | 2018-07-10 | Document theme vector extraction method based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108984526B (en) |
Cited By (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109871532A (en) * | 2019-01-04 | 2019-06-11 | 平安科技(深圳)有限公司 | Text subject extracting method, device and storage medium |
CN109933804A (en) * | 2019-03-27 | 2019-06-25 | 北京信息科技大学 | Merge the keyword abstraction method of subject information and two-way LSTM |
CN109960802A (en) * | 2019-03-19 | 2019-07-02 | 四川大学 | The information processing method and device of narrative text are reported for aviation safety |
CN110083710A (en) * | 2019-04-30 | 2019-08-02 | 北京工业大学 | It is a kind of that generation method is defined based on Recognition with Recurrent Neural Network and the word of latent variable structure |
CN110263343A (en) * | 2019-06-24 | 2019-09-20 | 北京理工大学 | The keyword abstraction method and system of phrase-based vector |
CN110334358A (en) * | 2019-04-28 | 2019-10-15 | 厦门大学 | A kind of phrase table dendrography learning method of context-aware |
CN110378409A (en) * | 2019-07-15 | 2019-10-25 | 昆明理工大学 | It is a kind of based on element association attention mechanism the Chinese get over news documents abstraction generating method |
CN110457674A (en) * | 2019-06-25 | 2019-11-15 | 西安电子科技大学 | A kind of text prediction method of theme guidance |
CN110472047A (en) * | 2019-07-15 | 2019-11-19 | 昆明理工大学 | A kind of Chinese of multiple features fusion gets over news viewpoint sentence abstracting method |
CN110532395A (en) * | 2019-05-13 | 2019-12-03 | 南京大学 | A kind of method for building up of the term vector improved model based on semantic embedding |
CN110766073A (en) * | 2019-10-22 | 2020-02-07 | 湖南科技大学 | Mobile application classification method for strengthening topic attention mechanism |
CN110781256A (en) * | 2019-08-30 | 2020-02-11 | 腾讯大地通途(北京)科技有限公司 | Method and device for determining POI (Point of interest) matched with Wi-Fi (Wireless Fidelity) based on transmitted position data |
CN110825848A (en) * | 2019-06-10 | 2020-02-21 | 北京理工大学 | Text classification method based on phrase vectors |
CN111125434A (en) * | 2019-11-26 | 2020-05-08 | 北京理工大学 | Relation extraction method and system based on ensemble learning |
CN111414483A (en) * | 2019-01-04 | 2020-07-14 | 阿里巴巴集团控股有限公司 | Document processing device and method |
CN111696624A (en) * | 2020-06-08 | 2020-09-22 | 天津大学 | DNA binding protein identification and function annotation deep learning method based on self-attention mechanism |
CN111753540A (en) * | 2020-06-24 | 2020-10-09 | 云南电网有限责任公司信息中心 | Method and system for collecting text data to perform Natural Language Processing (NLP) |
CN112597311A (en) * | 2020-12-28 | 2021-04-02 | 东方红卫星移动通信有限公司 | Terminal information classification method and system based on low-earth-orbit satellite communication |
CN112632966A (en) * | 2020-12-30 | 2021-04-09 | 绿盟科技集团股份有限公司 | Alarm information marking method, device, medium and equipment |
CN112685538A (en) * | 2020-12-30 | 2021-04-20 | 北京理工大学 | Text vector retrieval method combined with external knowledge |
CN112699662A (en) * | 2020-12-31 | 2021-04-23 | 太原理工大学 | False information early detection method based on text structure algorithm |
CN112966551A (en) * | 2021-01-29 | 2021-06-15 | 湖南科技学院 | Method and device for acquiring video frame description information and electronic equipment |
WO2021155705A1 (en) * | 2020-02-06 | 2021-08-12 | 支付宝(杭州)信息技术有限公司 | Text prediction model training method and apparatus |
CN115763167A (en) * | 2022-11-22 | 2023-03-07 | 黄华集团有限公司 | Solid cabinet breaker and control method thereof |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105069143A (en) * | 2015-08-19 | 2015-11-18 | 百度在线网络技术(北京)有限公司 | Method and device for extracting keywords from document |
CN106547735A (en) * | 2016-10-25 | 2017-03-29 | 复旦大学 | The structure and using method of the dynamic word or word vector based on the context-aware of deep learning |
CN106909537A (en) * | 2017-02-07 | 2017-06-30 | 中山大学 | A kind of polysemy analysis method based on topic model and vector space |
CN106919557A (en) * | 2017-02-22 | 2017-07-04 | 中山大学 | A kind of document vector generation method of combination topic model |
CN107092596A (en) * | 2017-04-24 | 2017-08-25 | 重庆邮电大学 | Text emotion analysis method based on attention CNNs and CCR |
CN107423282A (en) * | 2017-05-24 | 2017-12-01 | 南京大学 | Semantic Coherence Sexual Themes and the concurrent extracting method of term vector in text based on composite character |
CN107562792A (en) * | 2017-07-31 | 2018-01-09 | 同济大学 | A kind of question and answer matching process based on deep learning |
Non-Patent Citations (2)
Title |
---|
GUANGXU XUN et al.: "Topic Discovery for Short Texts Using Word Embeddings", 2016 IEEE 16th International Conference on Data Mining * |
HU Chaoju et al.: "Sentiment Analysis Based on Word Vector Technology and Hybrid Neural Networks", Application Research of Computers * |
Cited By (41)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111414483A (en) * | 2019-01-04 | 2020-07-14 | 阿里巴巴集团控股有限公司 | Document processing device and method |
WO2020140633A1 (en) * | 2019-01-04 | 2020-07-09 | 平安科技(深圳)有限公司 | Text topic extraction method, apparatus, electronic device, and storage medium |
CN109871532A (en) * | 2019-01-04 | 2019-06-11 | 平安科技(深圳)有限公司 | Text subject extracting method, device and storage medium |
CN111414483B (en) * | 2019-01-04 | 2023-03-28 | 阿里巴巴集团控股有限公司 | Document processing device and method |
CN109960802A (en) * | 2019-03-19 | 2019-07-02 | 四川大学 | The information processing method and device of narrative text are reported for aviation safety |
CN109933804A (en) * | 2019-03-27 | 2019-06-25 | 北京信息科技大学 | Merge the keyword abstraction method of subject information and two-way LSTM |
CN110334358A (en) * | 2019-04-28 | 2019-10-15 | 厦门大学 | A kind of phrase table dendrography learning method of context-aware |
CN110083710B (en) * | 2019-04-30 | 2021-04-02 | 北京工业大学 | Word definition generation method based on cyclic neural network and latent variable structure |
CN110083710A (en) * | 2019-04-30 | 2019-08-02 | 北京工业大学 | It is a kind of that generation method is defined based on Recognition with Recurrent Neural Network and the word of latent variable structure |
CN110532395A (en) * | 2019-05-13 | 2019-12-03 | 南京大学 | A kind of method for building up of the term vector improved model based on semantic embedding |
CN110532395B (en) * | 2019-05-13 | 2021-09-28 | 南京大学 | Semantic embedding-based word vector improvement model establishing method |
CN110825848A (en) * | 2019-06-10 | 2020-02-21 | 北京理工大学 | Text classification method based on phrase vectors |
CN110825848B (en) * | 2019-06-10 | 2022-08-09 | 北京理工大学 | Text classification method based on phrase vectors |
CN110263343B (en) * | 2019-06-24 | 2021-06-15 | 北京理工大学 | Phrase vector-based keyword extraction method and system |
CN110263343A (en) * | 2019-06-24 | 2019-09-20 | 北京理工大学 | The keyword abstraction method and system of phrase-based vector |
CN110457674A (en) * | 2019-06-25 | 2019-11-15 | 西安电子科技大学 | A kind of text prediction method of theme guidance |
CN110472047A (en) * | 2019-07-15 | 2019-11-19 | 昆明理工大学 | A kind of Chinese of multiple features fusion gets over news viewpoint sentence abstracting method |
CN110472047B (en) * | 2019-07-15 | 2022-12-13 | 昆明理工大学 | Multi-feature fusion Chinese-Yue news viewpoint sentence extraction method |
CN110378409A (en) * | 2019-07-15 | 2019-10-25 | 昆明理工大学 | It is a kind of based on element association attention mechanism the Chinese get over news documents abstraction generating method |
CN110781256A (en) * | 2019-08-30 | 2020-02-11 | 腾讯大地通途(北京)科技有限公司 | Method and device for determining POI (Point of interest) matched with Wi-Fi (Wireless Fidelity) based on transmitted position data |
CN110781256B (en) * | 2019-08-30 | 2024-02-23 | 腾讯大地通途(北京)科技有限公司 | Method and device for determining POI matched with Wi-Fi based on sending position data |
CN110766073A (en) * | 2019-10-22 | 2020-02-07 | 湖南科技大学 | Mobile application classification method for strengthening topic attention mechanism |
CN110766073B (en) * | 2019-10-22 | 2023-10-27 | 湖南科技大学 | Mobile application classification method for strengthening topic attention mechanism |
CN111125434A (en) * | 2019-11-26 | 2020-05-08 | 北京理工大学 | Relation extraction method and system based on ensemble learning |
CN111125434B (en) * | 2019-11-26 | 2023-06-27 | 北京理工大学 | Relation extraction method and system based on ensemble learning |
WO2021155705A1 (en) * | 2020-02-06 | 2021-08-12 | 支付宝(杭州)信息技术有限公司 | Text prediction model training method and apparatus |
CN111696624B (en) * | 2020-06-08 | 2022-07-12 | 天津大学 | DNA binding protein identification and function annotation deep learning method based on self-attention mechanism |
CN111696624A (en) * | 2020-06-08 | 2020-09-22 | 天津大学 | DNA binding protein identification and function annotation deep learning method based on self-attention mechanism |
CN111753540A (en) * | 2020-06-24 | 2020-10-09 | 云南电网有限责任公司信息中心 | Method and system for collecting text data to perform Natural Language Processing (NLP) |
CN111753540B (en) * | 2020-06-24 | 2023-04-07 | 云南电网有限责任公司信息中心 | Method and system for collecting text data to perform Natural Language Processing (NLP) |
CN112597311B (en) * | 2020-12-28 | 2023-07-11 | 东方红卫星移动通信有限公司 | Terminal information classification method and system based on low-orbit satellite communication |
CN112597311A (en) * | 2020-12-28 | 2021-04-02 | 东方红卫星移动通信有限公司 | Terminal information classification method and system based on low-earth-orbit satellite communication |
CN112632966A (en) * | 2020-12-30 | 2021-04-09 | 绿盟科技集团股份有限公司 | Alarm information marking method, device, medium and equipment |
CN112685538B (en) * | 2020-12-30 | 2022-10-14 | 北京理工大学 | Text vector retrieval method combined with external knowledge |
CN112632966B (en) * | 2020-12-30 | 2023-07-21 | 绿盟科技集团股份有限公司 | Alarm information marking method, device, medium and equipment |
CN112685538A (en) * | 2020-12-30 | 2021-04-20 | 北京理工大学 | Text vector retrieval method combined with external knowledge |
CN112699662B (en) * | 2020-12-31 | 2022-08-16 | 太原理工大学 | False information early detection method based on text structure algorithm |
CN112699662A (en) * | 2020-12-31 | 2021-04-23 | 太原理工大学 | False information early detection method based on text structure algorithm |
CN112966551A (en) * | 2021-01-29 | 2021-06-15 | 湖南科技学院 | Method and device for acquiring video frame description information and electronic equipment |
CN115763167A (en) * | 2022-11-22 | 2023-03-07 | 黄华集团有限公司 | Solid cabinet breaker and control method thereof |
CN115763167B (en) * | 2022-11-22 | 2023-09-22 | 黄华集团有限公司 | Solid cabinet circuit breaker and control method thereof |
Also Published As
Publication number | Publication date |
---|---|
CN108984526B (en) | 2021-05-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108984526A (en) | A kind of document subject matter vector abstracting method based on deep learning | |
CN109472024B (en) | Text classification method based on bidirectional circulation attention neural network | |
CN109241524B (en) | Semantic analysis method and device, computer-readable storage medium and electronic equipment | |
CN108595632B (en) | Hybrid neural network text classification method fusing abstract and main body characteristics | |
CN104834747B (en) | Short text classification method based on convolutional neural networks | |
CN111241294B (en) | Relationship extraction method of graph convolution network based on dependency analysis and keywords | |
CN111738003B (en) | Named entity recognition model training method, named entity recognition method and medium | |
CN109753660B (en) | LSTM-based winning bid web page named entity extraction method | |
CN110134946B (en) | Machine reading understanding method for complex data | |
CN113011533A (en) | Text classification method and device, computer equipment and storage medium | |
CN108376131A (en) | Keyword abstraction method based on seq2seq deep neural network models | |
CN110717332B (en) | News and case similarity calculation method based on asymmetric twin network | |
CN113392209B (en) | Text clustering method based on artificial intelligence, related equipment and storage medium | |
CN110688862A (en) | Mongolian-Chinese inter-translation method based on transfer learning | |
Cai et al. | Intelligent question answering in restricted domains using deep learning and question pair matching | |
CN108733647B (en) | Word vector generation method based on Gaussian distribution | |
CN111581967B (en) | News theme event detection method combining LW2V with triple network | |
CN112069312B (en) | Text classification method based on entity recognition and electronic device | |
CN110619121A (en) | Entity relation extraction method based on improved depth residual error network and attention mechanism | |
CN113255366B (en) | Aspect-level text emotion analysis method based on heterogeneous graph neural network | |
CN115392252A (en) | Entity identification method integrating self-attention and hierarchical residual error memory network | |
CN114818717A (en) | Chinese named entity recognition method and system fusing vocabulary and syntax information | |
CN114417851A (en) | Emotion analysis method based on keyword weighted information | |
CN110569355B (en) | Viewpoint target extraction and target emotion classification combined method and system based on word blocks | |
CN106610949A (en) | Text feature extraction method based on semantic analysis |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |