CN108984526A - Document topic vector extraction method based on deep learning - Google Patents

Document topic vector extraction method based on deep learning Download PDF

Info

Publication number
CN108984526A
CN108984526A
Authority
CN
China
Prior art keywords
vector
indicate
moment
context
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810748564.1A
Other languages
Chinese (zh)
Other versions
CN108984526B (en)
Inventor
高扬
黄河燕
陆池
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN201810748564.1A priority Critical patent/CN108984526B/en
Publication of CN108984526A publication Critical patent/CN108984526A/en
Application granted granted Critical
Publication of CN108984526B publication Critical patent/CN108984526B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/258Heading extraction; Automatic titling; Numbering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The present invention relates to a document topic vector extraction method based on deep learning, belonging to the technical field of natural language processing. The method extracts deep, local semantic information using a convolutional neural network, and learns sequential information using an LSTM model, so that the semantics of the vector are more comprehensive. It selects the implicit co-occurrence relation between context phrases and document topics, avoiding the shortcomings that sentence-based topic vector models exhibit on short texts. The CNN and LSTM models are organically combined through an attention mechanism, learning the deep semantics, sequential information, and salient information of the context and thereby constructing a more effective document-level topic vector extraction model.

Description

Document topic vector extraction method based on deep learning
Technical field
The present invention relates to a document topic vector extraction method based on deep learning, and belongs to the technical field of natural language processing.
Background technique
In today's era of big data, discovering the topics of massive internet text data is a research focus. A document topic vector essentially represents the deep semantics of a document, combining its topical and semantic content. Extracted document topic vectors can be widely used in natural language processing tasks, including public opinion analysis of social networks and new media, timely acquisition of hot news, and so on. Therefore, how to extract document topic vectors efficiently is an important research subject.
For text data, the topic is not necessarily embodied directly in specific word content, which makes mining the implicit topics of a text difficult: the topical meaning of a document must be extracted from the relationships among its words, sentences, and paragraphs, combined with the discourse structure of the document. In recent years, with the enrichment of statistical natural language processing methods and corpora, text topic modeling methods based on word-topic and document-topic relations have been proposed one after another. Their basic idea is to assume that the topic of each word and document obeys a statistical probability distribution; the topic distribution of each document is computed by training on corpus data, and documents are then clustered according to these topic distributions.
To correctly analyze the topic of each document, the conventional approach performs topic analysis on every word of the text. This approach, however, has one major drawback: the words that actually determine the topic of a text account for only a small portion of its words, so conventional methods analyze a large number of topic-irrelevant words. On the one hand, these irrelevant words cause a heavy computational burden; on the other hand, the extracted topics can be inaccurate, because the method cannot exploit the internal relatedness of the text to mine its deep semantics.
With the improvement of hardware performance and the continuous expansion of data scale, deep learning has been widely applied in many fields, significantly improving experimental results over previous baselines. With its elegant models and flexible architectures, deep learning has in recent years been used extensively in combination with word embedding and document embedding methods. Among all deep learning methods, CNN (Convolutional Neural Network) and the LSTM model (Long Short-Term Memory network) are the two most mainstream. In natural language processing tasks, text analysis methods based on CNN and LSTM models can effectively discover the latent semantic features of text, greatly aiding semantic analysis tasks such as automatic summarization, sentiment analysis, and machine translation.
Summary of the invention
The purpose of the invention is to overcome the deficiencies of the prior art and to solve the problem of how to mine the deep semantics of a text by exploiting its internal relatedness; to this end, a document topic vector extraction method based on deep learning is proposed. In modeling document topic vectors, the present invention focuses on analyzing the document's content, mining text features and the implicit correlations of the topic vector, so as to learn document topic vectors.
The core idea of the invention is as follows: the semantics of the context phrase are extracted with a CNN, and the extracted semantics are fed into an LSTM model; an attention mechanism extracts the importance of different positions and different words of the text, so that important information is retained. This also accomplishes the combination of the CNN and LSTM models, mines the internal associations within the context, and learns document topic vectors that carry deep semantics and salient information.
The method of the present invention is achieved through the following technical solutions.
A document topic vector extraction method based on deep learning, comprising the following steps:
Step 1: establish the following definitions:
Definition 1: document D, D = [w_1, w_2, ..., w_i, ..., w_n], where w_i denotes the i-th word of document D;
Definition 2: predicted word w_{d+1}, the target word to be learned;
Definition 3: window words, composed of several consecutively occurring words in the text; hidden internal associations exist between window words;
Definition 4: context phrase, the window words occurring before the position of the predicted word, with window length l; the context phrase is denoted w_{d-l}, w_{d-l+1}, ..., w_d;
Definition 5: document topic mapping matrix, learned by the LDA algorithm (Latent Dirichlet Allocation); each row represents the topic of one document;
Definition 6: N_d and doc_id, where N_d denotes the number of documents in the corpus and doc_id denotes the position of a document; each document corresponds to a unique doc_id, with 1 ≤ doc_id ≤ N_d.
Step 2: learn the semantic vector of the context phrase using a CNN.
Step 3: use the LSTM model to learn the semantics of the context phrase, obtaining the hidden-layer vectors h_{d-l}, h_{d-l+1}, ..., h_d.
Step 4: organically combine the CNN and LSTM models through an attention mechanism, obtaining the average value h̄' of the context phrase semantic vectors.
Step 5: by the method of logistic regression, predict the target word w_{d+1} using the average value h̄' of the context phrase semantic vectors and the document topic information, obtaining the prediction probability of the target word w_{d+1}.
Beneficial effects
Compared with the prior art, the document topic vector extraction method based on deep learning of the present invention has the following beneficial effects:
1. It extracts deep, local semantic information using a CNN;
2. It learns sequential information using the LSTM model, so that the semantics of the vector are more comprehensive;
3. It selects the implicit co-occurrence relation between context phrases and document topics, avoiding the shortcomings that sentence-based topic vector models exhibit on short texts;
4. It organically combines the CNN and LSTM models using an attention mechanism, learning the deep semantics, sequential information, and salient information of the context, and thereby constructing a more effective document-level topic vector extraction model.
Detailed description of the invention
Fig. 1 is a flowchart of the document topic vector extraction method based on deep learning of the present invention.
Specific embodiment
In order to make the objectives, technical solutions, and advantages of the present invention clearer, the extraction method of the present invention is further described below with reference to the accompanying drawing and an embodiment.
The basic implementation process of the document topic vector extraction method based on deep learning is as follows:
Step 1: establish the following definitions:
Definition 1: document D, D = [w_1, w_2, ..., w_i, ..., w_n], where w_i denotes the i-th word of document D;
Definition 2: predicted word w_{d+1}, the target word to be learned;
Definition 3: window words, composed of several consecutively occurring words in the text; hidden internal associations exist between window words;
Definition 4: context phrase (w_{d-l}, w_{d-l+1}, ..., w_d), the window words occurring before the position of the predicted word, with context phrase length l;
Definition 5: document topic mapping matrix, learned by the LDA algorithm; each row represents the topic of one document;
Definition 6: N_d and doc_id, where N_d denotes the number of documents in the corpus and doc_id denotes the position of a document; each document corresponds to a unique doc_id, with 1 ≤ doc_id ≤ N_d.
Step 2: learn the semantic vector Context of the context phrase using the CNN.
The specific implementation process is as follows:
Step 2.1: train the word vector matrix of document D using an algorithm such as word2vec; the word vector matrix has size n × m, where n is the length of the word vector matrix and m is its width;
Step 2.2: extract the word vector corresponding to each word of the context phrase from the word vector matrix obtained in step 2.1, obtaining the vector matrix M of the context phrase w_{d-l}, w_{d-l+1}, ..., w_d;
Step 2.3: compute the semantic vector Context of the context phrase using the CNN, specifically by operating on the vector matrix M obtained in step 2.2 with K convolution kernels of size C_l × C_m;
where K is the number of convolution kernels (K = 128 in this embodiment), C_l is the length of a convolution kernel with C_l = l, and C_m is the width of a convolution kernel with C_m = m.
The semantic vector Context of the context phrase is calculated by formula (1):

Context_k = Σ_{p=1}^{l} Σ_{q=1}^{m} c_{pq} · M_{pq} + b,  1 ≤ k ≤ K  (1)
Context = [Context_1, Context_2, ..., Context_K]

where Context_k is the k-th dimension of the semantic vector of the context phrase, l is the context phrase length, m is the width of the word vector matrix (i.e., the word vector dimension), d is the starting position of the first word of the context phrase, c_{pq} is the weight parameter at row p and column q of the convolution kernel, M_{pq} is the entry at row p and column q of the vector matrix M, and b is the bias parameter of the convolution kernel;
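For concreteness, the convolution of step 2.3 can be sketched as follows. This is a minimal illustration under the reconstruction of formula (1) above (a plain weighted sum with a bias and no activation), not the patented implementation itself; the kernel count K = 128 and Gaussian initialization come from the embodiment, and all other names are illustrative.

```python
import numpy as np

def context_semantic_vector(M, kernels, biases):
    """Compute the context-phrase semantic vector of formula (1).

    M       : (l, m) matrix of stacked word vectors of the context phrase
    kernels : (K, l, m) array, one C_l x C_m kernel per output dimension
    biases  : (K,) bias parameters
    Returns the K-dimensional vector Context.
    """
    # Context_k = sum_{p,q} c_pq * M_pq + b  (one scalar per kernel)
    return np.einsum('kpq,pq->k', kernels, M) + biases

l, m, K = 6, 128, 128                            # window length, word-vector dim, kernel count
rng = np.random.default_rng(0)
M = rng.normal(size=(l, m))                      # word vectors of the context phrase
kernels = rng.normal(scale=0.1, size=(K, l, m))  # Gaussian-initialized kernels
context = context_semantic_vector(M, kernels, np.zeros(K))
print(context.shape)                             # (128,)
```

Because each kernel has exactly the size of the context-phrase matrix (C_l = l, C_m = m), each kernel produces a single scalar, so K kernels together yield the K-dimensional vector Context.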
Step 3: use the LSTM model to learn the semantics of the context phrase, obtaining the hidden-layer vectors h_{d-l}, h_{d-l+1}, ..., h_d.
The specific implementation process is as follows:
Step 3.1: assign d-l to t, i.e. t = d-l, where t denotes the time step;
Step 3.2: assign the word vector of w_t to x_t, where x_t is the word vector input at time t and w_t is the word input at time t;
here the word vector of w_t is obtained by lookup in the word vector matrix output by step 2.1, i.e. the word vector at the position of w_t in the vector matrix M is extracted;
Step 3.3: take x_t as the input of the LSTM model and obtain the hidden-layer vector h_t of time t.
The specific implementation process is as follows:
Step 3.3.1: compute the forget gate f_t of time t, which controls what information is forgotten, by formula (2):

f_t = σ(W_f x_t + U_f h_{t-1} + b_f)  (2)

where W_f and U_f are parameter matrices, x_t is the word vector input at time t, h_{t-1} is the hidden-layer vector of time t-1, and b_f is a bias vector parameter; when t = d-l, h_{t-1} = h_{d-l-1}, and h_{d-l-1} is the zero vector; σ denotes the Sigmoid function, the activation function of the LSTM model;
Step 3.3.2: compute the input gate i_t of time t, which controls what new information is added at the current time, by formula (3):

i_t = σ(W_i x_t + U_i h_{t-1} + b_i)  (3)

where W_i and U_i are parameter matrices, x_t is the word vector input at time t, h_{t-1} is the hidden-layer vector of time t-1, and b_i is a bias vector parameter; σ denotes the Sigmoid function, the activation function of the LSTM model;
Step 3.3.3: compute the candidate information c̃_t updated at time t by formula (4):

c̃_t = tanh(W_c x_t + U_c h_{t-1} + b_c)  (4)

where W_c and U_c are parameter matrices, x_t is the word vector input at time t, h_{t-1} is the hidden-layer vector of time t-1, and b_c is a bias vector parameter; tanh denotes the hyperbolic tangent function, an activation function of the LSTM model;
Step 3.3.4: compute the information of time t by adding the information of the previous time to the information updated at the current time, by formula (5):

c_t = f_t ∘ c_{t-1} + i_t ∘ c̃_t  (5)

where c_t is the information of time t, f_t is the forget gate of time t, c_{t-1} is the information of time t-1, i_t is the input gate of time t, c̃_t is the information updated at time t, and ∘ denotes the element-wise product of vectors;
Step 3.3.5: compute the output gate o_t of time t, which controls the output information, by formula (6):

o_t = σ(W_o x_t + U_o h_{t-1} + b_o)  (6)

where W_o and U_o are parameter matrices, x_t is the word vector input at time t, h_{t-1} is the hidden-layer vector of time t-1, and b_o is a bias vector parameter; σ denotes the Sigmoid function, the activation function of the LSTM model; the parameter matrices W_f, U_f, W_i, U_i, W_c, U_c, W_o, U_o in steps 3.3.1-3.3.3 and 3.3.5 have different element values, and the bias vector parameters b_f, b_i, b_c, b_o likewise have different element values;
Step 3.3.6: compute the hidden-layer vector h_t of time t by formula (7):

h_t = o_t ∘ c_t  (7)

where o_t is the output gate of time t and c_t is the information of time t;
Step 3.4: judge whether t equals d; if not, add 1 to t and return to step 3.2; if so, output the hidden-layer vectors h_{d-l}, h_{d-l+1}, ..., h_d and proceed to step 4.
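The recurrence of steps 3.3.1-3.3.6 can be sketched per time step as below, following formulas (2)-(7) as reconstructed above. Note that formula (7) applies the output gate directly to c_t, without the tanh of the textbook LSTM; the sketch keeps that choice. Dimensions and initialization are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, P):
    """One step of formulas (2)-(7). P holds W_*, U_*, b_* for gates f, i, c, o."""
    f = sigmoid(P['Wf'] @ x_t + P['Uf'] @ h_prev + P['bf'])        # forget gate, (2)
    i = sigmoid(P['Wi'] @ x_t + P['Ui'] @ h_prev + P['bi'])        # input gate, (3)
    c_tilde = np.tanh(P['Wc'] @ x_t + P['Uc'] @ h_prev + P['bc'])  # updated info, (4)
    c = f * c_prev + i * c_tilde                                   # cell information, (5)
    o = sigmoid(P['Wo'] @ x_t + P['Uo'] @ h_prev + P['bo'])        # output gate, (6)
    h = o * c                                                      # hidden vector, (7)
    return h, c

dim, hid = 128, 128
rng = np.random.default_rng(1)
P = {f'{g}{n}': rng.normal(scale=0.1, size=(hid, dim if g == 'W' else hid))
     for g in 'WU' for n in 'fico'}
P.update({f'b{n}': np.zeros(hid) for n in 'fico'})

h, c = np.zeros(hid), np.zeros(hid)      # h_{d-l-1} is the zero vector
hidden = []
for x_t in rng.normal(size=(6, dim)):    # word vectors of w_{d-l}, ..., w_d
    h, c = lstm_step(x_t, h, c, P)
    hidden.append(h)                     # h_{d-l}, ..., h_d
```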
Step 4: combine the CNN and LSTM models using the attention mechanism, obtaining the average value h̄' of the context phrase semantic vectors. The specific implementation process is as follows:
Step 4.1: using the context phrase semantic vector obtained in step 2, obtain the importance factor α of each word on the semantic vector of the context phrase through the attention mechanism, calculated by formula (8):

α_t = exp(Context^T x_t) / Σ_{i=d-l}^{d} exp(Context^T x_i),  d-l ≤ t ≤ d  (8)
α = [α_{d-l}, α_{d-l+1}, ..., α_d]

where α_t is the importance factor of the word at time t on the semantic vector of the context phrase, Context is the semantic vector of the context phrase obtained in step 2, x_t is the word vector input at time t, x_i is the word vector input at time i, T denotes vector transposition, and exp denotes the exponential function with base e, the natural constant;
Step 4.2: compute the hidden-layer vectors h' weighted by the attention mechanism, calculated by formula (9):

h'_t = α_t * h_t,  d-l ≤ t ≤ d  (9)
h' = [h'_{d-l}, h'_{d-l+1}, ..., h'_d]

where h'_t is the weighted hidden-layer vector of time t, α_t is the importance factor of the word at time t on the semantic vector of the context phrase, and h_t is the hidden-layer vector of time t;
Step 4.3: using the mean-pooling operation, compute the average value h̄' of the context phrase semantic vectors by formula (10):

h̄' = (1 / (l+1)) Σ_{t=d-l}^{d} h'_t  (10)

where h'_t is the weighted hidden-layer vector of time t;
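Steps 4.1-4.3 amount to a softmax over dot products between the CNN semantic vector and the word vectors, followed by attention weighting and a mean. A minimal sketch, assuming the dot-product form of formula (8) reconstructed above (which requires the word vector dimension m to equal the kernel count K; both are 128 in the embodiment):

```python
import numpy as np

def attention_pool(context, X, H):
    """Formulas (8)-(10): attention weights, weighted hidden vectors, mean pooling.

    context : (K,) semantic vector from the CNN (step 2)
    X       : (l+1, m) word vectors x_{d-l}, ..., x_d
    H       : (l+1, hid) hidden vectors h_{d-l}, ..., h_d from the LSTM (step 3)
    """
    scores = X @ context                             # Context^T x_t for each t
    scores -= scores.max()                           # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()    # importance factors, (8)
    H_weighted = alpha[:, None] * H                  # h'_t = alpha_t * h_t, (9)
    return H_weighted.mean(axis=0)                   # mean-pooling, (10)

rng = np.random.default_rng(2)
pooled = attention_pool(rng.normal(size=128),
                        rng.normal(size=(6, 128)),
                        rng.normal(size=(6, 128)))
print(pooled.shape)                                  # (128,)
```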
Step 5: by the method of logistic regression, predict the target word w_{d+1} using the average value of the context phrase semantic vectors together with the document topic information, obtaining the prediction probability of the target word w_{d+1}. The specific implementation process is as follows:
Step 5.1: learn the document topic mapping matrix using the LDA algorithm, then map each document, according to the document topic mapping matrix and doc_id, to a vector D_z whose length equals the width of the word vector matrix in step 2.1;
Step 5.2: concatenate the vector D_z output by step 5.1 with the average value h̄' of the context phrase semantic vectors output by step 4, obtaining the concatenated vector V_d = [D_z; h̄'];
Step 5.3: use the V_d output by step 5.2 to predict the target word w_{d+1}, specifically by classification with the logistic regression method, with the objective function of formula (11):

P(y = w_{d+1} | V_d) = exp(θ_{d+1}^T V_d) / Σ_{i=1}^{|V|} exp(θ_i^T V_d)  (11)

where θ_{d+1} is the parameter corresponding to the position of the target word w_{d+1}, θ_i is the parameter corresponding to word w_i of the vocabulary, |V| is the size of the vocabulary, V_d is the concatenated vector obtained in step 5.2, exp denotes the exponential function with base e, Σ denotes summation, P denotes probability, y denotes the dependent variable, and T denotes matrix transposition;
Step 5.4: using the cross-entropy method, compute the loss function of objective function (11) by formula (12):

L = -log(P(y = w_{d+1} | V_d))  (12)

where w_{d+1} is the target word, V_d is the concatenated vector of step 5.2, and log(·) denotes the base-10 logarithm function;
The loss function (12) is updated and solved by the Sampled Softmax algorithm and the mini-batch stochastic gradient descent parameter update method, yielding the document topic vector.
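A sketch of the prediction and loss of steps 5.2-5.4, using a full softmax in place of Sampled Softmax for clarity; all variable names are illustrative:

```python
import numpy as np

def predict_loss(h_bar, D_z, theta, target_idx):
    """Formulas (11)-(12): concatenate, softmax over the vocabulary, cross-entropy.

    h_bar      : mean-pooled context vector from step 4
    D_z        : LDA-derived document topic vector from step 5.1
    theta      : (|V|, dim(V_d)) logistic-regression parameters
    target_idx : vocabulary index of the target word w_{d+1}
    """
    V_d = np.concatenate([D_z, h_bar])               # concatenated vector V_d (step 5.2)
    logits = theta @ V_d                             # theta_i^T V_d for every word
    logits -= logits.max()                           # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()    # formula (11)
    # base-10 log, as stated for formula (12)
    return probs[target_idx], -np.log10(probs[target_idx])

rng = np.random.default_rng(3)
p, loss = predict_loss(rng.normal(size=128), rng.normal(size=128),
                       rng.normal(scale=0.1, size=(20000, 256)), target_idx=7)
```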
Thus, from step 1 to step 5, the document topic vector with deep semantics and salient information is obtained.
Embodiment
This embodiment describes a specific implementation process of the invention, as shown in Fig. 1.
As can be seen from Fig. 1, the flow of the document topic vector extraction method based on deep learning of the present invention is as follows:
Step A: preprocessing. First, meaningless symbols in the corpus, such as special characters, are removed, and the text is then segmented into words. Word segmentation is the process of dividing a continuous word sequence into individual words according to set morphological rules, thereby decomposing sentences into several consecutive meaningful word strings for subsequent analysis. Segmentation is performed with the PTB tokenizer. After segmentation, a vocabulary is built from the original text; in this embodiment, the vocabulary consists of the 20000 words with the highest frequency in the training text, i.e. the size of vocabulary V is 20000. After the vocabulary is chosen, the vocabulary index data of the original corpus are constructed according to the vocabulary indices, and these index data serve as the input of the model.
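A minimal sketch of this preprocessing, with a crude regex cleanup and whitespace split standing in for the PTB tokenizer (an assumption; the patent names the tokenizer but not its invocation):

```python
from collections import Counter
import re

def build_vocab_and_index(texts, vocab_size=20000):
    """Strip special characters, tokenize, keep the top-N words, index the corpus."""
    tokenized = [re.sub(r'[^\w\s]', ' ', t).split() for t in texts]
    counts = Counter(w for doc in tokenized for w in doc)
    vocab = {w: i for i, (w, _) in enumerate(counts.most_common(vocab_size))}
    unk = len(vocab)                                   # index for out-of-vocabulary words
    indexed = [[vocab.get(w, unk) for w in doc] for doc in tokenized]
    return vocab, indexed

vocab, indexed = build_vocab_and_index(["a small example document", "another document"])
```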
Step B: learn word vectors using the word2vec algorithm. The words of the documents are input into the word2vec algorithm to obtain word vectors, with the objective function of formula (13):

L = Σ_{i=1}^{Corp} log P(w_i | w_{i-k}, ..., w_{i-1}, w_{i+1}, ..., w_{i+k})  (13)

where k is the window size, i indexes the current word, and Corp is the number of words in the corpus; 128-dimensional word vectors are learned by the gradient descent method.
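In practice this step maps directly onto an off-the-shelf word2vec implementation. A sketch using gensim — an assumed tool, since the patent names only the algorithm — with the 128-dimension setting from the embodiment:

```python
from gensim.models import Word2Vec

# `tokenized_docs` is the segmented corpus from step A: a list of token lists.
tokenized_docs = [["a", "small", "example", "document"], ["another", "document"]]
w2v = Word2Vec(sentences=tokenized_docs, vector_size=128, window=5,
               min_count=1, sg=0)    # CBOW, matching the reconstructed objective (13)
vec = w2v.wv["document"]             # a 128-dimensional word vector
```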
Step C: extract the context phrase semantic vector using the CNN, and learn the context phrase hidden-layer vectors using the LSTM;
these two computations — extracting the context phrase semantic vector with the CNN and learning the context phrase hidden-layer vectors with the LSTM — are carried out in parallel. Specific to this embodiment:
Extracting the context phrase semantic vector with the CNN: first, K convolution kernels of size C_l × C_m are randomly initialized from a Gaussian distribution; for a given context phrase w_{d-l}, w_{d-l+1}, ..., w_d, the word vectors obtained in step B map the phrase to a matrix of size l × m, where l is the length of the context phrase and m is the dimension of the word vectors; this matrix is convolved with the randomly initialized kernels, with the concrete operation shown in formula (1), yielding a vector Context, which is the semantic vector of the context phrase;
Learning the context phrase hidden-layer vectors with the LSTM: the word vectors corresponding to the context phrase w_{d-l}, w_{d-l+1}, ..., w_d are input into the LSTM model in sequence; each dimension of the hidden-layer vector h_0 of time 0 is set to 0, and then the forget gate, input gate, output gate, and finally the context phrase hidden-layer vectors are computed in turn using formulas (2)-(7), with the dimension size set to 128;
Step D: compute the weighted semantic vectors using the attention mechanism, and compute the document topic distribution;
these two computations — the attention-weighted semantic vectors and the document topic distribution — are carried out in parallel. Specific to this embodiment:
Computing the weighted semantic vectors with the attention mechanism: the attention operation is applied to the word vectors obtained in step B and the context phrase semantic vector obtained in step C, producing an attention factor α_t for each word of the context phrase; α_t is a real number between 0 and 1, and the larger it is, the more of the word vector information at its position is retained in the final mean-pooling layer. Its magnitude thus indicates the importance of the current word in characterizing the meaning of the whole phrase; in other words, more important words receive more attention;
Computing the document topic distribution: this is done with the LDA algorithm; document D is input into the LDA algorithm, yielding the topic distribution of each document D, which is taken directly as the final result and denoted D_z;
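The LDA step can likewise be sketched with gensim (again an assumed tool); the topic count of 128 is chosen here so that D_z matches the vector width of step 5.1, which is itself an illustrative reading:

```python
from gensim import corpora
from gensim.models import LdaModel

tokenized_docs = [["a", "small", "example", "document"], ["another", "document"]]
dictionary = corpora.Dictionary(tokenized_docs)            # vocabulary from step A
bow = [dictionary.doc2bow(doc) for doc in tokenized_docs]
lda = LdaModel(bow, num_topics=128, id2word=dictionary)
# Topic distribution of one document, as a list of (topic, probability) pairs.
D_z = lda.get_document_topics(bow[0], minimum_probability=0.0)
```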
Step E: predict the target word and learn the document topic vector. The weighted semantic vector and D_z are directly concatenated, the probability of the target word appearing is maximized, and the document topic vector is obtained through the Sampled Softmax algorithm and the mini-batch stochastic gradient descent parameter update method.
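The parameter update of step E can be sketched as one stochastic gradient step on loss (12); the full-softmax gradient below stands in for Sampled Softmax, the natural log stands in for the base-10 log (they differ only by a constant factor), and only θ is updated, for brevity:

```python
import numpy as np

def sgd_step(theta, V_d, target_idx, lr=0.1):
    """One mini-batch-of-one SGD update of the softmax parameters theta on loss (12)."""
    scores = theta @ V_d
    scores -= scores.max()                         # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()  # formula (11)
    grad = probs.copy()
    grad[target_idx] -= 1.0                        # d(-log P)/d(scores)
    theta -= lr * np.outer(grad, V_d)              # gradient step w.r.t. theta
    return theta
```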

Claims (5)

1. A document topic vector extraction method based on deep learning, characterized by comprising the following steps:
Step 1: establish the following definitions:
Definition 1: document D, D = [w_1, w_2, ..., w_i, ..., w_n], where w_i denotes the i-th word of document D;
Definition 2: predicted word w_{d+1}, the target word to be learned;
Definition 3: window words, composed of consecutively occurring words in the text; hidden internal associations exist between window words;
Definition 4: context phrase w_{d-l}, w_{d-l+1}, ..., w_d, the window words occurring before the position of the predicted word, with context phrase length l;
Definition 5: document topic mapping matrix, learned by the LDA algorithm; each row represents the topic of one document;
Definition 6: N_d and doc_id, where N_d denotes the number of documents in the corpus and doc_id denotes the position of a document; each document corresponds to a unique doc_id, with 1 ≤ doc_id ≤ N_d;
Step 2: learn the semantic vector of the context phrase using a convolutional neural network (CNN);
Step 3: use the long short-term memory network model (LSTM) to learn the semantics of the context phrase, obtaining the hidden-layer vectors h_{d-l}, h_{d-l+1}, ..., h_d; the specific implementation process is as follows:
Step 3.1: assign d-l to t, i.e. t = d-l, where t denotes the time step;
Step 3.2: assign the word vector of w_t to x_t, where x_t is the word vector input at time t and w_t is the word input at time t;
here the word vector of w_t is obtained by lookup in the word vector matrix output by step 2.1, i.e. the word vector at the position of w_t in the vector matrix M is extracted;
Step 3.3: take x_t as the input of the LSTM model and obtain the hidden-layer vector h_t of time t;
Step 3.4: judge whether t equals d; if not, add 1 to t and return to step 3.2; if so, output the hidden-layer vectors h_{d-l}, h_{d-l+1}, ..., h_d and proceed to step 4;
Step 4: organically combine the CNN and LSTM models through an attention mechanism, obtaining the average value h̄' of the context phrase semantic vectors; the concrete realization method is as follows:
Step 4.1: using the context phrase semantic vector obtained in step 2, obtain the importance factor α of each word on the semantic vector of the context phrase through the attention mechanism, calculated by the following formula:

α_t = exp(Context^T x_t) / Σ_{i=d-l}^{d} exp(Context^T x_i),  d-l ≤ t ≤ d
α = [α_{d-l}, α_{d-l+1}, ..., α_d]

where α_t is the importance factor of the word at time t on the semantic vector of the context phrase, Context is the semantic vector of the context phrase obtained in step 2, x_t is the word vector input at time t, x_i is the word vector input at time i, T denotes vector transposition, and exp denotes the exponential function with base e, the natural constant;
Step 4.2: compute the hidden-layer vectors h' weighted by the attention mechanism, calculated by the following formula:

h'_t = α_t * h_t,  d-l ≤ t ≤ d
h' = [h'_{d-l}, h'_{d-l+1}, ..., h'_d]

where h'_t is the weighted hidden-layer vector of time t, α_t is the importance factor of the word at time t on the semantic vector of the context phrase, and h_t is the hidden-layer vector of time t;
Step 4.3: using the mean-pooling operation, compute the average value h̄' of the context phrase semantic vectors by formula (10):

h̄' = (1 / (l+1)) Σ_{t=d-l}^{d} h'_t  (10)

where h'_t is the weighted hidden-layer vector of time t;
Step 5: by the method of logistic regression, predict the target word w_{d+1} using the average value h̄' of the context phrase semantic vectors and the document topic information, obtaining the prediction probability of the target word w_{d+1}.
2. The document topic vector extraction method based on deep learning according to claim 1, characterized in that the concrete realization method of step 2 is as follows:
Step 2.1: train the word vector matrix of document D; the word vector matrix has size n × m, where n is the length of the word vector matrix and m is its width;
Step 2.2: extract the word vector corresponding to each word of the context phrase from the word vector matrix obtained in step 2.1, obtaining the vector matrix M of the context phrase w_{d-l}, w_{d-l+1}, ..., w_d;
Step 2.3: compute the semantic vector Context of the context phrase using the CNN, specifically by operating on the vector matrix M obtained in step 2.2 with K convolution kernels of size C_l × C_m;
where K is the number of convolution kernels, C_l is the length of a convolution kernel with C_l = l, and C_m is the width of a convolution kernel with C_m = m;
the semantic vector Context of the context phrase is calculated by formula (1):

Context_k = Σ_{p=1}^{l} Σ_{q=1}^{m} c_{pq} · M_{pq} + b,  1 ≤ k ≤ K  (1)
Context = [Context_1, Context_2, ..., Context_K]

where Context_k is the k-th dimension of the context phrase semantic vector, l is the context phrase length, m is the width of the word vector matrix (i.e., the word vector dimension), d is the starting position of the first word of the context phrase, c_{pq} is the weight parameter at row p and column q of the convolution kernel, M_{pq} is the entry at row p and column q of the vector matrix M, and b is the bias parameter of the convolution kernel.
3. The document topic vector extraction method based on deep learning according to claim 1, characterized in that the concrete realization method of step 3.3 is as follows:
Step 3.3.1: compute the forget gate f_t of time t, which controls what information is forgotten, by formula (2):

f_t = σ(W_f x_t + U_f h_{t-1} + b_f)  (2)

where W_f and U_f are parameter matrices, x_t is the word vector input at time t, h_{t-1} is the hidden-layer vector of time t-1, and b_f is a bias vector parameter; when t = d-l, h_{t-1} = h_{d-l-1}, and h_{d-l-1} is the zero vector; σ denotes the Sigmoid function, the activation function of the LSTM model;
Step 3.3.2: compute the input gate i_t of time t, which controls what new information is added at the current time, by formula (3):

i_t = σ(W_i x_t + U_i h_{t-1} + b_i)  (3)

where W_i and U_i are parameter matrices, x_t is the word vector input at time t, h_{t-1} is the hidden-layer vector of time t-1, and b_i is a bias vector parameter; σ denotes the Sigmoid function, the activation function of the LSTM model;
Step 3.3.3: compute the candidate information c̃_t updated at time t by formula (4):

c̃_t = tanh(W_c x_t + U_c h_{t-1} + b_c)  (4)

where W_c and U_c are parameter matrices, x_t is the word vector input at time t, h_{t-1} is the hidden-layer vector of time t-1, and b_c is a bias vector parameter; tanh denotes the hyperbolic tangent function, an activation function of the LSTM model;
Step 3.3.4: compute the information of time t by adding the information of the previous time to the information updated at the current time, by formula (5):

c_t = f_t ∘ c_{t-1} + i_t ∘ c̃_t  (5)

where c_t is the information of time t, f_t is the forget gate of time t, c_{t-1} is the information of time t-1, i_t is the input gate of time t, c̃_t is the information updated at time t, and ∘ denotes the element-wise product of vectors;
Step 3.3.5: compute the output gate o_t of time t, which controls the output information, by formula (6):

o_t = σ(W_o x_t + U_o h_{t-1} + b_o)  (6)

where W_o and U_o are parameter matrices, x_t is the word vector input at time t, h_{t-1} is the hidden-layer vector of time t-1, and b_o is a bias vector parameter; σ denotes the Sigmoid function, the activation function of the LSTM model; the parameter matrices W_f, U_f, W_i, U_i, W_c, U_c, W_o, U_o have different element values, and the bias vector parameters b_f, b_i, b_c, b_o likewise have different element values;
Step 3.3.6: compute the hidden-layer vector h_t of time t by formula (7):

h_t = o_t ∘ c_t  (7)

where o_t is the output gate of time t and c_t is the information of time t.
4. The document topic vector extraction method based on deep learning according to claim 1, characterized in that the concrete realization method of step 5 is as follows:
Step 5.1: learn the document topic mapping matrix, then map each document, according to the document topic mapping matrix and doc_id, to a vector D_z whose length equals the width of the word vector matrix in step 2.1;
Step 5.2: concatenate the vector D_z output by step 5.1 with the average value h̄' of the context phrase semantic vectors output by step 4, obtaining the concatenated vector V_d = [D_z; h̄'];
Step 5.3: use the V_d output by step 5.2 to predict the target word w_{d+1};
Step 5.4: using the cross-entropy method, compute the loss function of objective function (11) by formula (12):

L = -log(P(y = w_{d+1} | V_d))  (12)

where w_{d+1} is the target word, V_d is the concatenated vector of step 5.2, and log(·) denotes the base-10 logarithm function;
the loss function (12) is updated and solved by the Sampled Softmax algorithm and the mini-batch stochastic gradient descent parameter update method, yielding the document topic vector.
5. The document topic vector extraction method based on deep learning according to claim 4, characterized in that in step 5.3, classification is performed by the method of logistic regression, with the objective function of formula (11):

P(y = w_{d+1} | V_d) = exp(θ_{d+1}^T V_d) / Σ_{i=1}^{|V|} exp(θ_i^T V_d)  (11)

where θ_{d+1} is the parameter corresponding to the position of the target word w_{d+1}, θ_i is the parameter corresponding to word w_i of the vocabulary, |V| is the size of the vocabulary, V_d is the concatenated vector obtained in step 5.2, exp denotes the exponential function with base e, Σ denotes summation, P denotes probability, y denotes the dependent variable, and T denotes matrix transposition.
CN201810748564.1A 2018-07-10 2018-07-10 Document theme vector extraction method based on deep learning Active CN108984526B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810748564.1A CN108984526B (en) 2018-07-10 2018-07-10 Document theme vector extraction method based on deep learning

Publications (2)

Publication Number Publication Date
CN108984526A true CN108984526A (en) 2018-12-11
CN108984526B CN108984526B (en) 2021-05-07

Family

ID=64536620

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810748564.1A Active CN108984526B (en) 2018-07-10 2018-07-10 Document theme vector extraction method based on deep learning

Country Status (1)

Country Link
CN (1) CN108984526B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105069143A (en) * 2015-08-19 2015-11-18 百度在线网络技术(北京)有限公司 Method and device for extracting keywords from document
CN106547735A (en) * 2016-10-25 2017-03-29 复旦大学 The structure and using method of the dynamic word or word vector based on the context-aware of deep learning
CN106909537A (en) * 2017-02-07 2017-06-30 中山大学 A kind of polysemy analysis method based on topic model and vector space
CN106919557A (en) * 2017-02-22 2017-07-04 中山大学 A kind of document vector generation method of combination topic model
CN107092596A (en) * 2017-04-24 2017-08-25 重庆邮电大学 Text emotion analysis method based on attention CNNs and CCR
CN107423282A (en) * 2017-05-24 2017-12-01 南京大学 Semantic Coherence Sexual Themes and the concurrent extracting method of term vector in text based on composite character
CN107562792A (en) * 2017-07-31 2018-01-09 同济大学 A kind of question and answer matching process based on deep learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
GUANGXU XUN et al.: "Topic Discovery for Short Texts Using Word Embeddings", 2016 IEEE 16th International Conference on Data Mining *
胡朝举 et al.: "Sentiment analysis based on word vector technology and hybrid neural networks" (基于词向量技术和混合神经网络的情感分析), Application Research of Computers (计算机应用研究) *

Cited By (41)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111414483A (en) * 2019-01-04 2020-07-14 阿里巴巴集团控股有限公司 Document processing device and method
WO2020140633A1 (en) * 2019-01-04 2020-07-09 平安科技(深圳)有限公司 Text topic extraction method, apparatus, electronic device, and storage medium
CN109871532A (en) * 2019-01-04 2019-06-11 平安科技(深圳)有限公司 Text subject extracting method, device and storage medium
CN111414483B (en) * 2019-01-04 2023-03-28 阿里巴巴集团控股有限公司 Document processing device and method
CN109960802A (en) * 2019-03-19 2019-07-02 四川大学 The information processing method and device of narrative text are reported for aviation safety
CN109933804A (en) * 2019-03-27 2019-06-25 北京信息科技大学 Merge the keyword abstraction method of subject information and two-way LSTM
CN110334358A (en) * 2019-04-28 2019-10-15 厦门大学 A kind of phrase table dendrography learning method of context-aware
CN110083710B (en) * 2019-04-30 2021-04-02 北京工业大学 Word definition generation method based on cyclic neural network and latent variable structure
CN110083710A (en) * 2019-04-30 2019-08-02 北京工业大学 It is a kind of that generation method is defined based on Recognition with Recurrent Neural Network and the word of latent variable structure
CN110532395A (en) * 2019-05-13 2019-12-03 南京大学 A kind of method for building up of the term vector improved model based on semantic embedding
CN110532395B (en) * 2019-05-13 2021-09-28 南京大学 Semantic embedding-based word vector improvement model establishing method
CN110825848A (en) * 2019-06-10 2020-02-21 北京理工大学 Text classification method based on phrase vectors
CN110825848B (en) * 2019-06-10 2022-08-09 北京理工大学 Text classification method based on phrase vectors
CN110263343B (en) * 2019-06-24 2021-06-15 北京理工大学 Phrase vector-based keyword extraction method and system
CN110263343A (en) * 2019-06-24 2019-09-20 北京理工大学 The keyword abstraction method and system of phrase-based vector
CN110457674A (en) * 2019-06-25 2019-11-15 西安电子科技大学 A kind of text prediction method of theme guidance
CN110472047A (en) * 2019-07-15 2019-11-19 昆明理工大学 A kind of Chinese of multiple features fusion gets over news viewpoint sentence abstracting method
CN110472047B (en) * 2019-07-15 2022-12-13 昆明理工大学 Multi-feature fusion Chinese-Yue news viewpoint sentence extraction method
CN110378409A (en) * 2019-07-15 2019-10-25 昆明理工大学 It is a kind of based on element association attention mechanism the Chinese get over news documents abstraction generating method
CN110781256A (en) * 2019-08-30 2020-02-11 腾讯大地通途(北京)科技有限公司 Method and device for determining POI (Point of interest) matched with Wi-Fi (Wireless Fidelity) based on transmitted position data
CN110781256B (en) * 2019-08-30 2024-02-23 腾讯大地通途(北京)科技有限公司 Method and device for determining POI matched with Wi-Fi based on sending position data
CN110766073A (en) * 2019-10-22 2020-02-07 湖南科技大学 Mobile application classification method for strengthening topic attention mechanism
CN110766073B (en) * 2019-10-22 2023-10-27 湖南科技大学 Mobile application classification method for strengthening topic attention mechanism
CN111125434A (en) * 2019-11-26 2020-05-08 北京理工大学 Relation extraction method and system based on ensemble learning
CN111125434B (en) * 2019-11-26 2023-06-27 北京理工大学 Relation extraction method and system based on ensemble learning
WO2021155705A1 (en) * 2020-02-06 2021-08-12 支付宝(杭州)信息技术有限公司 Text prediction model training method and apparatus
CN111696624B (en) * 2020-06-08 2022-07-12 天津大学 DNA binding protein identification and function annotation deep learning method based on self-attention mechanism
CN111696624A (en) * 2020-06-08 2020-09-22 天津大学 DNA binding protein identification and function annotation deep learning method based on self-attention mechanism
CN111753540A (en) * 2020-06-24 2020-10-09 云南电网有限责任公司信息中心 Method and system for collecting text data to perform Natural Language Processing (NLP)
CN111753540B (en) * 2020-06-24 2023-04-07 云南电网有限责任公司信息中心 Method and system for collecting text data to perform Natural Language Processing (NLP)
CN112597311B (en) * 2020-12-28 2023-07-11 东方红卫星移动通信有限公司 Terminal information classification method and system based on low-orbit satellite communication
CN112597311A (en) * 2020-12-28 2021-04-02 东方红卫星移动通信有限公司 Terminal information classification method and system based on low-earth-orbit satellite communication
CN112632966A (en) * 2020-12-30 2021-04-09 绿盟科技集团股份有限公司 Alarm information marking method, device, medium and equipment
CN112685538B (en) * 2020-12-30 2022-10-14 北京理工大学 Text vector retrieval method combined with external knowledge
CN112632966B (en) * 2020-12-30 2023-07-21 绿盟科技集团股份有限公司 Alarm information marking method, device, medium and equipment
CN112685538A (en) * 2020-12-30 2021-04-20 北京理工大学 Text vector retrieval method combined with external knowledge
CN112699662B (en) * 2020-12-31 2022-08-16 太原理工大学 False information early detection method based on text structure algorithm
CN112699662A (en) * 2020-12-31 2021-04-23 太原理工大学 False information early detection method based on text structure algorithm
CN112966551A (en) * 2021-01-29 2021-06-15 湖南科技学院 Method and device for acquiring video frame description information and electronic equipment
CN115763167A (en) * 2022-11-22 2023-03-07 黄华集团有限公司 Solid cabinet breaker and control method thereof
CN115763167B (en) * 2022-11-22 2023-09-22 黄华集团有限公司 Solid cabinet circuit breaker and control method thereof

Also Published As

Publication number Publication date
CN108984526B (en) 2021-05-07

Similar Documents

Publication Publication Date Title
CN108984526A (en) A kind of document subject matter vector abstracting method based on deep learning
CN109472024B (en) Text classification method based on bidirectional circulation attention neural network
CN109241524B (en) Semantic analysis method and device, computer-readable storage medium and electronic equipment
CN108595632B (en) Hybrid neural network text classification method fusing abstract and main body characteristics
CN104834747B (en) Short text classification method based on convolutional neural networks
CN111241294B (en) Relationship extraction method of graph convolution network based on dependency analysis and keywords
CN111738003B (en) Named entity recognition model training method, named entity recognition method and medium
CN109753660B (en) LSTM-based winning bid web page named entity extraction method
CN110134946B (en) Machine reading understanding method for complex data
CN113011533A (en) Text classification method and device, computer equipment and storage medium
CN108376131A (en) Keyword abstraction method based on seq2seq deep neural network models
CN110717332B (en) News and case similarity calculation method based on asymmetric twin network
CN113392209B (en) Text clustering method based on artificial intelligence, related equipment and storage medium
CN110688862A (en) Mongolian-Chinese inter-translation method based on transfer learning
Cai et al. Intelligent question answering in restricted domains using deep learning and question pair matching
CN108733647B (en) Word vector generation method based on Gaussian distribution
CN111581967B (en) News theme event detection method combining LW2V with triple network
CN112069312B (en) Text classification method based on entity recognition and electronic device
CN110619121A (en) Entity relation extraction method based on improved depth residual error network and attention mechanism
CN113255366B (en) Aspect-level text emotion analysis method based on heterogeneous graph neural network
CN115392252A (en) Entity identification method integrating self-attention and hierarchical residual error memory network
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN114417851A (en) Emotion analysis method based on keyword weighted information
CN110569355B (en) Viewpoint target extraction and target emotion classification combined method and system based on word blocks
CN106610949A (en) Text feature extraction method based on semantic analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant