CN108984526A - Document topic vector extraction method based on deep learning - Google Patents

Document topic vector extraction method based on deep learning Download PDF

Info

Publication number
CN108984526A
CN108984526A
Authority
CN
China
Prior art keywords
vector
indicate
moment
context
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810748564.1A
Other languages
Chinese (zh)
Other versions
CN108984526B (en)
Inventor
高扬
黄河燕
陆池
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN201810748564.1A priority Critical patent/CN108984526B/en
Publication of CN108984526A publication Critical patent/CN108984526A/en
Application granted granted Critical
Publication of CN108984526B publication Critical patent/CN108984526B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/258Heading extraction; Automatic titling; Numbering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The present invention relates to a document topic vector extraction method based on deep learning, belonging to the technical field of natural language processing. The method extracts deep, local semantic information using a convolutional neural network, and learns sequential information using an LSTM model, so that the semantics of the vector are more comprehensive. It selects the implicit co-occurrence relation between context phrases and document topics, avoiding the shortcomings that sentence-based topic vector models exhibit on short texts. The CNN and LSTM models are organically combined through an attention mechanism, learning the deep semantics, sequential information, and salient information of the context and thereby constructing a more effective document-level topic vector extraction model.

Description

Document topic vector extraction method based on deep learning
Technical field
The present invention relates to a document topic vector extraction method based on deep learning, and belongs to the technical field of natural language processing.
Background technique
In today's era of big data, discovering the topics of massive internet text data is a research focus. A document topic vector essentially represents the deep semantics of a document, combining its topical and semantic content. Extracted document topic vectors can be widely used in natural language processing tasks, including public opinion analysis of social networks and new media, timely acquisition of hot news, and so on. Therefore, how to extract document topic vectors efficiently is an important research subject.
For text data, the topic is not necessarily embodied directly in specific word content, which makes mining the implicit topics of a text difficult: the topical meaning of a document must be extracted from the relationships among its words, sentences, and paragraphs, combined with the discourse structure of the document. In recent years, with the enrichment of statistical natural language processing methods and corpora, text topic modeling methods based on word-topic and document-topic relations have been proposed one after another. Their basic idea is to assume that the topic of each word and document obeys a statistical probability distribution; the topic distribution of each document is computed by training on corpus data, and documents are then clustered according to these topic distributions.
To correctly analyze the topic of each document, the conventional approach performs topic analysis on every word of the text. This approach, however, has one major drawback: the words that actually determine the topic of a text account for only a small portion of its words, so conventional methods analyze a large number of topic-irrelevant words. On the one hand, these irrelevant words cause a heavy computational burden; on the other hand, the extracted topics can be inaccurate, because the method cannot exploit the internal relatedness of the text to mine its deep semantics.
With the improvement of hardware performance and the continuous expansion of data scale, deep learning has been widely applied in many fields, significantly improving experimental results over previous baselines. With its elegant models and flexible architectures, deep learning has in recent years been used extensively in combination with word embedding and document embedding methods. Among all deep learning methods, CNN (Convolutional Neural Network) and the LSTM model (Long Short-Term Memory network) are the two most mainstream. In natural language processing tasks, text analysis methods based on CNN and LSTM models can effectively discover the latent semantic features of text, greatly aiding semantic analysis tasks such as automatic summarization, sentiment analysis, and machine translation.
Summary of the invention
The purpose of the invention is to overcome the deficiencies of the prior art and to solve the problem of how to mine the deep semantics of a text by exploiting its internal relatedness; to this end, a document topic vector extraction method based on deep learning is proposed. In modeling document topic vectors, the present invention focuses on analyzing the document's content, mining text features and the implicit correlations of the topic vector, so as to learn document topic vectors.
The core idea of the invention is as follows: the semantics of the context phrase are extracted with a CNN, and the extracted semantics are fed into an LSTM model; an attention mechanism extracts the importance of different positions and different words of the text, so that important information is retained. This also accomplishes the combination of the CNN and LSTM models, mines the internal associations within the context, and learns document topic vectors that carry deep semantics and salient information.
The method of the present invention is achieved through the following technical solutions.
A document topic vector extraction method based on deep learning, comprising the following steps:
Step 1: establish the following definitions:
Definition 1: document D, D = [w_1, w_2, ..., w_i, ..., w_n], where w_i denotes the i-th word of document D;
Definition 2: predicted word w_{d+1}, the target word to be learned;
Definition 3: window words, composed of several consecutively occurring words in the text; hidden internal associations exist between window words;
Definition 4: context phrase, the window words occurring before the position of the predicted word, with window length l; the context phrase is denoted w_{d-l}, w_{d-l+1}, ..., w_d;
Definition 5: document topic mapping matrix, learned by the LDA algorithm (Latent Dirichlet Allocation); each row represents the topic of one document;
Definition 6: N_d and doc_id, where N_d denotes the number of documents in the corpus and doc_id denotes the position of a document; each document corresponds to a unique doc_id, with 1 ≤ doc_id ≤ N_d.
Step 2: learn the semantic vector of the context phrase using a CNN.
Step 3: use the LSTM model to learn the semantics of the context phrase, obtaining the hidden-layer vectors h_{d-l}, h_{d-l+1}, ..., h_d.
Step 4: organically combine the CNN and LSTM models through an attention mechanism, obtaining the average value h̄' of the context phrase semantic vectors.
Step 5: by the method of logistic regression, predict the target word w_{d+1} using the average value h̄' of the context phrase semantic vectors and the document topic information, obtaining the prediction probability of the target word w_{d+1}.
Beneficial effects
Compared with the prior art, the document topic vector extraction method based on deep learning of the present invention has the following beneficial effects:
1. It extracts deep, local semantic information using a CNN;
2. It learns sequential information using the LSTM model, so that the semantics of the vector are more comprehensive;
3. It selects the implicit co-occurrence relation between context phrases and document topics, avoiding the shortcomings that sentence-based topic vector models exhibit on short texts;
4. It organically combines the CNN and LSTM models using an attention mechanism, learning the deep semantics, sequential information, and salient information of the context, and thereby constructing a more effective document-level topic vector extraction model.
Detailed description of the invention
Fig. 1 is a flowchart of the document topic vector extraction method based on deep learning of the present invention.
Specific embodiment
In order to make the objectives, technical solutions, and advantages of the present invention clearer, the extraction method of the present invention is further described below with reference to the accompanying drawing and an embodiment.
The basic implementation process of the document topic vector extraction method based on deep learning is as follows:
Step 1: establish the following definitions:
Definition 1: document D, D = [w_1, w_2, ..., w_i, ..., w_n], where w_i denotes the i-th word of document D;
Definition 2: predicted word w_{d+1}, the target word to be learned;
Definition 3: window words, composed of several consecutively occurring words in the text; hidden internal associations exist between window words;
Definition 4: context phrase (w_{d-l}, w_{d-l+1}, ..., w_d), the window words occurring before the position of the predicted word, with context phrase length l;
Definition 5: document topic mapping matrix, learned by the LDA algorithm; each row represents the topic of one document;
Definition 6: N_d and doc_id, where N_d denotes the number of documents in the corpus and doc_id denotes the position of a document; each document corresponds to a unique doc_id, with 1 ≤ doc_id ≤ N_d.
Step 2: learn the semantic vector Context of the context phrase using the CNN.
The specific implementation process is as follows:
Step 2.1: train the word vector matrix of document D using an algorithm such as word2vec; the word vector matrix has size n × m, where n is the length of the word vector matrix and m is its width;
Step 2.2: extract the word vector corresponding to each word of the context phrase from the word vector matrix obtained in step 2.1, obtaining the vector matrix M of the context phrase w_{d-l}, w_{d-l+1}, ..., w_d;
Step 2.3: compute the semantic vector Context of the context phrase using the CNN, specifically by operating on the vector matrix M obtained in step 2.2 with K convolution kernels of size C_l × C_m;
where K is the number of convolution kernels (K = 128 in this embodiment), C_l is the length of a convolution kernel with C_l = l, and C_m is the width of a convolution kernel with C_m = m.
The semantic vector Context of the context phrase is calculated by formula (1):

Context_k = Σ_{p=1}^{l} Σ_{q=1}^{m} c_{pq} · M_{pq} + b,  1 ≤ k ≤ K  (1)
Context = [Context_1, Context_2, ..., Context_K]

where Context_k is the k-th dimension of the semantic vector of the context phrase, l is the context phrase length, m is the width of the word vector matrix (i.e., the word vector dimension), d is the starting position of the first word of the context phrase, c_{pq} is the weight parameter at row p and column q of the convolution kernel, M_{pq} is the entry at row p and column q of the vector matrix M, and b is the bias parameter of the convolution kernel;
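For concreteness, the convolution of step 2.3 can be sketched as follows. This is a minimal illustration under the reconstruction of formula (1) above (a plain weighted sum with a bias and no activation), not the patented implementation itself; the kernel count K = 128 and Gaussian initialization come from the embodiment, and all other names are illustrative.

```python
import numpy as np

def context_semantic_vector(M, kernels, biases):
    """Compute the context-phrase semantic vector of formula (1).

    M       : (l, m) matrix of stacked word vectors of the context phrase
    kernels : (K, l, m) array, one C_l x C_m kernel per output dimension
    biases  : (K,) bias parameters
    Returns the K-dimensional vector Context.
    """
    # Context_k = sum_{p,q} c_pq * M_pq + b  (one scalar per kernel)
    return np.einsum('kpq,pq->k', kernels, M) + biases

l, m, K = 6, 128, 128                            # window length, word-vector dim, kernel count
rng = np.random.default_rng(0)
M = rng.normal(size=(l, m))                      # word vectors of the context phrase
kernels = rng.normal(scale=0.1, size=(K, l, m))  # Gaussian-initialized kernels
context = context_semantic_vector(M, kernels, np.zeros(K))
print(context.shape)                             # (128,)
```

Because each kernel has exactly the size of the context-phrase matrix (C_l = l, C_m = m), each kernel produces a single scalar, so K kernels together yield the K-dimensional vector Context.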
Step 3: use the LSTM model to learn the semantics of the context phrase, obtaining the hidden-layer vectors h_{d-l}, h_{d-l+1}, ..., h_d.
The specific implementation process is as follows:
Step 3.1: assign d-l to t, i.e. t = d-l, where t denotes the time step;
Step 3.2: assign the word vector of w_t to x_t, where x_t is the word vector input at time t and w_t is the word input at time t;
here the word vector of w_t is obtained by lookup in the word vector matrix output by step 2.1, i.e. the word vector at the position of w_t in the vector matrix M is extracted;
Step 3.3: take x_t as the input of the LSTM model and obtain the hidden-layer vector h_t of time t.
The specific implementation process is as follows:
Step 3.3.1: compute the forget gate f_t of time t, which controls what information is forgotten, by formula (2):

f_t = σ(W_f x_t + U_f h_{t-1} + b_f)  (2)

where W_f and U_f are parameter matrices, x_t is the word vector input at time t, h_{t-1} is the hidden-layer vector of time t-1, and b_f is a bias vector parameter; when t = d-l, h_{t-1} = h_{d-l-1}, and h_{d-l-1} is the zero vector; σ denotes the Sigmoid function, the activation function of the LSTM model;
Step 3.3.2: compute the input gate i_t of time t, which controls what new information is added at the current time, by formula (3):

i_t = σ(W_i x_t + U_i h_{t-1} + b_i)  (3)

where W_i and U_i are parameter matrices, x_t is the word vector input at time t, h_{t-1} is the hidden-layer vector of time t-1, and b_i is a bias vector parameter; σ denotes the Sigmoid function, the activation function of the LSTM model;
Step 3.3.3: compute the candidate information c̃_t updated at time t by formula (4):

c̃_t = tanh(W_c x_t + U_c h_{t-1} + b_c)  (4)

where W_c and U_c are parameter matrices, x_t is the word vector input at time t, h_{t-1} is the hidden-layer vector of time t-1, and b_c is a bias vector parameter; tanh denotes the hyperbolic tangent function, an activation function of the LSTM model;
Step 3.3.4: compute the information of time t by adding the information of the previous time to the information updated at the current time, by formula (5):

c_t = f_t ∘ c_{t-1} + i_t ∘ c̃_t  (5)

where c_t is the information of time t, f_t is the forget gate of time t, c_{t-1} is the information of time t-1, i_t is the input gate of time t, c̃_t is the information updated at time t, and ∘ denotes the element-wise product of vectors;
Step 3.3.5: compute the output gate o_t of time t, which controls the output information, by formula (6):

o_t = σ(W_o x_t + U_o h_{t-1} + b_o)  (6)

where W_o and U_o are parameter matrices, x_t is the word vector input at time t, h_{t-1} is the hidden-layer vector of time t-1, and b_o is a bias vector parameter; σ denotes the Sigmoid function, the activation function of the LSTM model; the parameter matrices W_f, U_f, W_i, U_i, W_c, U_c, W_o, U_o in steps 3.3.1-3.3.3 and 3.3.5 have different element values, and the bias vector parameters b_f, b_i, b_c, b_o likewise have different element values;
Step 3.3.6: compute the hidden-layer vector h_t of time t by formula (7):

h_t = o_t ∘ c_t  (7)

where o_t is the output gate of time t and c_t is the information of time t;
Step 3.4: judge whether t equals d; if not, add 1 to t and return to step 3.2; if so, output the hidden-layer vectors h_{d-l}, h_{d-l+1}, ..., h_d and proceed to step 4.
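The recurrence of steps 3.3.1-3.3.6 can be sketched per time step as below, following formulas (2)-(7) as reconstructed above. Note that formula (7) applies the output gate directly to c_t, without the tanh of the textbook LSTM; the sketch keeps that choice. Dimensions and initialization are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, P):
    """One step of formulas (2)-(7). P holds W_*, U_*, b_* for gates f, i, c, o."""
    f = sigmoid(P['Wf'] @ x_t + P['Uf'] @ h_prev + P['bf'])        # forget gate, (2)
    i = sigmoid(P['Wi'] @ x_t + P['Ui'] @ h_prev + P['bi'])        # input gate, (3)
    c_tilde = np.tanh(P['Wc'] @ x_t + P['Uc'] @ h_prev + P['bc'])  # updated info, (4)
    c = f * c_prev + i * c_tilde                                   # cell information, (5)
    o = sigmoid(P['Wo'] @ x_t + P['Uo'] @ h_prev + P['bo'])        # output gate, (6)
    h = o * c                                                      # hidden vector, (7)
    return h, c

dim, hid = 128, 128
rng = np.random.default_rng(1)
P = {f'{g}{n}': rng.normal(scale=0.1, size=(hid, dim if g == 'W' else hid))
     for g in 'WU' for n in 'fico'}
P.update({f'b{n}': np.zeros(hid) for n in 'fico'})

h, c = np.zeros(hid), np.zeros(hid)      # h_{d-l-1} is the zero vector
hidden = []
for x_t in rng.normal(size=(6, dim)):    # word vectors of w_{d-l}, ..., w_d
    h, c = lstm_step(x_t, h, c, P)
    hidden.append(h)                     # h_{d-l}, ..., h_d
```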
Step 4: combine the CNN and LSTM models using the attention mechanism, obtaining the average value h̄' of the context phrase semantic vectors. The specific implementation process is as follows:
Step 4.1: using the context phrase semantic vector obtained in step 2, obtain the importance factor α of each word on the semantic vector of the context phrase through the attention mechanism, calculated by formula (8):

α_t = exp(Context^T x_t) / Σ_{i=d-l}^{d} exp(Context^T x_i),  d-l ≤ t ≤ d  (8)
α = [α_{d-l}, α_{d-l+1}, ..., α_d]

where α_t is the importance factor of the word at time t on the semantic vector of the context phrase, Context is the semantic vector of the context phrase obtained in step 2, x_t is the word vector input at time t, x_i is the word vector input at time i, T denotes vector transposition, and exp denotes the exponential function with base e, the natural constant;
Step 4.2: compute the hidden-layer vectors h' weighted by the attention mechanism, calculated by formula (9):

h'_t = α_t * h_t,  d-l ≤ t ≤ d  (9)
h' = [h'_{d-l}, h'_{d-l+1}, ..., h'_d]

where h'_t is the weighted hidden-layer vector of time t, α_t is the importance factor of the word at time t on the semantic vector of the context phrase, and h_t is the hidden-layer vector of time t;
Step 4.3: using the mean-pooling operation, compute the average value h̄' of the context phrase semantic vectors by formula (10):

h̄' = (1 / (l+1)) Σ_{t=d-l}^{d} h'_t  (10)

where h'_t is the weighted hidden-layer vector of time t;
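Steps 4.1-4.3 amount to a softmax over dot products between the CNN semantic vector and the word vectors, followed by attention weighting and a mean. A minimal sketch, assuming the dot-product form of formula (8) reconstructed above (which requires the word vector dimension m to equal the kernel count K; both are 128 in the embodiment):

```python
import numpy as np

def attention_pool(context, X, H):
    """Formulas (8)-(10): attention weights, weighted hidden vectors, mean pooling.

    context : (K,) semantic vector from the CNN (step 2)
    X       : (l+1, m) word vectors x_{d-l}, ..., x_d
    H       : (l+1, hid) hidden vectors h_{d-l}, ..., h_d from the LSTM (step 3)
    """
    scores = X @ context                             # Context^T x_t for each t
    scores -= scores.max()                           # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()    # importance factors, (8)
    H_weighted = alpha[:, None] * H                  # h'_t = alpha_t * h_t, (9)
    return H_weighted.mean(axis=0)                   # mean-pooling, (10)

rng = np.random.default_rng(2)
pooled = attention_pool(rng.normal(size=128),
                        rng.normal(size=(6, 128)),
                        rng.normal(size=(6, 128)))
print(pooled.shape)                                  # (128,)
```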
Step 5: by the method of logistic regression, predict the target word w_{d+1} using the average value of the context phrase semantic vectors together with the document topic information, obtaining the prediction probability of the target word w_{d+1}. The specific implementation process is as follows:
Step 5.1: learn the document topic mapping matrix using the LDA algorithm, then map each document, according to the document topic mapping matrix and doc_id, to a vector D_z whose length equals the width of the word vector matrix in step 2.1;
Step 5.2: concatenate the vector D_z output by step 5.1 with the average value h̄' of the context phrase semantic vectors output by step 4, obtaining the concatenated vector V_d = [D_z; h̄'];
Step 5.3: use the V_d output by step 5.2 to predict the target word w_{d+1}, specifically by classification with the logistic regression method, with the objective function of formula (11):

P(y = w_{d+1} | V_d) = exp(θ_{d+1}^T V_d) / Σ_{i=1}^{|V|} exp(θ_i^T V_d)  (11)

where θ_{d+1} is the parameter corresponding to the position of the target word w_{d+1}, θ_i is the parameter corresponding to word w_i of the vocabulary, |V| is the size of the vocabulary, V_d is the concatenated vector obtained in step 5.2, exp denotes the exponential function with base e, Σ denotes summation, P denotes probability, y denotes the dependent variable, and T denotes matrix transposition;
Step 5.4: using the cross-entropy method, compute the loss function of objective function (11) by formula (12):

L = -log(P(y = w_{d+1} | V_d))  (12)

where w_{d+1} is the target word, V_d is the concatenated vector of step 5.2, and log(·) denotes the base-10 logarithm function;
The loss function (12) is updated and solved by the Sampled Softmax algorithm and the mini-batch stochastic gradient descent parameter update method, yielding the document topic vector.
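A sketch of the prediction and loss of steps 5.2-5.4, using a full softmax in place of Sampled Softmax for clarity; all variable names are illustrative:

```python
import numpy as np

def predict_loss(h_bar, D_z, theta, target_idx):
    """Formulas (11)-(12): concatenate, softmax over the vocabulary, cross-entropy.

    h_bar      : mean-pooled context vector from step 4
    D_z        : LDA-derived document topic vector from step 5.1
    theta      : (|V|, dim(V_d)) logistic-regression parameters
    target_idx : vocabulary index of the target word w_{d+1}
    """
    V_d = np.concatenate([D_z, h_bar])               # concatenated vector V_d (step 5.2)
    logits = theta @ V_d                             # theta_i^T V_d for every word
    logits -= logits.max()                           # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()    # formula (11)
    # base-10 log, as stated for formula (12)
    return probs[target_idx], -np.log10(probs[target_idx])

rng = np.random.default_rng(3)
p, loss = predict_loss(rng.normal(size=128), rng.normal(size=128),
                       rng.normal(scale=0.1, size=(20000, 256)), target_idx=7)
```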
Thus, from step 1 to step 5, the document topic vector with deep semantics and salient information is obtained.
Embodiment
This embodiment describes a specific implementation process of the invention, as shown in Fig. 1.
As can be seen from Fig. 1, the flow of the document topic vector extraction method based on deep learning of the present invention is as follows:
Step A: preprocessing. First, meaningless symbols in the corpus, such as special characters, are removed, and the text is then segmented into words. Word segmentation is the process of dividing a continuous word sequence into individual words according to set morphological rules, thereby decomposing sentences into several consecutive meaningful word strings for subsequent analysis. Segmentation is performed with the PTB tokenizer. After segmentation, a vocabulary is built from the original text; in this embodiment, the vocabulary consists of the 20000 words with the highest frequency in the training text, i.e. the size of vocabulary V is 20000. After the vocabulary is chosen, the vocabulary index data of the original corpus are constructed according to the vocabulary indices, and these index data serve as the input of the model.
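A minimal sketch of this preprocessing, with a crude regex cleanup and whitespace split standing in for the PTB tokenizer (an assumption; the patent names the tokenizer but not its invocation):

```python
from collections import Counter
import re

def build_vocab_and_index(texts, vocab_size=20000):
    """Strip special characters, tokenize, keep the top-N words, index the corpus."""
    tokenized = [re.sub(r'[^\w\s]', ' ', t).split() for t in texts]
    counts = Counter(w for doc in tokenized for w in doc)
    vocab = {w: i for i, (w, _) in enumerate(counts.most_common(vocab_size))}
    unk = len(vocab)                                   # index for out-of-vocabulary words
    indexed = [[vocab.get(w, unk) for w in doc] for doc in tokenized]
    return vocab, indexed

vocab, indexed = build_vocab_and_index(["a small example document", "another document"])
```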
Step B: learn word vectors using the word2vec algorithm. The words of the documents are input into the word2vec algorithm to obtain word vectors, with the objective function of formula (13):

L = Σ_{i=1}^{Corp} log P(w_i | w_{i-k}, ..., w_{i-1}, w_{i+1}, ..., w_{i+k})  (13)

where k is the window size, i indexes the current word, and Corp is the number of words in the corpus; 128-dimensional word vectors are learned by the gradient descent method.
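In practice this step maps directly onto an off-the-shelf word2vec implementation. A sketch using gensim — an assumed tool, since the patent names only the algorithm — with the 128-dimension setting from the embodiment:

```python
from gensim.models import Word2Vec

# `tokenized_docs` is the segmented corpus from step A: a list of token lists.
tokenized_docs = [["a", "small", "example", "document"], ["another", "document"]]
w2v = Word2Vec(sentences=tokenized_docs, vector_size=128, window=5,
               min_count=1, sg=0)    # CBOW, matching the reconstructed objective (13)
vec = w2v.wv["document"]             # a 128-dimensional word vector
```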
Step C: extract the context phrase semantic vector using the CNN, and learn the context phrase hidden-layer vectors using the LSTM;
these two computations — extracting the context phrase semantic vector with the CNN and learning the context phrase hidden-layer vectors with the LSTM — are carried out in parallel. Specific to this embodiment:
Extracting the context phrase semantic vector with the CNN: first, K convolution kernels of size C_l × C_m are randomly initialized from a Gaussian distribution; for a given context phrase w_{d-l}, w_{d-l+1}, ..., w_d, the word vectors obtained in step B map the phrase to a matrix of size l × m, where l is the length of the context phrase and m is the dimension of the word vectors; this matrix is convolved with the randomly initialized kernels, with the concrete operation shown in formula (1), yielding a vector Context, which is the semantic vector of the context phrase;
Learning the context phrase hidden-layer vectors with the LSTM: the word vectors corresponding to the context phrase w_{d-l}, w_{d-l+1}, ..., w_d are input into the LSTM model in sequence; each dimension of the hidden-layer vector h_0 of time 0 is set to 0, and then the forget gate, input gate, output gate, and finally the context phrase hidden-layer vectors are computed in turn using formulas (2)-(7), with the dimension size set to 128;
Step D: compute the weighted semantic vectors using the attention mechanism, and compute the document topic distribution;
these two computations — the attention-weighted semantic vectors and the document topic distribution — are carried out in parallel. Specific to this embodiment:
Computing the weighted semantic vectors with the attention mechanism: the attention operation is applied to the word vectors obtained in step B and the context phrase semantic vector obtained in step C, producing an attention factor α_t for each word of the context phrase; α_t is a real number between 0 and 1, and the larger it is, the more of the word vector information at its position is retained in the final mean-pooling layer. Its magnitude thus indicates the importance of the current word in characterizing the meaning of the whole phrase; in other words, more important words receive more attention;
Computing the document topic distribution: this is done with the LDA algorithm; document D is input into the LDA algorithm, yielding the topic distribution of each document D, which is taken directly as the final result and denoted D_z;
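The LDA step can likewise be sketched with gensim (again an assumed tool); the topic count of 128 is chosen here so that D_z matches the vector width of step 5.1, which is itself an illustrative reading:

```python
from gensim import corpora
from gensim.models import LdaModel

tokenized_docs = [["a", "small", "example", "document"], ["another", "document"]]
dictionary = corpora.Dictionary(tokenized_docs)            # vocabulary from step A
bow = [dictionary.doc2bow(doc) for doc in tokenized_docs]
lda = LdaModel(bow, num_topics=128, id2word=dictionary)
# Topic distribution of one document, as a list of (topic, probability) pairs.
D_z = lda.get_document_topics(bow[0], minimum_probability=0.0)
```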
Step E: predict the target word and learn the document topic vector. The weighted semantic vector and D_z are directly concatenated, the probability of the target word appearing is maximized, and the document topic vector is obtained through the Sampled Softmax algorithm and the mini-batch stochastic gradient descent parameter update method.
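The parameter update of step E can be sketched as one stochastic gradient step on loss (12); the full-softmax gradient below stands in for Sampled Softmax, the natural log stands in for the base-10 log (they differ only by a constant factor), and only θ is updated, for brevity:

```python
import numpy as np

def sgd_step(theta, V_d, target_idx, lr=0.1):
    """One mini-batch-of-one SGD update of the softmax parameters theta on loss (12)."""
    scores = theta @ V_d
    scores -= scores.max()                         # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()  # formula (11)
    grad = probs.copy()
    grad[target_idx] -= 1.0                        # d(-log P)/d(scores)
    theta -= lr * np.outer(grad, V_d)              # gradient step w.r.t. theta
    return theta
```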

Claims (5)

1. A document topic vector extraction method based on deep learning, characterized by comprising the following steps:
Step 1: establish the following definitions:
Definition 1: document D, D = [w_1, w_2, ..., w_i, ..., w_n], where w_i denotes the i-th word of document D;
Definition 2: predicted word w_{d+1}, the target word to be learned;
Definition 3: window words, composed of consecutively occurring words in the text; hidden internal associations exist between window words;
Definition 4: context phrase w_{d-l}, w_{d-l+1}, ..., w_d, the window words occurring before the position of the predicted word, with context phrase length l;
Definition 5: document topic mapping matrix, learned by the LDA algorithm; each row represents the topic of one document;
Definition 6: N_d and doc_id, where N_d denotes the number of documents in the corpus and doc_id denotes the position of a document; each document corresponds to a unique doc_id, with 1 ≤ doc_id ≤ N_d;
Step 2: learn the semantic vector of the context phrase using a convolutional neural network (CNN);
Step 3: use the long short-term memory network model (LSTM) to learn the semantics of the context phrase, obtaining the hidden-layer vectors h_{d-l}, h_{d-l+1}, ..., h_d; the specific implementation process is as follows:
Step 3.1: assign d-l to t, i.e. t = d-l, where t denotes the time step;
Step 3.2: assign the word vector of w_t to x_t, where x_t is the word vector input at time t and w_t is the word input at time t;
here the word vector of w_t is obtained by lookup in the word vector matrix output by step 2.1, i.e. the word vector at the position of w_t in the vector matrix M is extracted;
Step 3.3: take x_t as the input of the LSTM model and obtain the hidden-layer vector h_t of time t;
Step 3.4: judge whether t equals d; if not, add 1 to t and return to step 3.2; if so, output the hidden-layer vectors h_{d-l}, h_{d-l+1}, ..., h_d and proceed to step 4;
Step 4: organically combine the CNN and LSTM models through an attention mechanism, obtaining the average value h̄' of the context phrase semantic vectors; the concrete realization method is as follows:
Step 4.1: using the context phrase semantic vector obtained in step 2, obtain the importance factor α of each word on the semantic vector of the context phrase through the attention mechanism, calculated by the following formula:

α_t = exp(Context^T x_t) / Σ_{i=d-l}^{d} exp(Context^T x_i),  d-l ≤ t ≤ d
α = [α_{d-l}, α_{d-l+1}, ..., α_d]

where α_t is the importance factor of the word at time t on the semantic vector of the context phrase, Context is the semantic vector of the context phrase obtained in step 2, x_t is the word vector input at time t, x_i is the word vector input at time i, T denotes vector transposition, and exp denotes the exponential function with base e, the natural constant;
Step 4.2: compute the hidden-layer vectors h' weighted by the attention mechanism, calculated by the following formula:

h'_t = α_t * h_t,  d-l ≤ t ≤ d
h' = [h'_{d-l}, h'_{d-l+1}, ..., h'_d]

where h'_t is the weighted hidden-layer vector of time t, α_t is the importance factor of the word at time t on the semantic vector of the context phrase, and h_t is the hidden-layer vector of time t;
Step 4.3: using the mean-pooling operation, compute the average value h̄' of the context phrase semantic vectors by formula (10):

h̄' = (1 / (l+1)) Σ_{t=d-l}^{d} h'_t  (10)

where h'_t is the weighted hidden-layer vector of time t;
Step 5: by the method of logistic regression, predict the target word w_{d+1} using the average value h̄' of the context phrase semantic vectors and the document topic information, obtaining the prediction probability of the target word w_{d+1}.
2. The document topic vector extraction method based on deep learning according to claim 1, characterized in that the concrete realization method of step 2 is as follows:
Step 2.1: train the word vector matrix of document D; the word vector matrix has size n × m, where n is the length of the word vector matrix and m is its width;
Step 2.2: extract the word vector corresponding to each word of the context phrase from the word vector matrix obtained in step 2.1, obtaining the vector matrix M of the context phrase w_{d-l}, w_{d-l+1}, ..., w_d;
Step 2.3: compute the semantic vector Context of the context phrase using the CNN, specifically by operating on the vector matrix M obtained in step 2.2 with K convolution kernels of size C_l × C_m;
where K is the number of convolution kernels, C_l is the length of a convolution kernel with C_l = l, and C_m is the width of a convolution kernel with C_m = m;
the semantic vector Context of the context phrase is calculated by formula (1):

Context_k = Σ_{p=1}^{l} Σ_{q=1}^{m} c_{pq} · M_{pq} + b,  1 ≤ k ≤ K  (1)
Context = [Context_1, Context_2, ..., Context_K]

where Context_k is the k-th dimension of the context phrase semantic vector, l is the context phrase length, m is the width of the word vector matrix (i.e., the word vector dimension), d is the starting position of the first word of the context phrase, c_{pq} is the weight parameter at row p and column q of the convolution kernel, M_{pq} is the entry at row p and column q of the vector matrix M, and b is the bias parameter of the convolution kernel.
3. The document topic vector extraction method based on deep learning according to claim 1, characterized in that the concrete realization method of step 3.3 is as follows:
Step 3.3.1: compute the forget gate f_t of time t, which controls what information is forgotten, by formula (2):

f_t = σ(W_f x_t + U_f h_{t-1} + b_f)  (2)

where W_f and U_f are parameter matrices, x_t is the word vector input at time t, h_{t-1} is the hidden-layer vector of time t-1, and b_f is a bias vector parameter; when t = d-l, h_{t-1} = h_{d-l-1}, and h_{d-l-1} is the zero vector; σ denotes the Sigmoid function, the activation function of the LSTM model;
Step 3.3.2: compute the input gate i_t of time t, which controls what new information is added at the current time, by formula (3):

i_t = σ(W_i x_t + U_i h_{t-1} + b_i)  (3)

where W_i and U_i are parameter matrices, x_t is the word vector input at time t, h_{t-1} is the hidden-layer vector of time t-1, and b_i is a bias vector parameter; σ denotes the Sigmoid function, the activation function of the LSTM model;
Step 3.3.3: compute the candidate information c̃_t updated at time t by formula (4):

c̃_t = tanh(W_c x_t + U_c h_{t-1} + b_c)  (4)

where W_c and U_c are parameter matrices, x_t is the word vector input at time t, h_{t-1} is the hidden-layer vector of time t-1, and b_c is a bias vector parameter; tanh denotes the hyperbolic tangent function, an activation function of the LSTM model;
Step 3.3.4: compute the information of time t by adding the information of the previous time to the information updated at the current time, by formula (5):

c_t = f_t ∘ c_{t-1} + i_t ∘ c̃_t  (5)

where c_t is the information of time t, f_t is the forget gate of time t, c_{t-1} is the information of time t-1, i_t is the input gate of time t, c̃_t is the information updated at time t, and ∘ denotes the element-wise product of vectors;
Step 3.3.5: compute the output gate o_t of time t, which controls the output information, by formula (6):

o_t = σ(W_o x_t + U_o h_{t-1} + b_o)  (6)

where W_o and U_o are parameter matrices, x_t is the word vector input at time t, h_{t-1} is the hidden-layer vector of time t-1, and b_o is a bias vector parameter; σ denotes the Sigmoid function, the activation function of the LSTM model; the parameter matrices W_f, U_f, W_i, U_i, W_c, U_c, W_o, U_o have different element values, and the bias vector parameters b_f, b_i, b_c, b_o likewise have different element values;
Step 3.3.6: compute the hidden-layer vector h_t of time t by formula (7):

h_t = o_t ∘ c_t  (7)

where o_t is the output gate of time t and c_t is the information of time t.
4. The document topic vector extraction method based on deep learning according to claim 1, characterized in that the concrete realization method of step 5 is as follows:
Step 5.1: learn the document topic mapping matrix, then map each document, according to the document topic mapping matrix and doc_id, to a vector D_z whose length equals the width of the word vector matrix in step 2.1;
Step 5.2: concatenate the vector D_z output by step 5.1 with the average value h̄' of the context phrase semantic vectors output by step 4, obtaining the concatenated vector V_d = [D_z; h̄'];
Step 5.3: use the V_d output by step 5.2 to predict the target word w_{d+1};
Step 5.4: using the cross-entropy method, compute the loss function of objective function (11) by formula (12):

L = -log(P(y = w_{d+1} | V_d))  (12)

where w_{d+1} is the target word, V_d is the concatenated vector of step 5.2, and log(·) denotes the base-10 logarithm function;
the loss function (12) is updated and solved by the Sampled Softmax algorithm and the mini-batch stochastic gradient descent parameter update method, yielding the document topic vector.
5. The document topic vector extraction method based on deep learning according to claim 4, characterized in that in step 5.3, classification is performed by the method of logistic regression, with the objective function of formula (11):

P(y = w_{d+1} | V_d) = exp(θ_{d+1}^T V_d) / Σ_{i=1}^{|V|} exp(θ_i^T V_d)  (11)

where θ_{d+1} is the parameter corresponding to the position of the target word w_{d+1}, θ_i is the parameter corresponding to word w_i of the vocabulary, |V| is the size of the vocabulary, V_d is the concatenated vector obtained in step 5.2, exp denotes the exponential function with base e, Σ denotes summation, P denotes probability, y denotes the dependent variable, and T denotes matrix transposition.
CN201810748564.1A 2018-07-10 2018-07-10 Document theme vector extraction method based on deep learning Active CN108984526B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810748564.1A CN108984526B (en) 2018-07-10 2018-07-10 Document theme vector extraction method based on deep learning

Publications (2)

Publication Number Publication Date
CN108984526A true CN108984526A (en) 2018-12-11
CN108984526B CN108984526B (en) 2021-05-07

Family

ID=64536620

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810748564.1A Active CN108984526B (en) 2018-07-10 2018-07-10 Document theme vector extraction method based on deep learning

Country Status (1)

Country Link
CN (1) CN108984526B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105069143A (en) * 2015-08-19 2015-11-18 百度在线网络技术(北京)有限公司 Method and device for extracting keywords from document
CN106547735A (en) * 2016-10-25 2017-03-29 复旦大学 The structure and using method of the dynamic word or word vector based on the context-aware of deep learning
CN106909537A (en) * 2017-02-07 2017-06-30 中山大学 A kind of polysemy analysis method based on topic model and vector space
CN106919557A (en) * 2017-02-22 2017-07-04 中山大学 A kind of document vector generation method of combination topic model
CN107092596A (en) * 2017-04-24 2017-08-25 重庆邮电大学 Text emotion analysis method based on attention CNNs and CCR
CN107423282A (en) * 2017-05-24 2017-12-01 南京大学 Semantic Coherence Sexual Themes and the concurrent extracting method of term vector in text based on composite character
CN107562792A (en) * 2017-07-31 2018-01-09 同济大学 A kind of question and answer matching process based on deep learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
GUANGXU XUN et al.: "Topic Discovery for Short Texts Using Word Embeddings", 2016 IEEE 16th International Conference on Data Mining *
胡朝举 et al.: "Sentiment analysis based on word vector technology and hybrid neural networks" (基于词向量技术和混合神经网络的情感分析), Application Research of Computers (计算机应用研究) *

Cited By (41)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111414483A (en) * 2019-01-04 2020-07-14 阿里巴巴集团控股有限公司 Document processing device and method
WO2020140633A1 (en) * 2019-01-04 2020-07-09 平安科技(深圳)有限公司 Text topic extraction method, apparatus, electronic device, and storage medium
CN109871532A (en) * 2019-01-04 2019-06-11 平安科技(深圳)有限公司 Text subject extracting method, device and storage medium
CN111414483B (en) * 2019-01-04 2023-03-28 阿里巴巴集团控股有限公司 Document processing device and method
CN109960802A (en) * 2019-03-19 2019-07-02 四川大学 The information processing method and device of narrative text are reported for aviation safety
CN109933804A (en) * 2019-03-27 2019-06-25 北京信息科技大学 Merge the keyword abstraction method of subject information and two-way LSTM
CN110334358A (en) * 2019-04-28 2019-10-15 厦门大学 A kind of phrase table dendrography learning method of context-aware
CN110083710B (en) * 2019-04-30 2021-04-02 北京工业大学 Word definition generation method based on cyclic neural network and latent variable structure
CN110083710A (en) * 2019-04-30 2019-08-02 北京工业大学 It is a kind of that generation method is defined based on Recognition with Recurrent Neural Network and the word of latent variable structure
CN110532395A (en) * 2019-05-13 2019-12-03 南京大学 A kind of method for building up of the term vector improved model based on semantic embedding
CN110532395B (en) * 2019-05-13 2021-09-28 南京大学 Semantic embedding-based word vector improvement model establishing method
CN110825848A (en) * 2019-06-10 2020-02-21 北京理工大学 Text classification method based on phrase vectors
CN110825848B (en) * 2019-06-10 2022-08-09 北京理工大学 Text classification method based on phrase vectors
CN110263343B (en) * 2019-06-24 2021-06-15 北京理工大学 Phrase vector-based keyword extraction method and system
CN110263343A (en) * 2019-06-24 2019-09-20 北京理工大学 The keyword abstraction method and system of phrase-based vector
CN110457674A (en) * 2019-06-25 2019-11-15 西安电子科技大学 A kind of text prediction method of theme guidance
CN110472047A (en) * 2019-07-15 2019-11-19 昆明理工大学 A kind of Chinese of multiple features fusion gets over news viewpoint sentence abstracting method
CN110472047B (en) * 2019-07-15 2022-12-13 昆明理工大学 Multi-feature fusion Chinese-Yue news viewpoint sentence extraction method
CN110378409A (en) * 2019-07-15 2019-10-25 昆明理工大学 It is a kind of based on element association attention mechanism the Chinese get over news documents abstraction generating method
CN110781256A (en) * 2019-08-30 2020-02-11 腾讯大地通途(北京)科技有限公司 Method and device for determining POI (Point of interest) matched with Wi-Fi (Wireless Fidelity) based on transmitted position data
CN110781256B (en) * 2019-08-30 2024-02-23 腾讯大地通途(北京)科技有限公司 Method and device for determining POI matched with Wi-Fi based on sending position data
CN110766073A (en) * 2019-10-22 2020-02-07 湖南科技大学 Mobile application classification method for strengthening topic attention mechanism
CN110766073B (en) * 2019-10-22 2023-10-27 湖南科技大学 Mobile application classification method for strengthening topic attention mechanism
CN111125434A (en) * 2019-11-26 2020-05-08 北京理工大学 Relation extraction method and system based on ensemble learning
CN111125434B (en) * 2019-11-26 2023-06-27 北京理工大学 Relation extraction method and system based on ensemble learning
WO2021155705A1 (en) * 2020-02-06 2021-08-12 支付宝(杭州)信息技术有限公司 Text prediction model training method and apparatus
CN111696624B (en) * 2020-06-08 2022-07-12 天津大学 DNA binding protein identification and function annotation deep learning method based on self-attention mechanism
CN111696624A (en) * 2020-06-08 2020-09-22 天津大学 DNA binding protein identification and function annotation deep learning method based on self-attention mechanism
CN111753540A (en) * 2020-06-24 2020-10-09 云南电网有限责任公司信息中心 Method and system for collecting text data to perform Natural Language Processing (NLP)
CN111753540B (en) * 2020-06-24 2023-04-07 云南电网有限责任公司信息中心 Method and system for collecting text data to perform Natural Language Processing (NLP)
CN112597311B (en) * 2020-12-28 2023-07-11 东方红卫星移动通信有限公司 Terminal information classification method and system based on low-orbit satellite communication
CN112597311A (en) * 2020-12-28 2021-04-02 东方红卫星移动通信有限公司 Terminal information classification method and system based on low-earth-orbit satellite communication
CN112632966A (en) * 2020-12-30 2021-04-09 绿盟科技集团股份有限公司 Alarm information marking method, device, medium and equipment
CN112685538B (en) * 2020-12-30 2022-10-14 北京理工大学 Text vector retrieval method combined with external knowledge
CN112632966B (en) * 2020-12-30 2023-07-21 绿盟科技集团股份有限公司 Alarm information marking method, device, medium and equipment
CN112685538A (en) * 2020-12-30 2021-04-20 北京理工大学 Text vector retrieval method combined with external knowledge
CN112699662B (en) * 2020-12-31 2022-08-16 太原理工大学 False information early detection method based on text structure algorithm
CN112699662A (en) * 2020-12-31 2021-04-23 太原理工大学 False information early detection method based on text structure algorithm
CN112966551A (en) * 2021-01-29 2021-06-15 湖南科技学院 Method and device for acquiring video frame description information and electronic equipment
CN115763167A (en) * 2022-11-22 2023-03-07 黄华集团有限公司 Solid cabinet breaker and control method thereof
CN115763167B (en) * 2022-11-22 2023-09-22 黄华集团有限公司 Solid cabinet circuit breaker and control method thereof

Also Published As

Publication number Publication date
CN108984526B (en) 2021-05-07

Similar Documents

Publication Publication Date Title
CN108984526A (en) A kind of document subject matter vector abstracting method based on deep learning
CN109472024B (en) Text classification method based on bidirectional circulation attention neural network
CN109241524B (en) Semantic analysis method and device, computer-readable storage medium and electronic equipment
CN108595632B (en) Hybrid neural network text classification method fusing abstract and main body characteristics
CN104834747B (en) Short text classification method based on convolutional neural networks
CN111241294B (en) Relationship extraction method of graph convolution network based on dependency analysis and keywords
CN111738003B (en) Named entity recognition model training method, named entity recognition method and medium
CN109753660B (en) LSTM-based winning bid web page named entity extraction method
CN110134946B (en) Machine reading understanding method for complex data
CN113011533A (en) Text classification method and device, computer equipment and storage medium
CN108376131A (en) Keyword abstraction method based on seq2seq deep neural network models
CN110717332B (en) News and case similarity calculation method based on asymmetric twin network
CN113392209B (en) Text clustering method based on artificial intelligence, related equipment and storage medium
CN110688862A (en) Mongolian-Chinese inter-translation method based on transfer learning
Cai et al. Intelligent question answering in restricted domains using deep learning and question pair matching
CN108733647B (en) Word vector generation method based on Gaussian distribution
CN111581967B (en) News theme event detection method combining LW2V with triple network
CN112069312B (en) Text classification method based on entity recognition and electronic device
CN110619121A (en) Entity relation extraction method based on improved depth residual error network and attention mechanism
CN113255366B (en) Aspect-level text emotion analysis method based on heterogeneous graph neural network
CN115392252A (en) Entity identification method integrating self-attention and hierarchical residual error memory network
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN114417851A (en) Emotion analysis method based on keyword weighted information
CN110569355B (en) Viewpoint target extraction and target emotion classification combined method and system based on word blocks
CN106610949A (en) Text feature extraction method based on semantic analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant