CN109241377A - Text document representation method and device based on deep-learning topic information enhancement - Google Patents

Text document representation method and device based on deep-learning topic information enhancement

Info

Publication number
CN109241377A
Authority
CN
China
Prior art keywords
topic
document
vector
information
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810999545.6A
Other languages
Chinese (zh)
Other versions
CN109241377B (en)
Inventor
张文跃
王素格
李德玉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanxi University
Original Assignee
Shanxi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanxi University filed Critical Shanxi University
Priority to CN201810999545.6A
Publication of CN109241377A
Application granted
Publication of CN109241377B
Legal status: Active
Anticipated expiration


Abstract

The invention discloses a text document representation method and device based on deep-learning topic information enhancement. The method includes: S1, performing data preprocessing on corpus documents in text form; S2, designing a text sequence layer that embeds the word-order contextual information of each word in the document into its representation vector; S3, transitioning the sequence elements to higher-level topic information through an attention layer; S4, generating, in the topic layer, the representation of the current document D along all topic directions; S5, constraining the degree of similarity among all topic representations; S6, fusing the topic representation vectors in the representation layer into the semantic representation vector Rep of document D; S7, updating the parameters that produce Rep through a classifier and an objective function. The method can efficiently embed text-sequence contextual semantic information and latent topic information into document representation vectors, and these topic-enhanced representation vectors can significantly improve the performance of the text-mining models that use them.

Description

Text document representation method and device based on deep-learning topic information enhancement
Technical field
The present invention relates to the field of computer text representation learning, and in particular to a text document representation method based on deep-learning topic information enhancement and a text document representation device based on deep-learning topic information enhancement.
Background art
Grasping text at the document level and as a whole is an important requirement of many text-processing tasks. At present, this problem is generally addressed through text representation learning. Document-level text representation learning aims to construct a method that converts a text document, according to its semantic information, into a representation vector that can be computed directly. Concretely, a document in text form is represented as a fixed-length real-valued vector that carries its semantics. Document representation learning has become a basic and widely used technique in fields such as natural language processing, text mining and information extraction.
The most widely used document representation learning methods fall roughly into three categories, each with its own shortcomings: (1) methods based on the "bag-of-words" (BoW) model, also called the "vector space model"; the representation vectors generated by this class of models are sparse and non-real-valued, and such vectors are often ineffective in downstream applications; (2) methods based on semantic analysis, such as the "probabilistic latent semantic analysis" model and the "LDA topic model"; these models ignore the contextual information carried by word order in the text, which limits the semantic capacity of the representation vectors; (3) long short-term memory models (LSTM) based on recurrent neural networks, which are widely used to generate distributed representation vectors of text documents; however, a plain LSTM may not be sufficient to capture the corpus-wide, global topic information.
The shortcomings of the above methods reveal the difficulties currently faced by document representation learning: when a model focuses on corpus-wide topic information, the contextual information inside a document is often lost (for example, without context it cannot be determined whether the word "apple" refers to the fruit or to the technology company), whereas when a model focuses on such local information, the global topic information (the correlation between documents) is ignored. Furthermore, without a constraint mechanism between topics, the topics tend to become similar, which degrades model performance (for example, a topic set such as "economy", "entertainment", "tank", "warship" is redundant, since the latter two overlap heavily). All these defects cause the document representation vectors to lack certain semantic information, which in turn limits their effectiveness in other applications.
Summary of the invention
The present invention aims to solve at least one of the technical problems in the related art.
To this end, one object of the present invention is to provide a text document representation method based on deep-learning topic information enhancement, which enables a text document to generate a dense, real-valued representation vector containing both word-order contextual information and topic information.
Another object of the present invention is to provide a text document representation device based on deep-learning topic information enhancement.
To achieve the above objects, an embodiment of one aspect of the present invention provides a text document representation method based on deep-learning topic information enhancement, comprising the following steps:
S1, performing data preprocessing operations of cleaning, extraction, conversion and arrangement on a document D = {w1, w2, ..., wn} composed of n words in a corpus containing K topics, to obtain the word-vector matrix D = {x1, x2, ..., xn} of the document;
S2, constructing a text sequence layer using the sequential relation between words, and running a sequential long short-term memory model to obtain the latent semantic matrix Hs = {h1, h2, ..., hn} of the document, where hi = f1(xi, hi-1), h0 = f1(x0), and f1 is a neural-network node operation;
S3, generating the corresponding attention intensity matrix A = {a1, a2, ..., an} from the latent semantic matrix Hs = {h1, h2, ..., hn}, transposing A and normalizing it by rows to obtain the attention weight matrix A*, where ai = f2(hi) and f2 is a conversion function;
S4, fusing the latent semantic matrix Hs and the attention weight matrix A* to obtain the topic mapping matrix representation VTs of the document over all topics, VTs = f3(Hs, A*), where f3 is a conversion function;
S5, constraining the degree of similarity of the topic mapping matrix VTs using label information across documents, to obtain the topic-information-enhanced mapping matrix VTk;
S6, fusing VTk to obtain the semantic representation vector Rep of document D, where Rep = f4(VTk) and f4 is a fusion function;
S7, classifying Rep with a topic classifier, deriving an error index from the classification accuracy and the topic similarity index, and updating the model parameters of steps S1-S6 by gradient descent on the objective function.
According to the text document representation method based on deep-learning topic information enhancement proposed by the embodiments of the present invention, word-embedding technology is first used to convert the words in text form into word vectors, so that the document becomes a real-valued matrix; a text sequence layer is then built according to the sequential nature of contextual semantic information in text. After passing through the sequence layer, the real-valued matrix of the document becomes a latent semantic matrix carrying contextual semantic information. The corresponding attention weight matrix is then computed from the latent semantic matrix, and the fusion of the two enhances the topic information at a coarser granularity. A topic similarity constraint mechanism then keeps the topics as distinguishable from one another as possible, yielding the representations of all topics of the document. Finally, the representations of all topics are fused into the topic-information-enhanced representation vector of the document. As a result, the text document obtains a dense, real-valued representation vector that contains both word-order contextual information and topic information, with reduced topic redundancy.
To achieve the above objects, an embodiment of another aspect of the present invention provides a text document representation device based on deep-learning topic information enhancement, comprising a text sequence layer, an attention layer, a topic layer and a representation layer. The text sequence layer is used for performing data preprocessing operations of cleaning, extraction, conversion and arrangement on a document D = {w1, w2, ..., wn} composed of n words in a corpus containing K topics, obtaining the word-vector matrix D = {x1, x2, ..., xn} of the document, and passing the word-vector matrix through a sequential long short-term memory model to obtain the latent semantic matrix Hs = {h1, h2, ..., hn} of the document, where hi = f1(xi, hi-1), h0 = f1(x0), and f1 is a neural-network node operation. The attention layer is used for extracting and enhancing the topic information in the text; it connects the two granularities of word level and topic level and realizes the function of extracting unknown information from known information: it generates the corresponding attention intensity matrix A = {a1, a2, ..., an} from the latent semantic matrix Hs = {h1, h2, ..., hn}, transposes A and normalizes it by rows to obtain the attention weight matrix A*, where ai = f2(hi) and f2 is a conversion function, and fuses the latent semantic matrix Hs with the attention weight matrix A*. The topic layer is used for obtaining the mapping matrix representation VTs of all topics of the document, VTs = f3(Hs, A*), where f3 is a conversion function, and for constraining the degree of similarity of VTs using label information across documents, obtaining the topic-information-enhanced mapping matrix VTk. The representation layer is used for fusing VTk to obtain the semantic representation vector Rep of document D, where Rep = f4(VTk) and f4 is a fusion function, for classifying Rep with a topic classifier, deriving an error index from the classification accuracy and the topic similarity index, and updating the model parameters by gradient descent on the objective function.
According to the text document representation device based on deep-learning topic information enhancement proposed by the embodiments of the present invention, word-embedding technology is first used to convert the words in text form into word vectors, so that the document becomes a real-valued matrix; a text sequence layer is then built according to the sequential nature of contextual semantic information in text. After passing through the sequence layer, the real-valued matrix of the document becomes a latent semantic matrix carrying contextual semantic information. The attention layer then computes the corresponding attention weight matrix from the latent semantic matrix, and the fusion of the two enhances the topic information at a coarser granularity. The topic layer uses a topic similarity constraint mechanism to keep the topics as distinguishable from one another as possible, yielding the representations of all topics of the document. Finally, the representations of all topics are fused into the topic-information-enhanced representation vector of the document. As a result, the text document obtains a dense, real-valued representation vector that contains both word-order contextual information and topic information, with reduced topic redundancy.
Compared with the prior art, the present invention has the following advantages:
1. A sequential LSTM model is used to model the word sequence of the text, enabling the model to better fuse the contextual information of the text;
2. A new extraction-type attention mechanism supports the processing of a "sequence-to-tree" structure and is used to extract topic information from the text sequence information. Moreover, it not only embeds the "word-topic" association information of the text into the representation vector, but can also explicitly return the support of each word in the document for the different topics, which can be displayed and inspected as a visualization result;
3. The similarity constraint mechanism introduced in the topic layer alleviates the "long-tail effect" of traditional topic models, in which some topics are so similar to one another that the model degenerates. At the same time, the homogenization problem faced by general attention mechanisms is also resolved: homogenization is caused by there being too few variables in the attention computation, which makes all topic attention weight distributions tend to become identical, and the similarity constraint mechanism adds variables to this computation;
4. The invention is composed of multiple specialized sub-models. As a whole, the model can not only encode the local contextual semantic information within a document, but can also enhance the corpus-level, global latent topic semantic information and embed it into the final document representation vector;
5. The innovation of the invention lies in designing multiple novel sub-models for different kinds of semantic information and combining them into a deep-learning model for document representation learning. The most important innovations are the design of the "sequence-to-tree" attention mechanism and the topic information similarity constraint mechanism. Experiments on different data sets show that the document representation vectors generated by the present invention outperform other classical baseline models in the three major text-mining tasks of text classification, topic detection and text clustering, demonstrating that the present invention can genuinely improve the quality of text representation vectors.
Detailed description of the invention
Fig. 1 is the overall hierarchical structure framework diagram of the present invention.
Fig. 2 is the structure diagram of the attention layer described in steps S3-S4.
Fig. 3 is a schematic diagram of the topic similarity constraint mechanism in step S5.
Fig. 4A compares the results, in the classification experiment, of document representation vectors generated by various algorithms.
Fig. 4B visualizes the correlation between topic difference and document classification accuracy.
Fig. 5 visualizes the effect of the present invention in the topic detection task.
Fig. 6 compares the present invention with classical algorithms in the text clustering task.
Fig. 7 is the flow chart of the text document representation method based on deep-learning topic information enhancement of the present invention.
Specific embodiment
In this embodiment, the experiments of the text document representation method based on deep-learning topic information enhancement of the present invention are completed on the cluster computer of the School of Computer and Information Technology, Shanxi University. The cluster consists of five high-performance computers serving as compute and management nodes, connected by Gigabit Ethernet and a 2.5G InfiniBand network. Each node is configured with an eight-core CPU (Intel Xeon E3-1230 V5, 3.4 GHz base frequency) and 128 GB of memory, and is equipped with two NVIDIA GTX 1080 high-performance graphics cards, allowing large-scale matrix operations and deep-learning model training.
As can be seen from Figs. 1-7, the present invention is divided into several sub-models that handle different kinds of semantic information; they are connected in sequence and finally fused. The learning process mainly comprises the following steps:
S1, performing data preprocessing operations of cleaning, extraction, conversion and arrangement on a document D = {w1, w2, ..., wn} composed of n words in a corpus containing K topics, to obtain the word-vector matrix D = {x1, x2, ..., xn} of the document. The specific steps include:
S11, extracting and cleaning all text data: English data require tokenization, stemming and similar processing, while Chinese data require Chinese word segmentation. Stop words in the data are removed, and documents with too few words (fewer than 6 words) are deleted.
S12, converting all words in the corpus into word vectors using a Word2Vec word-vector model pre-trained on a large corpus. Words that are too rare (absent from the word-vector model) are discarded.
S13, obtaining the labels of the training corpus. There are K labels in total, corresponding to the K topics, and each topic in turn corresponds to a unique one-hot vector used in the supervised learning process. These label vectors, paired with their preprocessed document data, constitute the experimental data.
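As an illustration of steps S11-S13, the following is a minimal preprocessing sketch in Python; the plain-dictionary lookup standing in for the pre-trained Word2Vec model, the helper name and the function signature are assumptions made for illustration and are not fixed by the patent.

```python
import numpy as np

def preprocess(tokens, word2vec, topic_id, num_topics, min_len=6):
    """Steps S11-S13: filter, embed and label one document.

    tokens    : list of word strings after segmentation / stop-word removal
    word2vec  : dict mapping word -> pre-trained embedding vector (np.ndarray)
    topic_id  : index of the document's topic label (0 .. num_topics-1)
    """
    # S12: keep only words present in the pre-trained word-vector model
    kept = [w for w in tokens if w in word2vec]
    # S11: discard documents that are too short after cleaning (< 6 words)
    if len(kept) < min_len:
        return None
    # Word-vector matrix D = {x1, ..., xn}, one row per word
    doc_matrix = np.stack([word2vec[w] for w in kept])
    # S13: one-hot topic label vector used in the supervised learning process
    label = np.zeros(num_topics)
    label[topic_id] = 1.0
    return doc_matrix, label
```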
S2, extracting the latent contextual semantics. The present invention constructs a text sequence layer using the sequential relation between words and designs a sequential long short-term memory model (seq-LSTM), which embeds the word-order contextual information of each word in the document into its representation vector. The specific steps include:
S21, calculating the state of each LSTM gating unit. The LSTM gating units play a controlling role during computation and are modulated flexibly according to the input information; they are divided into the input gate, the output gate and the forget gate, which respectively control the input, the output and the adjustment of historical information of the deep-learning node. The specific calculation is as follows:
It = σ(Wseq·[ht-1, xt] + Bseq)
Ft = σ(Wseq·[ht-1, xt] + Bseq)
Ot = σ(Wseq·[ht-1, xt] + Bseq)
Gt = tanh(Wseq·[ht-1, xt] + Bseq)
where I, F, O and G are the input gate, output gate, forget gate and node-information state respectively, σ denotes the sigmoid activation function, tanh is the hyperbolic tangent function, Wseq and Bseq are respectively the weight matrices and bias vectors of the deep-learning neural network (a separate pair for each gate), and the subscript seq indicates that the parameters belong to the text sequence layer. As the formulas show, every gate state is computed from the historical information and the current word-vector input;
S22, calculating the LSTM hidden state. In the long short-term memory model, the hidden state is the module that stores history and other information. The formula is as follows:
Ct = It·Gt + Ft·Ct-1
where C represents the hidden node state corresponding to a word. It can be seen that this hidden state is influenced by the node-information state and by the historical hidden state, which are regulated by the input gate and the forget gate respectively; this regulation is realized by element-wise multiplication between vectors. In short, the hidden state of the current word balances the current input against the historical state according to the semantic information;
S23, calculating the LSTM node state. After the hidden state corresponding to the current word of the document is obtained, the hidden state is activated to obtain the latent contextual semantic state corresponding to that word:
ht = Ot·tanh(Ct)
As the equation shows, the hyperbolic tangent function is chosen as the activation function, and the activation value, after being regulated by the output gate, is used as the node state in subsequent computation;
S24, recording the results of the text sequence layer. The document D = {x1, x2, ..., xn} passes through the text sequence layer and generates the corresponding semantic state matrix Hs = {h1, h2, ..., hn} and hidden state matrix Cs = {C1, C2, ..., Cn}; these two matrices contain the contextual semantic information within document D. For example, the word vector of "cry" is the same in "cry for joy" and "cry from sadness", but after the sequence layer the representation vectors (node states h) of the two occurrences of "cry" differ because their contexts differ.
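The following numpy sketch walks through one step of the sequential LSTM described in S21-S23 and its application over a whole document in S24; concatenating ht-1 with xt and using one weight matrix per gate are assumptions of this sketch rather than details fixed by the patent.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def seq_lstm_step(x_t, h_prev, c_prev, W, b):
    """One seq-LSTM step (S21-S23); W and b hold one matrix/vector per gate."""
    z = np.concatenate([h_prev, x_t])       # gates depend on history + current input
    i_t = sigmoid(W['I'] @ z + b['I'])      # input gate
    f_t = sigmoid(W['F'] @ z + b['F'])      # forget gate
    o_t = sigmoid(W['O'] @ z + b['O'])      # output gate
    g_t = np.tanh(W['G'] @ z + b['G'])      # node information state
    c_t = i_t * g_t + f_t * c_prev          # S22: Ct = It*Gt + Ft*Ct-1
    h_t = o_t * np.tanh(c_t)                # S23: ht = Ot*tanh(Ct)
    return h_t, c_t

def run_sequence_layer(doc_matrix, W, b, hidden_dim):
    """S24: produce the semantic state matrix Hs and hidden state matrix Cs."""
    h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
    Hs, Cs = [], []
    for x_t in doc_matrix:
        h, c = seq_lstm_step(x_t, h, c, W, b)
        Hs.append(h)
        Cs.append(c)
    return np.stack(Hs), np.stack(Cs)
```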
S3. In order to enhance the topic information within the contextual semantic information of the document, the sequence elements must be transitioned to higher-level topic information. The present invention proposes a new extraction-type attention mechanism constructed on top of the text sequence layer, as shown in Fig. 2. Previous attention mechanisms usually connect two sequential structures, whereas the present invention connects a sequence with tree nodes: each sequence element represents a position in the word sequence of the document, and each tree node represents a topic. Moreover, in a general attention mechanism both structures are known information, while the extraction-type mechanism of the present invention extracts latent information (the topics) from the known information. The specific steps are as follows:
S31, obtaining the attention intensity. The attention intensity is computed from the contextual semantic information of the document: each semantic state ht is mapped through the attention-layer parameters to an intensity vector at, where Watt and batt are respectively the weight matrix and bias-vector parameter of the attention layer, and at is a K-dimensional vector whose dimensions represent the attention intensity of the t-th word of the document toward the corresponding topics.
S32, calculating the attention weight matrix. The attention intensity matrix A = {a1, a2, ..., an} obtained after step S31 is an n × K matrix; it is first transposed into a K × n matrix, so that each of its rows indicates the attention (expression) intensity of the current document's text sequence with respect to a certain topic. For example, for the word "apple" at a certain position in a certain document: to what degree it expresses topic 1, to what degree it expresses topic 2, and so on (unlike previous attention mechanisms, the concrete content of the topics does not need to be specified here and may even be unknown).
This intensity distribution is then normalized row by row into probability form with the softmax function, and the normalized attention weight matrix A* is recorded.
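A minimal sketch of S31-S32 follows. The patent only fixes that at = f2(ht) is a K-dimensional conversion computed with Watt and batt; the tanh activation chosen here is an assumption, while the transpose and row-wise softmax follow the description above.

```python
import numpy as np

def attention_weights(Hs, W_att, b_att):
    """Steps S31-S32: extraction-type attention from sequence states to K topics.

    Hs    : (n, d) semantic state matrix from the sequence layer
    W_att : (d, K) attention parameters, b_att : (K,) bias
    """
    A = np.tanh(Hs @ W_att + b_att)          # S31: intensity a_t per word (assumed tanh)
    A = A.T                                  # S32: one row per topic over the word sequence
    A = A - A.max(axis=1, keepdims=True)     # numerical stability for softmax
    expA = np.exp(A)
    return expA / expA.sum(axis=1, keepdims=True)   # attention weight matrix A*, (K, n)
```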
S4. In the topic layer, the attention weight matrix A* produced by the attention layer and the contextual semantic information Hs of the document produced by the text sequence layer are fused. The semantic information Hs is combined according to the corresponding attention weights; since the weights reflect how strongly each representation vector expresses each topic, the latent topic information contained in the original semantic information is thereby enhanced and emphasized. This finally produces the representation of the current document D along all topic directions, which can also be regarded as the mapping of its semantic information onto all topic spaces (intuitively, what an article about Apple Inc. looks like from the perspective of different topics such as "technology", "economy" or "politics"). As shown in Figs. 1 and 2, the model has K topic representation nodes VTs, corresponding to all topics in the corpus, and VTCs is the hidden state generated for VTs; since LSTM-type deep-learning nodes are used, each topic node takes as input the attention-weighted sum of the semantic states, i.e. the corresponding row of A*·Hs. Both VTs and VTCs have K rows; each row vector is the information representation vector of the corresponding topic and also corresponds to one LSTM-type node. Thus the topic representation of document D is obtained by summing all of its contextual semantic information weighted by its expression intensity toward each topic.
Here, according to the respective characteristics of the global topic information of the text and the local contextual semantic information, multiple sub-structures are designed and stacked into a composite, which is then used as a whole to learn the semantic representation of the document. Such a design allows each type of semantic information to be handled by a dedicated module; because the different kinds of information differ greatly, the modules cannot simply be stacked for integration, and for this reason the extraction-type attention mechanism is designed to bridge the semantic module and the topic module.
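As a sketch of the topic-layer fusion in S4: each topic node receives the attention-weighted sum of the sequence states and passes it through an LSTM-type node. Reusing the seq_lstm_step function from the sequence-layer sketch above with a zero initial state is an assumption made here for illustration.

```python
import numpy as np

def topic_layer(Hs, A_star, W, b, hidden_dim):
    """Step S4: map the document onto K topic directions.

    Hs     : (n, d) semantic state matrix
    A_star : (K, n) row-normalized attention weight matrix
    W, b   : LSTM-type gate parameters shared by the K topic nodes
    """
    topic_inputs = A_star @ Hs                       # (K, d): weighted sum per topic
    h0, c0 = np.zeros(hidden_dim), np.zeros(hidden_dim)
    VTs, VTCs = [], []
    for k in range(topic_inputs.shape[0]):
        vt_k, vtc_k = seq_lstm_step(topic_inputs[k], h0, c0, W, b)
        VTs.append(vt_k)                             # topic representation vector VT_k
        VTCs.append(vtc_k)                           # its hidden state VTC_k
    return np.stack(VTs), np.stack(VTCs)             # each of shape (K, hidden_dim)
```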
S5, constraining the degree of similarity among all VT. As described above, the topic representations generated by previous models may tend to converge; for example, a corpus may contain a "military" topic, but the model decomposes it into topics such as "weapons" and "army" while other topics that should appear are forced to merge. This situation is common in corpora where the numbers of documents per topic differ greatly. In the model of the present invention, this problem manifests itself as the VT representation vectors being mathematically too close to one another, so that the information of the K topics is significantly incomplete and the model performance degrades. Therefore, in the topic layer, the present invention designs a unique topic information similarity constraint mechanism, as shown in Fig. 3. Here L is a topic label vector of length K in "one-hot" form (one position close to 1, the remaining positions close to 0). The basic principle of the constraint mechanism is to make the comparison vector v generated from each topic representation vector gradually approach its label L through training; since the L vectors are highly orthogonal to one another, the topic information representation vectors are thereby also driven apart to a large degree of difference. The specific implementation steps of the similarity constraint mechanism of the present invention are as follows:
S51, topic representation vector conversion. The dimensions of the topic representation vectors VT and VTC need not equal K, in which case they cannot be compared mathematically with the topic label L; their length is therefore first converted: each topic's representation is mapped through the parameters Ws and Bs and the sigmoid activation to a comparison vector vk, where Ws and Bs are the weight matrix parameter and bias matrix parameter of the topic information similarity constraint mechanism, σ is again the sigmoid activation function, the length of the comparison vector vk is K, and each document has K comparison vectors, each corresponding to one topic;
S52, similarity measurement. The present invention uses the cross-entropy between a comparison vector and its topic label vector as the similarity measure sk. The smaller the value of sk, the more similar the comparison vector vk and the topic label vector Lk, which in turn shows that the topic information vectors VTk and VTCk that generate vk differ more strongly from the other topic vectors;
S53, topic similarity score calculation. After the similarity scores of all topics are obtained, they are averaged to obtain the topic information similarity comprehensive score S. The smaller the value of S, the smaller the similarity among the topic information representation vectors, the smaller the topic information redundancy, and the more comprehensive the topic information in the document representation vectors generated by the present invention may be. In the training stage, the present invention minimizes S through error back-propagation of the objective function and parameter updates.
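A sketch of S51-S53 under stated assumptions: the comparison vector vk is formed here from the concatenation of VTk and VTCk, and the similarity measure is written as the standard cross-entropy between the one-hot label Lk and vk; both readings are illustrative rather than fixed by the patent text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def topic_similarity_score(VTs, VTCs, labels, W_s, b_s, eps=1e-8):
    """Steps S51-S53: topic information similarity constraint.

    VTs, VTCs : (K, d) topic representation vectors and their hidden states
    labels    : (K, K) one-hot topic label vectors L, mutually orthogonal
    W_s, b_s  : parameters mapping each topic to a K-dimensional comparison vector
    """
    scores = []
    for k in range(VTs.shape[0]):
        # S51: convert the topic representation into the comparison vector v_k
        v_k = sigmoid(W_s @ np.concatenate([VTs[k], VTCs[k]]) + b_s)
        # S52: cross-entropy between v_k and the one-hot label L_k
        scores.append(-np.sum(labels[k] * np.log(v_k + eps)))
    # S53: comprehensive similarity score S, minimized during training
    return float(np.mean(scores))
```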
S6, fusing the topic representation vectors in the representation layer into the semantic representation vector Rep of document D. Step S5 yields K topic information representation vectors; in the representation layer, the present invention uses a tree-shaped LSTM model that treats these topic representation vectors as the leaf nodes of a tree and the final document representation vector Rep as the parent node, obtained from the child nodes through LSTM-type operations, so that the semantic information converges in the parent node. The specific steps are as follows:
S61, tree-shaped LSTM gating unit state computation. First, the input gate, output gate and node-information state of the tree-shaped LSTM parent node are computed; the algorithm differs slightly from the preceding sequential part in that the gates are computed from the aggregated data of the K representation vectors, with Wtr and Btr denoting the weight matrix and bias matrix of the tree-shaped representation layer. A single set of gating units is generated after the data of the K representation vectors are aggregated, with no differentiation between topics, because all the enhanced topic information is merged into the final state vectors I, O and G;
S62, special forget-gate state computation. Unlike the remaining gating units, in the tree of the present invention the forget gate plays the role of controlling the flow of information from a child node to the parent node; therefore every child node owns its own forget gate, and the forget-gate computations of the different nodes are mutually independent (to preserve the independence between topic information). Accordingly, the forget-gate state of the k-th topic child node is computed individually from the semantic information contained in that topic's representation vector;
S63, hidden state computation. In a sequential structure the hidden state of an LSTM node stores historical information, whereas in the tree-shaped structure the hidden state of the parent node stores information coming from the child nodes; as mentioned in step S62, this child-node information reaches the parent node under the control of the respective forget gates. When the parent node computes its hidden state, it combines the regulated information of these child nodes;
S64, document representation vector generation. In this step, the hidden state of the parent node is first passed through the activation function and the output gate to obtain the node state vector, and the representation vector Rep of the current document is finally obtained through a dimension-adjustment layer. The specific calculation is as follows:
h = O·tanh(C)
Rep = σ(Wr·h + br)
where Wr and br are parameters of the deep-learning neural network. Since the required length of the document representation vector may differ from the deep-learning hidden-layer dimension, the present invention adds an additional vector-length adjustment operation.
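The following sketch assembles S61-S64. Summing the K topic vectors as the shared gate input and feeding each child's own VTk into its forget gate are assumptions of this sketch; the description above only fixes the roles of the gates and the per-child independence of the forget gates. The parent state is assumed to have the same dimension d as the topic vectors.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def representation_layer(VTs, VTCs, W_tr, b_tr, W_r, b_r):
    """Steps S61-S64: fuse K topic vectors into the document vector Rep.

    VTs, VTCs  : (K, d) topic representation vectors and their hidden states
    W_tr, b_tr : tree-LSTM gate parameters, keys 'I', 'O', 'G', 'F', each (d, d) / (d,)
    W_r, b_r   : final dimension-adjustment parameters
    """
    agg = VTs.sum(axis=0)                            # S61: aggregated topic information
    i_gate = sigmoid(W_tr['I'] @ agg + b_tr['I'])
    o_gate = sigmoid(W_tr['O'] @ agg + b_tr['O'])
    g_state = np.tanh(W_tr['G'] @ agg + b_tr['G'])
    # S62-S63: one independent forget gate per topic child; the parent hidden
    # state combines the child hidden states regulated by their forget gates
    c_parent = i_gate * g_state
    for k in range(VTs.shape[0]):
        f_k = sigmoid(W_tr['F'] @ VTs[k] + b_tr['F'])
        c_parent = c_parent + f_k * VTCs[k]
    # S64: node state through the output gate, then dimension adjustment
    h_parent = o_gate * np.tanh(c_parent)
    return sigmoid(W_r @ h_parent + b_r)             # document representation Rep
```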
S7, classifier layer and objective function. In order to train the model of the present invention, after the semantic representation vectors of the documents are obtained, these vectors are classified by a topic classifier and the classification accuracy is recorded; together with the topic similarity index this yields the overall error index of the current model on document D. The error is then propagated back through the deep-learning model, and the model parameters of the present invention are updated by gradient descent on the objective function. In the objective function of the present invention, the parameter λ balances classification precision against topic difference, g is the topic category label of document D, and p is the classification result produced by the classifier from the document representation Rep.
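As an illustration of S7, the sketch below writes the training objective as the classification cross-entropy between the label g and the softmax prediction p plus λ times the topic similarity score S from step S53; combining the two terms by simple addition is an assumption, since the text only states that λ balances classification precision against topic difference.

```python
import numpy as np

def objective(rep, g_onehot, W_cls, b_cls, topic_similarity_S, lam=0.2, eps=1e-8):
    """Step S7: topic classifier plus objective function.

    rep                : document representation vector Rep
    g_onehot           : one-hot topic label g of document D
    W_cls, b_cls       : softmax classifier parameters
    topic_similarity_S : comprehensive similarity score S from step S53
    lam                : trade-off parameter lambda (0.2 in the experiments below)
    """
    logits = W_cls @ rep + b_cls
    logits = logits - logits.max()                   # numerical stability
    p = np.exp(logits) / np.exp(logits).sum()        # classifier prediction p
    class_loss = -np.sum(g_onehot * np.log(p + eps)) # classification error on g
    return class_loss + lam * topic_similarity_S     # minimized by gradient descent
```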
Text representation vectors produced by a good representation learning method contain more, and more accurate, semantic information, so natural language processing tasks that use these vectors perform better. The present invention therefore tests the generated document representation vectors on the three most widely used tasks: text classification, topic detection and text clustering.
Fig. 4A and Fig. 4B show the experimental performance of the document representation vectors generated by the present invention in topic classification; they are the classification-precision experiment and the topic information similarity validity experiment respectively. To verify the classification performance of the representation vectors, three text corpora were used; 90% of the documents of each corpus were used for training and the rest for testing. The word-vector dimension, deep-learning hidden-layer dimension and representation-vector dimension were chosen as 50, 100 and 50 respectively; the objective-function parameter λ = 0.2, the initial learning rate was 0.1, and the learning method was Adagrad. Referring to Fig. 4A, on almost all corpora the accuracy of the present invention (TE-LSTM) is better than that of the other classical comparison algorithms, and the result with the topic information similarity constraint mechanism (with SC) is better than the result without it (without SC). This shows that the representation learning method proposed by the present invention increases the amount of semantic information in the representation vectors, and that the topic information similarity constraint mechanism clearly plays a positive role. In Fig. 4B, the abscissa represents the degree of difference between topic information (a larger value indicates lower topic information similarity), and the ordinate represents the classification accuracy of the documents within that difference interval. The curve of Fig. 4B shows that as the difference between topic information grows, the classification accuracy of the representation vectors also gradually rises, which again demonstrates the validity of the topic information similarity constraint mechanism of the present invention: it reduces the topic information redundancy of the model and improves the information representation capability of the vectors.
Fig. 5 shows the performance of the document representation vectors generated by the present invention in the topic detection task. The leftmost column of the table gives the model names: lda2vec, the present invention without the topic information constraint mechanism, and the present invention with the constraint mechanism. The second column gives the topic labels in the corpus; 4 of the 20 topics are listed. The third column gives the topic keywords detected from the corpus: for each topic these are the 5 words with the highest criticality computed by the model, where in the present invention the criticality of a word is its attention weight toward the topic. The values in the last column are the topic coherence computed by the online platform Palmetto from the 5 keywords; a higher score indicates that the semantics of these keywords are closer and that they are more likely to originate from the same topic. The analysis of the figure shows that the present invention achieves clearly better experimental results from both the qualitative and the quantitative perspective, and, as in the classification experiment, the model using the topic information similarity constraint mechanism performs better, again demonstrating that all the designs of the present invention improve the quality of the representation vectors.
Fig. 6 shows the performance of the document semantic representation vectors generated by the present invention in the text clustering task. Representation learning is the task of converting text-form data into representation vectors that can be computed directly, and the semantic information of the text can generally be reflected intuitively through such computation; for example, in word vectors, the closer the meanings of two words, the smaller the vector distance between them. Similarly, the degree of correlation between documents can be judged by computing the distance between their representation vectors: the better the quality of the representation vectors, the more consistent the correlation between documents is with the vector distance. The text clustering task is therefore used to test the performance of the vectors generated by the present invention: the more the documents belonging to the same topic are gathered into the same cluster, the better the performance of the vectors and the better the representation learning of the present invention. The values in Fig. 6 are computed as follows: for each cluster, the topic with the largest document share is identified; if that topic has no corresponding cluster yet, its document share is recorded; if the topic already has a corresponding cluster, the topic with the next-highest share is selected until a topic that has not yet been assigned is found; after all clusters have corresponding topics, the average of the recorded document shares of all these topics is taken as the text clustering score of the model. Referring to Fig. 6, the representation vectors of the present invention achieve the best clustering effect, and the model using the topic information similarity constraint mechanism obtains the highest score, proving that the present invention can generate document semantic representation vectors of better quality.
In conclusion, for a document D = {w1, w2, ..., wn} composed of n words in a corpus containing K topics, the present invention adopts the following technical solution:
The word-vector matrix D = {x1, x2, ..., xn} of the document is obtained by pre-training. In the context sequence, each word has the corresponding latent semantics hi = f1(xi, hi-1), h0 = f1(x0), where f1 is a conversion function. In this way, even the same word has different latent semantics in different contexts (i.e. the same word at different positions in the text has different representation vectors), and this difference is precisely the proof that contextual semantic information has been incorporated. In addition, f1 in the formula can be a neural-network node operation.
Regarding topic information acquisition: the corresponding attention intensity matrix A = {a1, a2, ..., an} is generated from the latent semantic matrix H = {h1, h2, ..., hn} of the document, where ai = f2(hi) is a K-dimensional vector whose dimensions represent the attention intensity (or "expression intensity") of the i-th word in the sequence toward each topic, and f2 is a conversion function. Finally, A is transposed and normalized by rows to obtain the attention weight matrix A*.
Regarding topic information enhancement: the contextual semantic information of the document and the attention weights are combined to generate the mapping matrix (VT) of the document over all topics, VT = f3(H, A*), where f3 is a conversion function; each row of VT corresponds to one topic and represents the information of that topic contained in document D. After this part, the topic information of the document has been individually enhanced.
Regarding topic information control: the label information across the whole corpus is used to constrain the document topic information obtained in the previous stage. Each topic has its fixed label vector L; for example, Li is used to constrain VTi, in a manner similar to how supervision information controls a neural-network classifier, with L as the supervision information. Since the label vectors of the different topics are highly orthogonal to one another, the topic information constrained by these labels naturally also becomes highly different.
The topic-enhanced semantic information is fused into the document representation vector. The topics, which have no connection with one another, are merged back into one representation vector, forming a typical tree. Unlike the common weighted-combination approach, the absence of weights requires that all topic vectors be merged in a more integrative way. If this fusion is f4 and the representation vector of document D is Rep, then Rep = f4(VT). During training, a classifier is placed on top of Rep and trained with the category vectors of the documents, and the model is updated by error back-propagation and gradient descent.
As a result, the text document obtains a dense, real-valued representation vector that contains both word-order contextual information and topic information, with reduced topic redundancy.
In the description of the present invention, it should be understood that terms such as "center", "longitudinal", "transverse", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise", "counterclockwise", "axial", "radial" and "circumferential" indicate orientations or positional relationships based on those shown in the drawings; they are used merely to facilitate and simplify the description of the present invention and do not indicate or imply that the devices or elements referred to must have a specific orientation or be constructed and operated in a specific orientation, and therefore shall not be construed as limiting the present invention.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and shall not be understood as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "plurality" means at least two, for example two or three, unless otherwise specifically defined.
In the present invention, unless otherwise expressly specified and limited, the terms "mounted", "connected", "coupled", "fixed" and the like shall be understood in a broad sense; for example, a connection may be a fixed connection, a detachable connection or an integral connection; it may be a mechanical connection or an electrical connection; it may be a direct connection, an indirect connection through an intermediate medium, an internal communication between two elements, or an interaction relationship between two elements, unless otherwise expressly limited. For those of ordinary skill in the art, the specific meanings of the above terms in the present invention can be understood according to the specific situation.
In the present invention, unless otherwise expressly specified and limited, a first feature being "on" or "under" a second feature may mean that the first and second features are in direct contact, or that they are in indirect contact through an intermediate medium. Moreover, a first feature being "on", "above" or "over" a second feature may mean that the first feature is directly above or obliquely above the second feature, or may merely mean that the level of the first feature is higher than that of the second feature.
In the description of this specification, references to the terms "one embodiment", "some embodiments", "example", "specific example", "some examples" and the like mean that specific features, structures, materials or characteristics described in connection with the embodiment or example are included in at least one embodiment or example of the present invention. In this specification, the schematic expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials or characteristics described may be combined in a suitable manner in any one or more embodiments or examples. In addition, those skilled in the art may combine the different embodiments or examples, and the features of the different embodiments or examples, described in this specification, provided they do not contradict one another.
The description given herein in connection with the drawings and the specific embodiments is merely intended to help understand the method and core idea of the present invention. The method of the present invention is not limited to the embodiments described in the specific implementation; other embodiments obtained by those skilled in the art according to the method and idea of the present invention also fall within the scope of technical innovation of the present invention. The content of this specification shall not be construed as limiting the present invention.

Claims (9)

1. A text document representation method based on deep-learning topic information enhancement, characterized by comprising the following steps:
S1, performing data preprocessing operations of cleaning, extraction, conversion and arrangement on a document D = {w1, w2, ..., wn} composed of n words in a corpus containing K topics, to obtain the word-vector matrix D = {x1, x2, ..., xn} of the document;
S2, constructing a text sequence layer using the sequential relation between words, and running a sequential long short-term memory model to obtain the latent semantic matrix Hs = {h1, h2, ..., hn} of the document, where hi = f1(xi, hi-1), h0 = f1(x0), and f1 is a neural-network node operation;
S3, generating the corresponding attention intensity matrix A = {a1, a2, ..., an} from the latent semantic matrix Hs = {h1, h2, ..., hn}, transposing A and normalizing it by rows to obtain the attention weight matrix A*, where ai = f2(hi) and f2 is a conversion function;
S4, fusing the latent semantic matrix Hs and the attention weight matrix A* to obtain the topic mapping matrix representation VTs of the document over all topics, VTs = f3(Hs, A*), where f3 is a conversion function;
S5, constraining the degree of similarity of the topic mapping matrix VTs using label information across documents, to obtain the topic-information-enhanced mapping matrix VTk;
S6, fusing VTk to obtain the semantic representation vector Rep of document D, where Rep = f4(VTk) and f4 is a fusion function;
S7, classifying Rep with a topic classifier, deriving an error index from the classification accuracy and the topic similarity index, and updating the model parameters of steps S1-S6 by gradient descent on the objective function.
2. The text document representation method based on deep-learning topic information enhancement according to claim 1, characterized in that S1 comprises the following steps:
S11, extracting and cleaning all text data: English data are tokenized and stemmed, Chinese data undergo Chinese word segmentation, stop words in the text data are removed, and documents with fewer than six words are deleted;
S12, converting all words in the corpus into word vectors using a Word2Vec word-vector model pre-trained on a large corpus.
3. The text document representation method based on deep-learning topic information enhancement according to claim 1, characterized in that S2 comprises the following steps:
S21, running the sequential long short-term memory model, i.e. the LSTM model, in which the input gate It, output gate Ot, forget gate Ft and node-information state Gt are computed from the previous hidden state and the current word-vector input, where σ denotes the sigmoid activation function, tanh is the hyperbolic tangent function, Wseq is the weight matrix of the deep-learning neural network, Bseq is the bias vector of the deep-learning neural network, and the subscript seq indicates that the parameters belong to the text sequence layer;
S22, calculating the hidden state Ct corresponding to the current word of the document according to the LSTM model, as follows,
Ct = It·Gt + Ft·Ct-1
S23, activating the hidden state Ct according to the LSTM model and the hidden state Ct corresponding to the current word of the document, to obtain the latent contextual semantic state corresponding to that word, as follows,
ht = Ot·tanh(Ct)
S24, recording the results of the text sequence layer: the document D = {x1, x2, ..., xn} passes through the text sequence layer and generates the corresponding semantic state matrix Hs = {h1, h2, ..., hn} and the hidden state matrix Cs = {C1, C2, ..., Cn}, these two matrices containing the contextual semantic information within document D.
4. The text document representation method based on deep-learning topic information enhancement according to claim 1, characterized in that S3 comprises the following steps:
S31, obtaining the attention intensity at from the contextual semantic information of document D, where at is a K-dimensional vector representing the attention intensity of the t-th word of the document toward the corresponding topics, and Watt and batt are respectively the weight matrix and bias-vector parameter of the attention layer;
S32, calculating the attention weight matrix: the attention intensity matrix A = {a1, a2, ..., an} obtained after step S31 is an n × K matrix, which is first transposed into a K × n matrix; this intensity distribution is then normalized into probability form by the softmax function, and the normalized attention weight matrix A* is recorded.
5. The text document representation method based on deep-learning topic information enhancement according to claim 1, characterized in that S4 comprises the following steps:
fusing the latent semantic matrix Hs and the attention weight matrix A* to obtain the mapping matrix representation of the current document D over all topics, where VTs corresponds to all K topics in the corpus and VTCs is the hidden state corresponding to VTs, both VTs and VTCs having K rows, each row vector being the information representation vector of the corresponding topic.
6. The text document representation method based on deep-learning topic information enhancement according to claim 1, characterized in that S5 comprises the following steps:
S51, topic representation vector conversion: the dimensions of the topic representation vectors VT and VTC need not equal K, so their length is first converted, each topic being mapped through the parameters Ws and Bs and the sigmoid activation σ to a comparison vector vk of length K, where Ws and Bs are the weight matrix parameter and bias matrix parameter of the topic information similarity constraint mechanism, and each document has K comparison vectors, each corresponding to one topic;
S52, similarity measurement: the cross-entropy between a comparison vector and its topic label vector is used as the similarity measure sk; the smaller the value of sk, the more similar the comparison vector vk and the topic label vector Lk, which proves that the topic information vectors VTk and VTCk generating vk differ more strongly from the other topic vectors, where L is a topic label vector of length K in one-hot form; the training corpus has K topic labels corresponding to the K topics, each topic corresponds to a unique one-hot vector used in the supervised learning process, and these label vectors, paired with their preprocessed document data, constitute the experimental data;
S53, topic similarity score calculation: after the similarity scores of all topics are obtained, they are averaged to obtain the topic information similarity comprehensive score S; the smaller the value of S, the smaller the topic information redundancy and the more comprehensive the topic information in the generated document representation vectors; in the training stage, S is minimized through error back-propagation of the objective function and parameter updates.
7. The text document representation method based on deep-learning topic information enhancement according to claim 1, characterized in that S6 comprises the following steps:
S61, tree-shaped LSTM gating unit state computation: the input gate, output gate and node-information state of the tree-shaped LSTM parent node are first computed from the aggregated data of the K representation vectors, with Wtr and Btr denoting the weight matrix and bias matrix of the tree-shaped representation layer; a single set of gating units is generated, with no differentiation between topics, because all the enhanced topic information is merged into the final state vectors I, O and G;
S62, special forget-gate state computation: unlike the remaining gating units, in the tree-shaped LSTM model structure every child node owns its own forget gate, and the forget-gate computations of the different nodes are mutually independent, the forget gate playing the role of controlling the flow of information from the child node to the parent node, and the forget-gate state of the k-th topic child node being computed individually;
S63, hidden state computation: in the tree-shaped LSTM model structure the hidden state of the parent node stores the information coming from the child nodes, and when the parent node computes its hidden state it combines the child-node information regulated by the forget gates;
S64, document representation vector generation: the hidden state of the parent node is first passed through the activation function and the output gate to obtain the node state vector, and the representation vector Rep of the current document is finally obtained through a dimension-adjustment layer, calculated as follows:
h = O·tanh(C)
Rep = σ(Wr·h + br)
where Wr and br are parameters of the deep-learning neural network.
8. The text document representation method based on deep-learning topic information enhancement according to claim 1, characterized in that S7 comprises the following steps:
setting the classifier and the objective function: the semantic representation vector Rep of the document is classified by the topic classifier and the classification result is recorded; together with the topic similarity index this yields the systematic error index of the current document D; the error is then propagated back through the deep-learning model, and the model parameters are updated by gradient descent on the objective function, in which the parameter λ balances classification precision against topic difference, g is the topic category label of document D, and p is the classification result.
9. A text document representation device based on deep-learning topic information enhancement, characterized by comprising:
a text sequence layer, configured to perform data preprocessing operations of cleaning, extraction, conversion and arrangement on a document D = {w1, w2, ..., wn} composed of n words in a corpus containing K topics, to obtain the word-vector matrix D = {x1, x2, ..., xn} of the document, and to pass the word-vector matrix D = {x1, x2, ..., xn} through a sequential long short-term memory model to obtain the latent semantic matrix Hs = {h1, h2, ..., hn} of the document, where hi = f1(xi, hi-1), h0 = f1(x0), and f1 is a neural-network node operation;
an attention layer, configured to extract and enhance the topic information in the text, connecting the two granularities of word level and topic level and realizing the function of extracting unknown information from known information: it generates the corresponding attention intensity matrix A = {a1, a2, ..., an} from the latent semantic matrix Hs = {h1, h2, ..., hn}, transposes A and normalizes it by rows to obtain the attention weight matrix A*, where ai = f2(hi) and f2 is a conversion function, and fuses the latent semantic matrix Hs with the attention weight matrix A*;
a topic layer, configured to obtain the mapping matrix representation VTs of all topics of the document, VTs = f3(Hs, A*), where f3 is a conversion function, and to constrain the degree of similarity of the topic mapping matrix VTs using label information across documents, obtaining the topic-information-enhanced mapping matrix VTk;
a representation layer, configured to fuse VTk to obtain the semantic representation vector Rep of document D, where Rep = f4(VTk) and f4 is a fusion function, to classify Rep with a topic classifier, to derive an error index from the classification accuracy and the topic similarity index, and to update the model parameters by gradient descent on the objective function.
CN201810999545.6A 2018-08-30 2018-08-30 Text document representation method and device based on deep learning topic information enhancement Active CN109241377B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810999545.6A CN109241377B (en) 2018-08-30 2018-08-30 Text document representation method and device based on deep learning topic information enhancement

Publications (2)

Publication Number Publication Date
CN109241377A true CN109241377A (en) 2019-01-18
CN109241377B CN109241377B (en) 2021-04-23

Family

ID=65069456

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810999545.6A Active CN109241377B (en) 2018-08-30 2018-08-30 Text document representation method and device based on deep learning topic information enhancement

Country Status (1)

Country Link
CN (1) CN109241377B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170192956A1 (en) * 2015-12-31 2017-07-06 Google Inc. Generating parse trees of text segments using neural networks
WO2018085722A1 (en) * 2016-11-04 2018-05-11 Salesforce.Com, Inc. Quasi-recurrent neural network
CN107092596A (en) * 2017-04-24 2017-08-25 重庆邮电大学 Text emotion analysis method based on attention CNNs and CCR
CN107168957A (en) * 2017-06-12 2017-09-15 云南大学 A kind of Chinese word cutting method
CN107368613A (en) * 2017-09-05 2017-11-21 中国科学院自动化研究所 Short text sentiment analysis method and device
CN108460013A (en) * 2018-01-30 2018-08-28 大连理工大学 A kind of sequence labelling model based on fine granularity vocabulary representation model
CN108460019A (en) * 2018-02-28 2018-08-28 福州大学 A kind of emerging much-talked-about topic detecting system based on attention mechanism
CN108446275A (en) * 2018-03-21 2018-08-24 北京理工大学 Long text emotional orientation analytical method based on attention bilayer LSTM

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
CEDRIC DE BOOM et al.: "Representation learning for very short texts using weighted word embedding aggregation", Pattern Recognition Letters *
ZENGJIAN LIU et al.: "Entity recognition from clinical texts via recurrent neural network", BMC Medical Informatics and Decision Making *
ZHUANG LIRONG et al.: "Text sentiment classification based on a CSLSTM network", Computer Systems & Applications *
PANG YUMING: "A text classification method based on deep learning and Labeled-LDA", China Master's Theses Full-text Database, Information Science and Technology Series *
ZHAO QINLU et al.: "Text feature extraction method based on LSTM-Attention neural network", Modern Electronics Technique *

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110135551B (en) * 2019-05-15 2020-07-21 西南交通大学 Robot chatting method based on word vector and recurrent neural network
CN110135551A (en) * 2019-05-15 2019-08-16 西南交通大学 A kind of robot chat method of word-based vector sum Recognition with Recurrent Neural Network
CN110298038B (en) * 2019-06-14 2022-12-06 北京奇艺世纪科技有限公司 Text scoring method and device
CN110298038A (en) * 2019-06-14 2019-10-01 北京奇艺世纪科技有限公司 A kind of text scoring method and device
CN110489563B (en) * 2019-07-22 2022-08-23 北京百度网讯科技有限公司 Method, device, equipment and computer readable storage medium for representing graph structure
CN110489563A (en) * 2019-07-22 2019-11-22 北京百度网讯科技有限公司 Representation method, device, equipment and the computer readable storage medium of graph structure
CN111339762A (en) * 2020-02-14 2020-06-26 广州大学 Topic representation model construction method and device based on hybrid intelligence
CN111339762B (en) * 2020-02-14 2023-04-07 广州大学 Topic representation model construction method and device based on hybrid intelligence
CN111339783A (en) * 2020-02-24 2020-06-26 东南大学 RNTM-based topic mining method and device
CN111339783B (en) * 2020-02-24 2022-11-25 东南大学 RNTM-based topic mining method and device
CN111639189A (en) * 2020-04-29 2020-09-08 西北工业大学 Text graph construction method based on text content features
CN111639189B (en) * 2020-04-29 2023-03-21 西北工业大学 Text graph construction method based on text content features
CN111738303A (en) * 2020-05-28 2020-10-02 华南理工大学 Long-tail distribution image identification method based on hierarchical learning
CN111738303B (en) * 2020-05-28 2023-05-23 华南理工大学 Long-tail distribution image recognition method based on hierarchical learning
CN111858931B (en) * 2020-07-08 2022-05-13 华中师范大学 Text generation method based on deep learning
CN111858931A (en) * 2020-07-08 2020-10-30 华中师范大学 Text generation method based on deep learning
CN111949790A (en) * 2020-07-20 2020-11-17 重庆邮电大学 Emotion classification method based on LDA topic model and hierarchical neural network
CN111966792A (en) * 2020-09-03 2020-11-20 网易(杭州)网络有限公司 Text processing method and device, electronic equipment and readable storage medium
CN111966792B (en) * 2020-09-03 2023-07-25 网易(杭州)网络有限公司 Text processing method and device, electronic equipment and readable storage medium
CN112861982A (en) * 2021-02-24 2021-05-28 佛山市南海区广工大数控装备协同创新研究院 Long-tail target detection method based on gradient average
CN115563284A (en) * 2022-10-24 2023-01-03 重庆理工大学 Deep multi-instance weak supervision text classification method based on semantics

Also Published As

Publication number Publication date
CN109241377B (en) 2021-04-23

Similar Documents

Publication Publication Date Title
CN109241377A (en) A kind of text document representation method and device based on the enhancing of deep learning topic information
US11868883B1 (en) Intelligent control with hierarchical stacked neural networks
Barz et al. Hierarchy-based image embeddings for semantic image retrieval
Zhao et al. Open vocabulary scene parsing
CN109918528A (en) A kind of compact Hash code learning method based on semanteme protection
CN105975573A (en) KNN-based text classification method
Saha et al. A Lightning fast approach to classify Bangla Handwritten Characters and Numerals using newly structured Deep Neural Network
CN112364638A (en) Personality identification method based on social text
CN110874410A (en) Text classification method based on long-time and short-time memory network and convolutional neural network
CN114241273A (en) Multi-modal image processing method and system based on Transformer network and hypersphere space learning
Huang et al. Siamese network-based supervised topic modeling
CN109062958B (en) Primary school composition automatic classification method based on TextRank and convolutional neural network
CN115688024A (en) Network abnormal user prediction method based on user content characteristics and behavior characteristics
CN111507093A (en) Text attack method and device based on similar dictionary and storage medium
CN111259147B (en) Sentence-level emotion prediction method and system based on self-adaptive attention mechanism
Nanehkaran et al. A pragmatic convolutional bagging ensemble learning for recognition of Farsi handwritten digits
CN109948163B (en) Natural language semantic matching method for dynamic sequence reading
Artemov et al. Informational neurobayesian approach to neural networks training. Opportunities and prospects
CN111581365A (en) Predicate extraction method
Aljaafari Ichthyoplankton classification tool using Generative Adversarial Networks and transfer learning
Deng Large scale visual recognition
Kim et al. CNN based sentence classification with semantic features using word clustering
Preetham et al. Comparative Analysis of Research Papers Categorization using LDA and NMF Approaches
Hilmiaji et al. Identifying Emotion on Indonesian Tweets using Convolutional Neural Networks
Chen Insincere Question Classification by Deep Neural Networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant