CN109241377B

CN109241377B - Text document representation method and device based on deep learning topic information enhancement

Info

Publication number: CN109241377B
Application number: CN201810999545.6A
Authority: CN
Inventors: 张文跃; 王素格; 李德玉
Original assignee: Shanxi University
Current assignee: Shanxi University
Priority date: 2018-08-30
Filing date: 2018-08-30
Publication date: 2021-04-23
Anticipated expiration: 2038-08-30
Also published as: CN109241377A

Abstract

The invention discloses a text document representation method and device based on deep learning topic information enhancement. The method comprises the following steps: and S1, performing data preprocessing operation on the corpus document in the text form. And S2, designing a text sequence layer, and embedding the context information of each word in the word sequence in the expression vector of each word in the document. And S3, transitioning the sequence elements into the topic information of higher level through the attention layer. S4, in the topic layer, a representation of the current document D in all topic directions is generated. And S5, limiting the similarity degree between all topic information. And S6, fusing the topic representation vectors into a semantic representation vector Rep of the document D at a representation layer. S7, updating parameters of Rep through a classifier and an objective function, the method can effectively embed text sequence context semantic information and latent topic information into document representation vectors, and the topic information enhanced representation vectors can significantly improve the performance of a text mining model using the text sequence context semantic information and the latent topic information.

Description

Text document representation method and device based on deep learning topic information enhancement

Technical Field

The invention relates to the field of computer text representation learning, in particular to a text document representation method based on deep learning enhanced topic information enhancement and a text document representation device based on deep learning enhanced topic information enhancement.

Background

Document-level, holistic grasp on text is an important requirement for many text processing tasks. Currently, this problem is generally solved by textual representation learning. The text document level representation learning task mainly aims at constructing a method for converting a text document into a representation vector which can be directly operated by a computer according to the intrinsic semantic information of the text document. Specifically, a document in text form is represented as a fixed-length real-type vector that implies its semantics. Nowadays, document representation learning has become a fundamental and widespread application in the fields of natural language processing, text mining, information extraction, and the like.

The document representation learning methods that are currently most widely used are roughly of three categories, each of which has various disadvantages: (1) based on the "bag of words" (BoW) model, also known as the "vector space model". The expression vectors generated by the model are sparse and non-real, and the vectors are often not good in later application; (2) semantic analysis-based methods, such as a probabilistic latent semantic analysis model and an LDA document theme generation model, ignore context information of word sequences in texts, and restrict semantic carrying capacity of expression vectors; (3) the long-short term memory model (LSTM) based on the recurrent neural network is widely applied to the distributed representation vector generation of text documents. However, ordinary LSTM may not be sufficient to obtain global topic information of the corpus.

The disadvantages of the above approach show the difficulties currently faced by the document representation learning task: when the model is based on topic information of corpus globality, context information in documents is often lost (for example, the word "apple" cannot be determined without the context information to refer to fruit or science and technology companies), while when the model is focused on the local information, the globality topic information is ignored (correlation among documents), and in addition, no restriction mechanism exists among the topic information, so that the topic information tends to be similar, and the performance of the model is reduced (for example, a topic group with redundancy situations such as "economy", "entertainment", "chariot", "warship" is separated). All of these deficiencies can leave the representation vectors of the document lacking some semantic information, which can limit the effectiveness of these representation vectors in other applications at a later time.

Disclosure of Invention

The present invention is directed to solving, at least to some extent, one of the technical problems in the related art.

Therefore, an object of the present invention is to provide a text document representation method for enhancing topic information enhancement based on deep learning, which can enable a text document to generate a dense and real representation vector containing word order context information and topic information.

Another object of the present invention is to provide a text document presentation apparatus enhanced based on deep learning enhanced topic information.

In order to achieve the above object, an embodiment of the invention provides a text document representation method based on deep learning enhanced topic information enhancement, which includes the following steps:

s1, for a document D consisting of n words in a corpus containing K topics { w ═ w₁,w₂,...,w_nCarrying out data preprocessing operations of cleaning, extracting, converting and sorting to obtain a word vector matrix D of the document (x)₁,x₂,...,x_n}；

S2, constructing a text sequence layer by using the sequence relation among the words, designing a sequence form long-term and short-term memory model, and acquiring a potential semantic matrix Hs (h) of the document₁,h₂,...,h_nIn which h_i＝f₁(x_i,h_i-1)，h₀＝f₁(x₀)，f₁Operating for a neural network node;

s3, using the latent semantic matrix Hs ═ { h ═ h₁,h₂,...,h_nGenerate a corresponding attention intensity matrix a ═ a₁,a₂,...,a_nAnd converting the A matrix and then normalizing the A matrix according to the rows to obtain an attention weight matrix A, wherein a_i＝f₂(h_i)，f₂Is a transformation function;

s4, fusing the latent semantic matrix Hs and the attention weight matrix A, and obtaining a mapping matrix expression VTs of all topics of the document, wherein VTs is f₃(Hs, A), wherein f₃Is a transformation function;

s5, using the label information of the cross-document to restrain the similarity degree of VTs represented by the mapping matrix of the topic, and obtaining a mapping matrix representation VTk after topic information enhancement;

s6, forVTk are fused to obtain a semantic representation vector Rep of the document D, wherein Rep is f₄(VTk) wherein f₄Is a fusion function;

s7, classifying the Rep by a topic classifier, obtaining an error index according to classification accuracy and topic similarity indexes, and updating the model parameters in the steps S1-S6 by using an objective function gradient descent method.

According to the text document representation method based on deep learning enhanced topic information enhancement provided by the embodiment of the invention, firstly, words in a text form are converted into word vectors by using a word embedding technology, so that the form of a document is changed into a real number matrix, and then a text sequence layer is established according to the characteristic of text context semantic information sequentiality. After the real matrix of the document passes through the sequence layer, the real matrix becomes a potential semantic matrix with context semantic information. And then, calculating an attention weight matrix corresponding to the latent semantic matrix by using the latent semantic matrix, and realizing the enhancement of topic information with higher granularity by fusing the latent semantic matrix and the attention weight matrix. Then, topic similarity constraint mechanism is used to make the topics be distinguished from each other as much as possible, so as to obtain all topic representations of the document. And finally, fusing all topic representations to serve as the enhanced representation vector of the topic information of the document, so that the text document is a dense and real representation vector containing word sequence context information and topic information, and the topic redundancy is reduced.

In order to achieve the above object, another embodiment of the present invention provides a text document representation apparatus enhanced based on deep learning enhanced topic information, including a text sequence layer, an attention layer, a topic layer and a representation layer, wherein the text sequence layer is used for D ═ w { of n words in a corpus containing K topics₁,w₂,...,w_nCarrying out data preprocessing operations of cleaning, extracting, converting and sorting to obtain a word vector matrix D of the document (x)₁,x₂,...,x_nAnd the word vector matrix D of the document is set as { x }₁,x₂,...,x_nAcquiring a potential semantic matrix Hs (h) of the document through a sequence form long-term and short-term memory model₁,h₂,...,h_nIn which h_i＝f₁(x_i,h_i-1)，h₀＝f₁(x₀)，f₁Operating for a neural network node; the attention layer is used for extracting and separating topic information in the text, realizing connection of two kinds of granularity information from a word level to a topic level and realizing a function of extracting unknown information from known information; from the latent semantic matrix Hs ═ h₁,h₂,...,h_nGenerate a corresponding attention intensity matrix a ═ a₁,a₂,...,a_nAnd converting the A matrix and then normalizing the A matrix according to the rows to obtain an attention weight matrix A, wherein a_i＝f₂(h_i)，f₂The topic layer is used for obtaining mapping matrix expression VTs of all topics of the document, and the VTs is f₃(Hs, A), wherein f₃Is a transformation function; and the similarity degree of the mapping matrix representation VTs of the topic is restrained by using the label information of the cross-document, and the mapping matrix representation VTk after topic information enhancement is obtained; the representation layer is used for fusing the VTk to obtain a semantic representation vector Rep of the document D, wherein Rep is f₄(VTk) wherein f₄And fusing functions, classifying the Rep by a topic classifier, obtaining an error index according to classification accuracy and topic similarity indexes, and updating model parameters by using a target function gradient descent method.

According to the text document representation device based on deep learning enhanced topic information enhancement provided by the embodiment of the invention, firstly, words in a text form are converted into word vectors by using a word embedding technology, so that the form of a document is changed into a real number matrix, and then a text sequence layer is established according to the characteristic of text context semantic information sequentiality. After the real matrix of the document passes through the sequence layer, the real matrix becomes a potential semantic matrix with context semantic information. And then calculating an attention weight matrix corresponding to the latent semantic matrix at the attention layer by using the latent semantic matrix, and realizing the enhancement of topic information with higher granularity by fusing the latent semantic matrix and the attention weight matrix. Then, topic similarity constraint mechanism is used at topic layer to make the topics be distinguished from each other as much as possible, so as to obtain all topic representations of the document. And finally, fusing all topic representations to serve as the enhanced representation vector of the topic information of the document, so that the text document is a dense and real representation vector containing word sequence context information and topic information, and the topic redundancy is reduced.

Compared with the prior art, the invention has the following beneficial effects:

1. a sequence LSTM model is adopted to model the word sequence of the text, so that the model can better fuse the context information of the text;

2. the extraction-type attention mechanism of the entirely new structure supports a "sequence-to-tree" structure process, which is used to extract topic information from text sequence information. In addition, the method not only can embed the word-topic associated information of the text into a representation vector, but also can explicitly return the support degree of words in the document to different topics, and can be used as a visual result for displaying and testing;

3. the introduction of the similarity constraint mechanism of the topic layer improves the 'long tail effect' of the original topic model, namely, the model is degraded due to the fact that some topics are too similar. Meanwhile, the problem of the convergence faced by the general attention mechanism is also solved. The tendency is caused by too few variables in the process of calculating attention, so that the attention weight distribution of all topics tends to be the same, and a similarity constraint mechanism adds the variables to the calculation process;

4. the novel document expression vector generation method is formed by combining a plurality of special sub-models, and in the whole view, the models can not only encode the context semantic information in the local document, but also enhance the global latent topic semantic information at the corpus level so as to be embedded into the final document expression vector;

5. the innovation point of the method is that various innovation submodels are designed according to different semantic information and are compounded into a deep learning model for document representation learning. The most important innovation is the attention mechanism of the sequence-to-tree structure and the design of the topic information similarity constraint mechanism. Experiments on different data sets show that the document expression vector generated by the method is superior to other classical comparison models in three main text mining tasks of text classification, topic identification and text clustering, and the method can practically improve the quality of the text expression vector.

Drawings

FIG. 1 is an overall hierarchy framework diagram of the present invention.

FIG. 2 is a view of the attention layer structure described in steps S3-S4.

Fig. 3 is a schematic diagram of the topic similarity constraint mechanism in step S5.

FIG. 4A is a comparison of results of various algorithm-generated document representation vectors in a classification experiment.

Fig. 4B is a visualization display of the relevance of topic difference and document classification accuracy.

FIG. 5 is a visual display of the effect of the invention on the topic identification task.

FIG. 6 is a comparison of the results of the present invention with classical algorithms in a text clustering task.

FIG. 7 is a flowchart of a text document representation method based on deep learning topic information enhancement according to the present invention.

Detailed Description

In this embodiment, the experiment of the text document representation method based on deep learning topic information enhancement of the present invention is completed on a computer of Shanxi university and a cluster computer of information technology institute, the cluster is composed of 5 high performance computers to form a computing and managing node, and the network connection adopts gigabit Ethernet and infiniband 2.5G network. Each node is provided with an eight-core CPU and a 128GB memory, the CPU is intel xeon E3-1230V53.4GMhz main frequency and is provided with two NVIDIA GTX1080 high-performance graphics cards, and large-scale matrix operation and deep learning model training can be carried out.

As can be seen from fig. 1 to 7, the present invention divides different semantic information into several sub-models, which are connected layer by layer and finally fused. The learning process mainly comprises the following steps:

s1, for a document D consisting of n words in a corpus containing K topics { w ═ w₁,w₂,...,w_nCleaning, extracting and rotatingPreprocessing the data to obtain the word vector matrix D ═ x of the document₁,x₂,...,x_nThe method comprises the following specific steps:

and S11, extracting and cleaning all text data, wherein if the text data is English data, marking, word stem and the like are needed, and if the text data is Chinese data, Chinese word segmentation is needed. And removing stop words in the data and deleting the document with too few words (less than 6 words).

S12, converting all words in the corpus into Word vectors by using the Word2Vec Word vector model after being pre-trained by the big corpus. Where the obscure words (not present in the word vector model) are discarded.

S13, acquiring labels of the training corpus, wherein K are respectively corresponding to K topics, and each topic is corresponding to a unique one-hot form vector for supervising the learning process. These tag vectors and the document data after the preprocessing thereof are associated with each other as experimental data.

S2, for potential semantic extraction of context, the invention constructs a text sequence layer by using sequence relation among words, designs a sequence form long-short term memory model (seq-LSTM), and embeds context information of each word in the word sequence in the expression vector of each word in the document. The method comprises the following specific steps:

s21, calculating states of all gate elements of the LSTM, wherein the LSTM gate elements play a control role in calculation, are flexibly adjusted according to input information and are mainly divided into an input gate, an output gate and a forgetting gate, and the input and the output of deep learning node information and the adjustment of historical information are respectively controlled, and the specific calculation mode is as follows:

i, F, O and G are input gates, output gates, forgetting gates and node information states respectively, sigma represents a sigmoid activation function, tanh is a hyperbolic tangent function, Wseq and Bseq are weight matrixes and offset vectors of the deep learning neural network respectively, and seq represents that parameters belong to a text sequence layer. All the gate states are calculated by historical information and current word vector input according to the formula;

s22, calculating the LSTM hidden state. The hidden state is a module in the long-short term memory model for storing history or other information, and is calculated as follows:

C_t＝I_t·G_t+F_t·C_t-1

wherein C represents a node hidden state corresponding to a certain word, and the hidden state is influenced by node information and a historical hidden state which are respectively regulated by an input gate and a forgetting gate, and the regulation is realized by element multiplication among vectors. In a word, the hidden state of the current word is balanced and adjusted between the current input state and the historical state according to semantic information;

and S23, calculating the LSTM node state. After obtaining the hidden state corresponding to the current word of the document, activating the hidden state to obtain the latent context semantic state corresponding to the word:

h_t＝O_t·tanh(C_t)

as shown in the formula, the activation function selects the hyperbolic tangent function, and the activation value is used as the node state for subsequent calculation after being adjusted by the output gate.

And S24, recording the text sequence layer result. Document D ═ { x₁,x₂,...,x_nH, generating a corresponding semantic state matrix Hs (h) through a text sequence layer₁,h₂,...,h_nC and a hidden state matrix Cs ═ C₁,C₂,...,C_nThe two matrixes contain context semantic information in the document D, for example, vectors of the words "cry" in "happy to cry" and "cry" in "sad to cry" are the same, but the expression vectors (node states h) of the two words are different due to the difference of the two words after the sequence layer.

S3, in order to enhance the topic information in the context semantic information of the document, the sequence elements are necessarily transited into the topic information of higher level, for which the invention proposes a new extraction type attention mechanism and constructs on the text sequence layer, as shown in FIG. 2. While the past attention mechanism often concatenated two sequence structures, the present invention requires concatenating sequences and tree nodes, where each sequence element represents a position in the sequence of document words and each tree node represents a topic. And it is generally noted that both structures in the mechanism are known information, whereas the extraction mechanism of the present invention extracts potential information (i.e., topics) from the known information. The method comprises the following specific steps:

and S31, acquiring attention intensity. The attention intensity is calculated according to the following formula according to the document context semantic information:

wherein W_attAnd b_attWeight matrix and offset vector parameters, a, of the attention layer, respectively_tIs a K-dimensional vector, and the value of each dimension of the K-dimensional vector represents the attention intensity of the t-th word of the document to the corresponding topic.

And S32, calculating an attention weight matrix. The attention intensity matrix a obtained after the step S31 is an n × K matrix, which is first transformed into K × n, so that the meaning of the matrix becomes the attention (expression) intensity of each line indicating the topic of the current document text sequence, for example: the word "apple" at a certain position in a document expresses topic 1, topic 2 … …, and the like to a certain extent (unlike the conventional attention mechanism, the specific contents of the topics are not required to be specified, and are unknown).

This intensity distribution is then normalized to the form of a probability distribution by the following softmax algorithm:

finally, recording the normalized attention weight matrix A:

s4, in the topic layer, the attention weight matrix A from the attention layer is fused with the document context semantic information Hs generated by the text sequence layer. The semantic information Hs is merged according to the corresponding different attention weights, and because the expression of the weight reaction expression vector to the topic is strong and weak, the potential topic information in the original semantic information is enhanced or emphasized. Finally, a representation of the current document D in all topic directions is generated, which can also be regarded as a mapping of its semantic information in all topic spaces (which can be understood as what an article about apple company looks like in different topic perspectives such as "science", "economy", "politics"). As shown in fig. 1 and 2, the model has K topics representing nodes VTs corresponding to all topics in the corpus, and VTCs are hidden states of VTs generated by using LSTM type deep learning nodes, and they are calculated as follows:

the VTs and the VTCs are provided with K rows, and each row vector corresponds to a topic related information representation vector and also corresponds to an LSTM type node. The topic representation of the document D is obtained by weighting and summing all the context semantic information of the document D according to the expression intensity of the topic representation.

According to the characteristics of the global topic information and the local context semantic information of the text, a plurality of substructures are designed, and are laminated and compounded, and then the overall structure is used for learning document semantic representation. Due to the design, different types of semantic information can be processed with corresponding modules in a targeted mode, and due to the fact that different types of information are different greatly, integration of different modules cannot be achieved simply through stacking, and therefore the design is provided with an extraction type attention mechanism to be responsible for bridging between the semantic modules and the topic modules.

S5, constraining the degree of similarity between all VTs. As described above, there may be a tendency of convergence between the previous model-generated topic representations, for example, there should be a "military" topic in the corpus, but the model is decomposed into topics such as "weapons" and "military", and other topics should be merged, which often appears in the corpus with a large difference in the number of topic documents. The problem is reflected in the model of the invention, namely, VT expression vectors are mathematically too similar, so that K topic information can be obviously lost, and the performance of the model is degraded. Therefore, at the topic level, the invention designs a unique topic information similarity constraint mechanism as shown in fig. 3. Wherein, L is a topic label vector with the length of K and the form of 'one-hot' (the value of a certain position is close to 1, and the rest positions are close to 0), the basic principle of the constraint mechanism is that a contrast vector v generated by the topic representation vector is gradually similar to the label L through the training process, and the difference degree between the topic information representation vectors can be gradually enlarged due to the high orthogonality between the L. The concrete implementation steps of the similarity constraint mechanism are as follows:

and S51, converting topic expression vectors. The dimensions of the topic representation vectors VT and VTC are not necessarily equal to K and are computationally infeasible to compare with the topic label L, so their lengths are first transformed by the following algorithm:

ws and Bs are weight matrix parameters and bias matrix parameters in a topic information similarity constraint mechanism, sigma is a sigmoid activation function, the length of a contrast vector vk is K, and each document has K contrast vectors and each vector corresponds to one topic.

And S52, similarity measurement. The invention adopts the cross entropy as the similarity measurement between the contrast vector and the topic label vector, and the calculation mode is as follows:

when s is_kSmaller values represent contrast vectors v_kAnd topic tag vector L_kThe more similar, the generation of v is now demonstrated_kTopic information vector VT of_kAnd VTC_kThe greater the difference with other topic vectors.

And S53, calculating topic similarity scores. After the similarity scores of all topics are obtained, averaging the similarity scores to obtain topic information, wherein the similarity comprehensive score S is as follows:

the smaller the S value is, the smaller the similarity between topic information representation vectors is, the smaller the topic information redundancy is, and the more comprehensive the topic information in the document representation vectors generated by the invention is. The present invention minimizes the S value by objective function error feedback and parameter update during the training phase.

And S6, fusing the topic representation vectors into a semantic representation vector Rep of the document D at a representation layer. In step S5, K topic information expression vectors are obtained, the topic expression vectors are used as leaf nodes of a tree and the final document expression vector Rep is used as a parent node in a presentation layer through a tree LSTM model, and semantic information is gathered from child nodes to the parent node through LSTM type operation, which specifically comprises the following steps:

s61, tree LSTM gate element state calculation. The input gate, output gate and node state of tree LSTM father node are calculated first, the algorithm is slightly different from the previous sequence part:

wherein W_tr、B_trAnd indicating that the weight matrix and the bias matrix are positioned at the representation layer of the tree type. As can be seen from the equation, the data in the K representative vectors are integrated to generate a single gate element, no longer having a differenceTopic differentiation, as all enhanced topic information is already contained in the final state vectors I, O and G.

And S62, calculating the special forgetting door state. Unlike the rest of the gate elements, the forgetting gate in the tree structure of the present invention plays a role of controlling the information flow from the child node to the parent node, so that each child node has one forgetting gate, and the calculation of the forgetting gates between nodes is independent from each other (due to the independence between the subject information). For example, a specific algorithm for forgetting the state of the gate by the kth topic child node is as follows:

the expression shows that the forgetting gate state of the corresponding child node of each topic is calculated by the semantic information contained in the topic expression vector.

And S63, hidden state calculation. The hidden state of the LSTM node in the sequence structure stores history information and the hidden state of the parent node in the tree structure stores information from child nodes that are controlled by their respective forgetting gates to reach the parent node, as mentioned in step F2. When the parent node calculates its hidden state, the adjusted child node information is combined as follows:

and S64, generating a document representation vector. In the step, firstly, the hidden state of the father node is processed by an activation function and an output gate to obtain a node state vector, and finally, a representation vector Rep of the current document is obtained by one-layer dimension adjustment. The specific calculation method is as follows:

h＝O·tanh(C)

Rep＝σ(W_rh+b_r)

wherein, W_rAnd B_rAre parameters of a deep learning neural network. Due to document representationThe required length of the vector may not be consistent with the dimension of the deep learning hidden layer, so the invention additionally adds a vector length adjusting operation.

S7, classifier layer and objective function. In order to train the model of the invention, after semantic expression vectors of the document are obtained, the vectors are classified through a topic classifier, the classification accuracy is recorded, a topic similarity index is added to obtain a system error index of the document D of the current model, and then the model parameters of the invention are updated by an error feedback algorithm of a deep learning model and a target function gradient descent method. The objective function of the present invention is as follows:

and b, adjusting the balance of classification precision and topic difference degree by the lambda parameter, wherein g is a topic category mark of the document D, and p is a classification result made by the classifier according to the document Rep.

The text expression vector generated by a good expression learning method can make a natural language processing task applying the vector perform better because the text expression vector contains more and more accurate semantic information, so that the invention tests the generated document expression vector by using three tasks of text classification, topic detection and text clustering which are most widely applied.

Fig. 4A and 4B are experimental performances of the document expression vectors generated by the present invention in topic classification, which are a classification precision experiment and a topic information similarity validity experiment, respectively. To verify the classification performance of the representative vectors, the experiment used three classes of text corpora, with 90% of the documents in the corpora used for training and the remainder used for testing. The choice word vector dimension, the deep learning hidden layer dimension, and the representation vector dimension are 50, 100, and 50, respectively. The parameter λ of the objective function is 0.2, the initial value of the model learning rate is 0.1, and the learning method is adagard. Referring to fig. 4A, almost on all corpora, the accuracy of the invention (TE-LSTM) is better than that of other classical contrast algorithms, and the result with the topic information similarity constraint mechanism (with SC) is better than that without the mechanism (with SC), which shows that the representation learning method proposed by the invention can improve the semantic information amount in the representation vector, and the topic information similarity constraint mechanism obviously plays a positive role. In fig. 4B, the abscissa indicates the degree of difference between topic information, the larger the numerical value is, the lower the topic information similarity is, and the ordinate of the graph indicates the classification accuracy of the document in the degree of difference interval. As can be seen from the curve of fig. 4B, as the difference between the topic information is larger, the classification accuracy of the expression vector also shows a trend of increasing gradually, which also illustrates the effectiveness of the topic information similarity constraint mechanism of the present invention, which reduces the topic information redundancy of the model and improves the information expression capability of the vector.

FIG. 5 shows the presentation of document representation vectors generated by the present invention in a topic identification task. The left-most side in the table is the model name, which is lda2vec, the invention (no topic information constraint mechanism) and the invention (with constraint mechanism) model. The second column is the topic tag in the corpus, listing 4 of the 20 topics. The third column is keywords of topics detected from the corpus, the keywords are words with a degree of criticality ranking 5 in each topic obtained by model calculation, and the degree of criticality of a word in the invention is the attention weight value of the word to the topic. The values in the last column are topic relevance calculated after the online platform Palmetto receives 5 keywords, and the higher score indicates that the semantics of the keywords are close to that they are more likely to originate from the same topic. The analysis chart shows that the method obtains obviously better experiment results from both qualitative and quantitative angles, and the similar classification experiment proves that the quality of the expression vector is improved by all designs of the method, because the model performance of the topic information similarity constraint mechanism is more excellent.

FIG. 6 shows the representation of the document semantic information representation vector generated by the present invention in the text clustering task. Representation learning is a task of converting text form data into a representation vector which can be directly calculated, and semantic information of text can be intuitively reflected through calculation, for example, the closer words in a word vector are, the smaller the vector distance between the words is, the more closely the words are. Similarly, the degree of correlation between the documents can also be determined by calculating the distance between the document representation vectors. The better the quality, the higher the correlation between the degree of relevance of the representation vector document and the vector distance. Accordingly, a text clustering task is set to detect the performance of the vectors generated by the invention. The more documents belonging to the same topic are clustered and judged in the same cluster, the better the vector performance is proved, and the representation learning of the invention is more excellent. The numerical values in fig. 6 are calculated in the following manner: calculating the document content of the most topics in the clusters one by one, recording the document content if the topic does not have a corresponding cluster, selecting the topic with the second highest content until meeting the topic which is not arranged, and taking the average value of the document content of all the topics as the text cluster score of the model after all the clusters have the corresponding topics. Referring to fig. 6, it is known that the expression vector of the present invention has the best clustering effect, and the model using the topic information similarity constraint mechanism obtains the highest score, which proves that the present invention can generate the document semantic expression vector with better quality.

In summary, for a document D consisting of n words in a corpus containing K topics, { w ═ w₁,w₂,...,w_nThe invention adopts the following technical scheme:

obtaining word vector matrix D ═ { x) of document through pre-training₁,x₂,...,x_nIn the context sequence, the potential semantic h corresponding to each word_i＝f₁(x_i,h_i-1)，h₀＝f₁(x₀) Where f is the conversion function. Thus, even if the same word has different latent semantics in different context (i.e. the same word has different representation vectors in different positions in the text), the difference is just the proof of the implication of the context semantic information. Further, f in the formula₁It may be a neural network node operation.

In topic information acquisition. From the underlying semantic matrix H ═ H of the document₁,h₂,...,h_nGenerate a corresponding attention intensity matrix a ═ a₁,a₂,...,a_n} of whichIn (a)_i＝f₂(h_i) Is that each dimension of the K-dimension vector represents the attention intensity (or expression intensity) of the ith word in the sequence to a certain topic, f₂Is a conversion function. And finally, the A matrix is rotated and then is normalized according to rows to obtain an attention weight matrix A.

In the aspect of topic information enhancement, the context semantic information and the attention weight of the document are combined to generate a mapping matrix (VT) of all topics of the document, wherein the VT is f₃(H, A) wherein f₃Each line of VT corresponds to a topic, and represents the information of the topic contained in the document D. After this portion, the topic information of the document has been individually enhanced.

In the aspect of topic information control, the document topic information obtained in the last stage is limited by using label information of the whole corpus of the cross-document. Each topic has its fixed corresponding label vector L, e.g., L_iThis vector is used to pair VT_iAnd limiting, namely controlling the neural network classifier by using similar supervision information, wherein L is the supervision information, label vectors corresponding to each topic are highly orthogonal, and the topic information limited by the labels is highly different from each other naturally.

And fusing the semantic information after topic enhancement into a document expression vector. Topics are not related to each other and are all attributed to a representation vector, and a typical tree structure is formed. Different from the common way of combining according to the weight, the lack of the weight requires that all topic vectors are fused in a more comprehensive way, and the fusion way is set as f₄If the expression vector of the document D is Rep, then Rep ═ f₄(VT). During training, a classifier is arranged on the Rep, the classifier is trained by using the class vector of the document, and the model is updated by using an error return and gradient descent mode.

Therefore, the text document is a dense and real expression vector containing word sequence context information and topic information, and topic redundancy is reduced.

In the description of the present invention, it is to be understood that the terms "central," "longitudinal," "lateral," "length," "width," "thickness," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," "clockwise," "counterclockwise," "axial," "radial," "circumferential," and the like are used in the orientations and positional relationships indicated in the drawings for convenience in describing the invention and to simplify the description, and are not intended to indicate or imply that the referenced device or element must have a particular orientation, be constructed and operated in a particular orientation, and are not to be considered limiting of the invention.

For this reason, the terms "first", "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, etc., unless explicitly specified otherwise.

In the present invention, unless otherwise expressly stated or limited, the terms "mounted," "connected," "secured," and the like are to be construed broadly and can, for example, be fixedly connected, detachably connected, or integrally formed; can be mechanically or electrically connected; they may be directly connected or indirectly connected through intervening media, or they may be connected internally or in interactive relationship with each other unless otherwise specifically limited. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.

In the present invention, unless otherwise expressly stated or limited, the first feature "on" or "under" the second feature may be directly contacting the first and second features or indirectly contacting the first and second features through an intermediate. Also, a first feature "on," "over," and "above" a second feature may be directly on or obliquely above the second feature, or simply mean that the first feature is at a lesser level than the second feature.

In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example" or "some examples" or the like are intended to mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.

The accompanying drawings and the detailed description are included to provide a further understanding of the invention. The method of the present invention is not limited to the examples described in the specific embodiments, and other embodiments derived from the method and idea of the present invention by those skilled in the art also belong to the technical innovation scope of the present invention. This summary should not be construed to limit the present invention.

Claims

1. A text document representation method based on deep learning topic information enhancement is characterized by comprising the following steps:

s5 includes the steps of:

s51, topic representation vector transformation, the dimensionality of topic representation vectors VT and VTC are not necessarily equal to K, so their lengths are first transformed by the following algorithm:

ws and Bs are weight matrix parameters and bias matrix parameters in a topic information similarity constraint mechanism, sigma is a sigmoid activation function, the length of a contrast vector vk is K, and each document has K contrast vectors and each vector corresponds to a topic;

s52, measuring the similarity, wherein the cross entropy is used as the similarity measurement between the contrast vector and the topic label vector, and the calculation method is as follows:

when the sk value is smaller, the comparison vector vk and the topic label vector Lk are more similar, the comparison difference between the topic information vector VTk and the VTCk which generate vk and other topic vectors is proved to be larger, wherein L is the topic label vector with the length of K in the form of 'one-hot'; k training corpus topic labels are respectively corresponding to K topics, and each topic is corresponding to a unique one-hot form vector for supervising the learning process; the label vectors and the preprocessed document data are mutually corresponding to serve as experimental data;

s53, calculating topic similarity scores, averaging the similarity scores of all topics to obtain a comprehensive topic information similarity score S:

the smaller the S value is, the smaller the redundancy of topic information is, and the more comprehensive the topic information in the generated document expression vector is; minimizing the S value through target function error feedback and parameter updating in a training stage;

s6, fusing the VTk to obtain a semantic expression vector Rep of the document D, wherein Rep is f₄(VTk) wherein f₄Is a fusion function;

2. The text document representation method enhanced based on deep learning topic information as claimed in claim 1, wherein S1 comprises the steps of:

s11, extracting and cleaning all text data, wherein if the text data is English data, marking and stemming are carried out; if the data is Chinese data, performing Chinese word segmentation processing; removing stop words in the text data, and deleting documents with the number of words less than six words;

s12, converting all words in the corpus into Word vectors by using the Word2Vec Word vector model after being pre-trained by the big corpus.

3. The text document representation method enhanced based on deep learning topic information as claimed in claim 1, wherein S2 comprises the steps of:

s21, designing a sequence form long-term and short-term memory model, namely an LSTM model, and calculating in the following way,

i, F, O and G are respectively an input gate, an output gate, a forgetting gate and a node information state, sigma represents a sigmoid activation function, tanh is a hyperbolic tangent function, Wseq is a weight matrix of the deep learning neural network, Bseq is a bias vector of the deep learning neural network, and seq represents that a parameter belongs to a text sequence layer;

s22, calculating the hidden state Ct corresponding to the current word of the document according to the LSTM model in the following way,

C_t＝I_t·G_t+F_t·C_t-1

s23, according to LSTM model and the hidden state Ct corresponding to the current word of the document, activating the hidden state Ct to obtain the potential context semantic state corresponding to the word, the calculation mode is as follows,

h_t＝O_t·tanh(C_t)

s24, recording the text sequence layer result, document D ═ x₁,x₂,...,x_nH, generating a corresponding semantic state matrix Hs (h) through a text sequence layer₁,h₂,...,h_nC and a hidden state matrix Cs ═ C₁,C₂,...,C_nAnd the two matrixes contain context semantic information in the document D.

4. The text document representation method enhanced based on deep learning topic information as claimed in claim 3, wherein S3 comprises the steps of:

s31, obtaining notes according to the context semantic information of the document DStrength of gravity a_tThe calculation method is as follows,

wherein, a_tIs a K-dimensional vector representing the attention intensity of the tth word of the document to the corresponding topic, W_attAnd b_attRespectively are a weight matrix and a bias vector parameter of the attention layer;

s32, calculating an attention weight matrix; the attention intensity matrix a obtained in step S31 is ═ a₁,a₂,...,a_nIs an n x K matrix, which is first transformed into K x n, i.e.,

this intensity distribution is normalized to the form of a probability distribution by the following softmax algorithm,

the normalized attention weight matrix a is finally recorded as follows,

5. the text document representation method enhanced based on deep learning topic information as claimed in claim 4, wherein S4 comprises the steps of:

fusion is realized through a potential semantic matrix Hs and the attention weight matrix A, and mapping matrix representation of the current document D on all topics is obtained; VTs corresponds to all K topics in the corpus, and VTCs is a hidden state corresponding to VTs, and their calculation methods are as follows:

the VTs and the VTCs both have K lines, and each line vector corresponds to a related information representation vector of a topic.

6. The text document representation method enhanced based on deep learning topic information as claimed in claim 5, wherein S6 comprises the steps of:

s61, calculating the states of the input gate, the output gate and the node of the tree LSTM father node by the following method,

wherein W_tr、B_trThe weight matrix and the bias matrix are represented in a tree-shaped representation layer, and as can be seen from the formula, data in the K representation vectors are integrated to generate a single gate element, so that no distinction of different topics is needed any more, because all enhanced topic information is already contained in the final state vectors I, O and G;

s62, calculating the special forgetting gate state, different from other gate elements, each child node in the tree LSTM model structure has a forgetting gate, and the forgetting gate calculation among the nodes is also independent, wherein the forgetting gate plays a role in controlling the information flow condition from the child node to the father node, the forgetting gate state of the child node of the kth topic is calculated in the following way,

s63, calculating hidden state, storing information from child nodes in the hidden state of father node in the tree LSTM model structure, combining the adjusted information of child nodes when father node calculates the hidden state, calculating the hidden state by the following method,

s64, generating a document expression vector, firstly obtaining a node state vector from the hidden state of a father node through an activation function and an output gate, and finally obtaining an expression vector Rep of the current document through one-layer dimension adjustment, wherein the specific calculation method is as follows:

h＝O·tanh(C)

Rep＝σ(W_rh+b_r)

wherein, W_rAnd B_rAre parameters of a deep learning neural network.

7. The text document representation method enhanced based on deep learning topic information as claimed in claim 6, wherein S7 comprises the steps of:

setting a classifier and an objective function, recording a classification result of a semantic expression vector Rep of a document through a topic classifier, adding a topic similarity index to obtain a system error index of a current document D, then updating model parameters by using an objective function gradient descent method through a deep learning error feedback algorithm, wherein the objective function is shown as follows,

wherein, the lambda parameter adjusts and balances the classification precision and the topic difference degree, g is the topic category mark of the document D, and p is the classification result.

8. A text document presentation apparatus enhanced based on deep learning topic information, comprising:

a text sequence layer for matching a document D consisting of n words in a corpus containing K topics { w ═ w₁,w₂,...,w_nCarrying out data preprocessing operations of cleaning, extracting, converting and sorting to obtain a word vector matrix D of the document (x)₁,x₂,...,x_nAnd the word vector matrix D of the document is set as { x }₁,x₂,...,x_nAcquiring a potential semantic matrix Hs (h) of the document through a sequence form long-term and short-term memory model₁,h₂,...,h_nIn which h_i＝f₁(x_i,h_i-1)，h₀＝f₁(x₀)，f₁Operating for a neural network node;

the attention layer is used for extracting and separating topic information in the text, realizing connection of two kinds of granularity information from a word level to a topic level and realizing a function of extracting unknown information from known information; from the latent semantic matrix Hs ═ h₁,h₂,...,h_nGenerate a corresponding attention intensity matrix a ═ a₁,a₂,...,a_nAnd converting the A matrix and then normalizing the A matrix according to the rows to obtain an attention weight matrix A, wherein a_i＝f₂(h_i)，f₂Is a conversion function, and realizes the fusion of the latent semantic matrix Hs and the attention weight matrix A,

a topic layer for obtaining mapping matrix expression VTs of all topics of the document, wherein VTs is f₃(Hs, A), wherein f₃Is a transformation function; and the similarity degree of the mapping matrix representation VTs of the topic is restrained by using the label information of the cross-document, and the mapping matrix representation VTk after topic information enhancement is obtained;

a presentation layer, configured to fuse the VTk to obtain a semantic representation vector Rep of the document D, where Rep ═ f₄(VTk) wherein f₄And fusing functions, classifying the Rep by a topic classifier, obtaining an error index according to classification accuracy and topic similarity indexes, and updating model parameters by using a target function gradient descent method.