CN109241377A - Text document representation method and device based on deep-learning topic information enhancement - Google Patents

Text document representation method and device based on deep-learning topic information enhancement

Info

Publication number
CN109241377A
Authority
CN
China
Prior art keywords
topic
document
vector
information
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810999545.6A
Other languages
Chinese (zh)
Other versions
CN109241377B (en)
Inventor
张文跃
王素格
李德玉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanxi University
Original Assignee
Shanxi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanxi University filed Critical Shanxi University
Priority to CN201810999545.6A
Publication of CN109241377A
Application granted
Publication of CN109241377B
Legal status: Active
Anticipated expiration


Abstract

The invention discloses a text document representation method and device based on deep-learning topic information enhancement. The method includes: S1, performing data preprocessing on corpus documents in text form; S2, designing a text sequence layer that embeds the word-order contextual information of each word in the document into its representation vector; S3, transitioning the sequence elements to higher-level topic information through an attention layer; S4, generating, in the topic layer, the representation of the current document D along all topic directions; S5, constraining the degree of similarity among all topic representations; S6, fusing the topic representation vectors in the representation layer into the semantic representation vector Rep of document D; S7, updating the parameters that produce Rep through a classifier and an objective function. The method can efficiently embed text-sequence contextual semantic information and latent topic information into document representation vectors, and these topic-enhanced representation vectors can significantly improve the performance of the text-mining models that use them.

Description

Text document representation method and device based on deep-learning topic information enhancement
Technical field
The present invention relates to the field of computer text representation learning, and in particular to a text document representation method based on deep-learning topic information enhancement and a text document representation device based on deep-learning topic information enhancement.
Background art
Grasping text at the document level and as a whole is an important requirement of many text-processing tasks. At present, this problem is generally addressed through text representation learning. Document-level text representation learning aims to construct a method that converts a text document, according to its semantic information, into a representation vector that can be computed directly. Concretely, a document in text form is represented as a fixed-length real-valued vector that carries its semantics. Document representation learning has become a basic and widely used technique in fields such as natural language processing, text mining and information extraction.
The most widely used document representation learning methods fall roughly into three categories, each with its own shortcomings: (1) methods based on the "bag-of-words" (BoW) model, also called the "vector space model"; the representation vectors generated by this class of models are sparse and non-real-valued, and such vectors are often ineffective in downstream applications; (2) methods based on semantic analysis, such as the "probabilistic latent semantic analysis" model and the "LDA topic model"; these models ignore the contextual information carried by word order in the text, which limits the semantic capacity of the representation vectors; (3) long short-term memory models (LSTM) based on recurrent neural networks, which are widely used to generate distributed representation vectors of text documents; however, a plain LSTM may not be sufficient to capture the corpus-wide, global topic information.
The shortcomings of the above methods reveal the difficulties currently faced by document representation learning: when a model focuses on corpus-wide topic information, the contextual information inside a document is often lost (for example, without context it cannot be determined whether the word "apple" refers to the fruit or to the technology company), whereas when a model focuses on such local information, the global topic information (the correlation between documents) is ignored. Furthermore, without a constraint mechanism between topics, the topics tend to become similar, which degrades model performance (for example, a topic set such as "economy", "entertainment", "tank", "warship" is redundant, since the latter two overlap heavily). All these defects cause the document representation vectors to lack certain semantic information, which in turn limits their effectiveness in other applications.
Summary of the invention
The present invention aims to solve at least one of the technical problems in the related art.
To this end, one object of the present invention is to provide a text document representation method based on deep-learning topic information enhancement, which enables a text document to generate a dense, real-valued representation vector containing both word-order contextual information and topic information.
Another object of the present invention is to provide a text document representation device based on deep-learning topic information enhancement.
To achieve the above objects, an embodiment of one aspect of the present invention provides a text document representation method based on deep-learning topic information enhancement, comprising the following steps:
S1, performing data preprocessing operations of cleaning, extraction, conversion and arrangement on a document D = {w1, w2, ..., wn} composed of n words in a corpus containing K topics, to obtain the word-vector matrix D = {x1, x2, ..., xn} of the document;
S2, constructing a text sequence layer using the sequential relation between words, and running a sequential long short-term memory model to obtain the latent semantic matrix Hs = {h1, h2, ..., hn} of the document, where hi = f1(xi, hi-1), h0 = f1(x0), and f1 is a neural-network node operation;
S3, generating the corresponding attention intensity matrix A = {a1, a2, ..., an} from the latent semantic matrix Hs = {h1, h2, ..., hn}, transposing A and normalizing it by rows to obtain the attention weight matrix A*, where ai = f2(hi) and f2 is a conversion function;
S4, fusing the latent semantic matrix Hs and the attention weight matrix A* to obtain the topic mapping matrix representation VTs of the document over all topics, VTs = f3(Hs, A*), where f3 is a conversion function;
S5, constraining the degree of similarity of the topic mapping matrix VTs using label information across documents, to obtain the topic-information-enhanced mapping matrix VTk;
S6, fusing VTk to obtain the semantic representation vector Rep of document D, where Rep = f4(VTk) and f4 is a fusion function;
S7, classifying Rep with a topic classifier, deriving an error index from the classification accuracy and the topic similarity index, and updating the model parameters of steps S1-S6 by gradient descent on the objective function.
According to the text document representation method based on deep-learning topic information enhancement proposed by the embodiments of the present invention, word-embedding technology is first used to convert the words in text form into word vectors, so that the document becomes a real-valued matrix; a text sequence layer is then built according to the sequential nature of contextual semantic information in text. After passing through the sequence layer, the real-valued matrix of the document becomes a latent semantic matrix carrying contextual semantic information. The corresponding attention weight matrix is then computed from the latent semantic matrix, and the fusion of the two enhances the topic information at a coarser granularity. A topic similarity constraint mechanism then keeps the topics as distinguishable from one another as possible, yielding the representations of all topics of the document. Finally, the representations of all topics are fused into the topic-information-enhanced representation vector of the document. As a result, the text document obtains a dense, real-valued representation vector that contains both word-order contextual information and topic information, with reduced topic redundancy.
To achieve the above objects, an embodiment of another aspect of the present invention provides a text document representation device based on deep-learning topic information enhancement, comprising a text sequence layer, an attention layer, a topic layer and a representation layer. The text sequence layer is used for performing data preprocessing operations of cleaning, extraction, conversion and arrangement on a document D = {w1, w2, ..., wn} composed of n words in a corpus containing K topics, obtaining the word-vector matrix D = {x1, x2, ..., xn} of the document, and passing the word-vector matrix through a sequential long short-term memory model to obtain the latent semantic matrix Hs = {h1, h2, ..., hn} of the document, where hi = f1(xi, hi-1), h0 = f1(x0), and f1 is a neural-network node operation. The attention layer is used for extracting and enhancing the topic information in the text; it connects the two granularities of word level and topic level and realizes the function of extracting unknown information from known information: it generates the corresponding attention intensity matrix A = {a1, a2, ..., an} from the latent semantic matrix Hs = {h1, h2, ..., hn}, transposes A and normalizes it by rows to obtain the attention weight matrix A*, where ai = f2(hi) and f2 is a conversion function, and fuses the latent semantic matrix Hs with the attention weight matrix A*. The topic layer is used for obtaining the mapping matrix representation VTs of all topics of the document, VTs = f3(Hs, A*), where f3 is a conversion function, and for constraining the degree of similarity of VTs using label information across documents, obtaining the topic-information-enhanced mapping matrix VTk. The representation layer is used for fusing VTk to obtain the semantic representation vector Rep of document D, where Rep = f4(VTk) and f4 is a fusion function, for classifying Rep with a topic classifier, deriving an error index from the classification accuracy and the topic similarity index, and updating the model parameters by gradient descent on the objective function.
According to the text document representation device based on deep-learning topic information enhancement proposed by the embodiments of the present invention, word-embedding technology is first used to convert the words in text form into word vectors, so that the document becomes a real-valued matrix; a text sequence layer is then built according to the sequential nature of contextual semantic information in text. After passing through the sequence layer, the real-valued matrix of the document becomes a latent semantic matrix carrying contextual semantic information. The attention layer then computes the corresponding attention weight matrix from the latent semantic matrix, and the fusion of the two enhances the topic information at a coarser granularity. The topic layer uses a topic similarity constraint mechanism to keep the topics as distinguishable from one another as possible, yielding the representations of all topics of the document. Finally, the representations of all topics are fused into the topic-information-enhanced representation vector of the document. As a result, the text document obtains a dense, real-valued representation vector that contains both word-order contextual information and topic information, with reduced topic redundancy.
Compared with the prior art, the present invention has the following advantages:
1. A sequential LSTM model is used to model the word sequence of the text, enabling the model to better fuse the contextual information of the text;
2. A new extraction-type attention mechanism supports the processing of a "sequence-to-tree" structure and is used to extract topic information from the text sequence information. Moreover, it not only embeds the "word-topic" association information of the text into the representation vector, but can also explicitly return the support of each word in the document for the different topics, which can be displayed and inspected as a visualization result;
3. The similarity constraint mechanism introduced in the topic layer alleviates the "long-tail effect" of traditional topic models, in which some topics are so similar to one another that the model degenerates. At the same time, the homogenization problem faced by general attention mechanisms is also resolved: homogenization is caused by there being too few variables in the attention computation, which makes all topic attention weight distributions tend to become identical, and the similarity constraint mechanism adds variables to this computation;
4. The invention is composed of multiple specialized sub-models. As a whole, the model can not only encode the local contextual semantic information within a document, but can also enhance the corpus-level, global latent topic semantic information and embed it into the final document representation vector;
5. The innovation of the invention lies in designing multiple novel sub-models for different kinds of semantic information and combining them into a deep-learning model for document representation learning. The most important innovations are the design of the "sequence-to-tree" attention mechanism and the topic information similarity constraint mechanism. Experiments on different data sets show that the document representation vectors generated by the present invention outperform other classical baseline models in the three major text-mining tasks of text classification, topic detection and text clustering, demonstrating that the present invention can genuinely improve the quality of text representation vectors.
Detailed description of the invention
Fig. 1 is the overall hierarchical structure framework diagram of the present invention.
Fig. 2 is the structure diagram of the attention layer described in steps S3-S4.
Fig. 3 is a schematic diagram of the topic similarity constraint mechanism in step S5.
Fig. 4A compares the results, in the classification experiment, of document representation vectors generated by various algorithms.
Fig. 4B visualizes the correlation between topic difference and document classification accuracy.
Fig. 5 visualizes the effect of the present invention in the topic detection task.
Fig. 6 compares the present invention with classical algorithms in the text clustering task.
Fig. 7 is the flow chart of the text document representation method based on deep-learning topic information enhancement of the present invention.
Specific embodiment
In this embodiment, the experiments of the text document representation method based on deep-learning topic information enhancement of the present invention are completed on the cluster computer of the School of Computer and Information Technology, Shanxi University. The cluster consists of five high-performance computers serving as compute and management nodes, connected by Gigabit Ethernet and a 2.5G InfiniBand network. Each node is configured with an eight-core CPU (Intel Xeon E3-1230 V5, 3.4 GHz base frequency) and 128 GB of memory, and is equipped with two NVIDIA GTX 1080 high-performance graphics cards, allowing large-scale matrix operations and deep-learning model training.
As can be seen from Figs. 1-7, the present invention is divided into several sub-models that handle different kinds of semantic information; they are connected in sequence and finally fused. The learning process mainly comprises the following steps:
S1, performing data preprocessing operations of cleaning, extraction, conversion and arrangement on a document D = {w1, w2, ..., wn} composed of n words in a corpus containing K topics, to obtain the word-vector matrix D = {x1, x2, ..., xn} of the document. The specific steps include:
S11, extracting and cleaning all text data: English data require tokenization, stemming and similar processing, while Chinese data require Chinese word segmentation. Stop words in the data are removed, and documents with too few words (fewer than 6 words) are deleted.
S12, converting all words in the corpus into word vectors using a Word2Vec word-vector model pre-trained on a large corpus. Words that are too rare (absent from the word-vector model) are discarded.
S13, obtaining the labels of the training corpus. There are K labels in total, corresponding to the K topics, and each topic in turn corresponds to a unique one-hot vector used in the supervised learning process. These label vectors, paired with their preprocessed document data, constitute the experimental data.
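As an illustration of steps S11-S13, the following is a minimal preprocessing sketch in Python; the plain-dictionary lookup standing in for the pre-trained Word2Vec model, the helper name and the function signature are assumptions made for illustration and are not fixed by the patent.

```python
import numpy as np

def preprocess(tokens, word2vec, topic_id, num_topics, min_len=6):
    """Steps S11-S13: filter, embed and label one document.

    tokens    : list of word strings after segmentation / stop-word removal
    word2vec  : dict mapping word -> pre-trained embedding vector (np.ndarray)
    topic_id  : index of the document's topic label (0 .. num_topics-1)
    """
    # S12: keep only words present in the pre-trained word-vector model
    kept = [w for w in tokens if w in word2vec]
    # S11: discard documents that are too short after cleaning (< 6 words)
    if len(kept) < min_len:
        return None
    # Word-vector matrix D = {x1, ..., xn}, one row per word
    doc_matrix = np.stack([word2vec[w] for w in kept])
    # S13: one-hot topic label vector used in the supervised learning process
    label = np.zeros(num_topics)
    label[topic_id] = 1.0
    return doc_matrix, label
```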
S2, extracting the latent contextual semantics. The present invention constructs a text sequence layer using the sequential relation between words and designs a sequential long short-term memory model (seq-LSTM), which embeds the word-order contextual information of each word in the document into its representation vector. The specific steps include:
S21, calculating the state of each LSTM gating unit. The LSTM gating units play a controlling role during computation and are modulated flexibly according to the input information; they are divided into the input gate, the output gate and the forget gate, which respectively control the input, the output and the adjustment of historical information of the deep-learning node. The specific calculation is as follows:
It = σ(Wseq·[ht-1, xt] + Bseq)
Ft = σ(Wseq·[ht-1, xt] + Bseq)
Ot = σ(Wseq·[ht-1, xt] + Bseq)
Gt = tanh(Wseq·[ht-1, xt] + Bseq)
where I, F, O and G are the input gate, output gate, forget gate and node-information state respectively, σ denotes the sigmoid activation function, tanh is the hyperbolic tangent function, Wseq and Bseq are respectively the weight matrices and bias vectors of the deep-learning neural network (a separate pair for each gate), and the subscript seq indicates that the parameters belong to the text sequence layer. As the formulas show, every gate state is computed from the historical information and the current word-vector input;
S22, calculating the LSTM hidden state. In the long short-term memory model, the hidden state is the module that stores history and other information. The formula is as follows:
Ct = It·Gt + Ft·Ct-1
where C represents the hidden node state corresponding to a word. It can be seen that this hidden state is influenced by the node-information state and by the historical hidden state, which are regulated by the input gate and the forget gate respectively; this regulation is realized by element-wise multiplication between vectors. In short, the hidden state of the current word balances the current input against the historical state according to the semantic information;
S23, calculating the LSTM node state. After the hidden state corresponding to the current word of the document is obtained, the hidden state is activated to obtain the latent contextual semantic state corresponding to that word:
ht = Ot·tanh(Ct)
As the equation shows, the hyperbolic tangent function is chosen as the activation function, and the activation value, after being regulated by the output gate, is used as the node state in subsequent computation;
S24, recording the results of the text sequence layer. The document D = {x1, x2, ..., xn} passes through the text sequence layer and generates the corresponding semantic state matrix Hs = {h1, h2, ..., hn} and hidden state matrix Cs = {C1, C2, ..., Cn}; these two matrices contain the contextual semantic information within document D. For example, the word vector of "cry" is the same in "cry for joy" and "cry from sadness", but after the sequence layer the representation vectors (node states h) of the two occurrences of "cry" differ because their contexts differ.
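The following numpy sketch walks through one step of the sequential LSTM described in S21-S23 and its application over a whole document in S24; concatenating ht-1 with xt and using one weight matrix per gate are assumptions of this sketch rather than details fixed by the patent.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def seq_lstm_step(x_t, h_prev, c_prev, W, b):
    """One seq-LSTM step (S21-S23); W and b hold one matrix/vector per gate."""
    z = np.concatenate([h_prev, x_t])       # gates depend on history + current input
    i_t = sigmoid(W['I'] @ z + b['I'])      # input gate
    f_t = sigmoid(W['F'] @ z + b['F'])      # forget gate
    o_t = sigmoid(W['O'] @ z + b['O'])      # output gate
    g_t = np.tanh(W['G'] @ z + b['G'])      # node information state
    c_t = i_t * g_t + f_t * c_prev          # S22: Ct = It*Gt + Ft*Ct-1
    h_t = o_t * np.tanh(c_t)                # S23: ht = Ot*tanh(Ct)
    return h_t, c_t

def run_sequence_layer(doc_matrix, W, b, hidden_dim):
    """S24: produce the semantic state matrix Hs and hidden state matrix Cs."""
    h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
    Hs, Cs = [], []
    for x_t in doc_matrix:
        h, c = seq_lstm_step(x_t, h, c, W, b)
        Hs.append(h)
        Cs.append(c)
    return np.stack(Hs), np.stack(Cs)
```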
S3. In order to enhance the topic information within the contextual semantic information of the document, the sequence elements must be transitioned to higher-level topic information. The present invention proposes a new extraction-type attention mechanism constructed on top of the text sequence layer, as shown in Fig. 2. Previous attention mechanisms usually connect two sequential structures, whereas the present invention connects a sequence with tree nodes: each sequence element represents a position in the word sequence of the document, and each tree node represents a topic. Moreover, in a general attention mechanism both structures are known information, while the extraction-type mechanism of the present invention extracts latent information (the topics) from the known information. The specific steps are as follows:
S31, obtaining the attention intensity. The attention intensity is computed from the contextual semantic information of the document: each semantic state ht is mapped through the attention-layer parameters to an intensity vector at, where Watt and batt are respectively the weight matrix and bias-vector parameter of the attention layer, and at is a K-dimensional vector whose dimensions represent the attention intensity of the t-th word of the document toward the corresponding topics.
S32, calculating the attention weight matrix. The attention intensity matrix A = {a1, a2, ..., an} obtained after step S31 is an n × K matrix; it is first transposed into a K × n matrix, so that each of its rows indicates the attention (expression) intensity of the current document's text sequence with respect to a certain topic. For example, for the word "apple" at a certain position in a certain document: to what degree it expresses topic 1, to what degree it expresses topic 2, and so on (unlike previous attention mechanisms, the concrete content of the topics does not need to be specified here and may even be unknown).
This intensity distribution is then normalized row by row into probability form with the softmax function, and the normalized attention weight matrix A* is recorded.
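A minimal sketch of S31-S32 follows. The patent only fixes that at = f2(ht) is a K-dimensional conversion computed with Watt and batt; the tanh activation chosen here is an assumption, while the transpose and row-wise softmax follow the description above.

```python
import numpy as np

def attention_weights(Hs, W_att, b_att):
    """Steps S31-S32: extraction-type attention from sequence states to K topics.

    Hs    : (n, d) semantic state matrix from the sequence layer
    W_att : (d, K) attention parameters, b_att : (K,) bias
    """
    A = np.tanh(Hs @ W_att + b_att)          # S31: intensity a_t per word (assumed tanh)
    A = A.T                                  # S32: one row per topic over the word sequence
    A = A - A.max(axis=1, keepdims=True)     # numerical stability for softmax
    expA = np.exp(A)
    return expA / expA.sum(axis=1, keepdims=True)   # attention weight matrix A*, (K, n)
```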
S4. In the topic layer, the attention weight matrix A* produced by the attention layer and the contextual semantic information Hs of the document produced by the text sequence layer are fused. The semantic information Hs is combined according to the corresponding attention weights; since the weights reflect how strongly each representation vector expresses each topic, the latent topic information contained in the original semantic information is thereby enhanced and emphasized. This finally produces the representation of the current document D along all topic directions, which can also be regarded as the mapping of its semantic information onto all topic spaces (intuitively, what an article about Apple Inc. looks like from the perspective of different topics such as "technology", "economy" or "politics"). As shown in Figs. 1 and 2, the model has K topic representation nodes VTs, corresponding to all topics in the corpus, and VTCs is the hidden state generated for VTs; since LSTM-type deep-learning nodes are used, each topic node takes as input the attention-weighted sum of the semantic states, i.e. the corresponding row of A*·Hs. Both VTs and VTCs have K rows; each row vector is the information representation vector of the corresponding topic and also corresponds to one LSTM-type node. Thus the topic representation of document D is obtained by summing all of its contextual semantic information weighted by its expression intensity toward each topic.
Here, according to the respective characteristics of the global topic information of the text and the local contextual semantic information, multiple sub-structures are designed and stacked into a composite, which is then used as a whole to learn the semantic representation of the document. Such a design allows each type of semantic information to be handled by a dedicated module; because the different kinds of information differ greatly, the modules cannot simply be stacked for integration, and for this reason the extraction-type attention mechanism is designed to bridge the semantic module and the topic module.
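As a sketch of the topic-layer fusion in S4: each topic node receives the attention-weighted sum of the sequence states and passes it through an LSTM-type node. Reusing the seq_lstm_step function from the sequence-layer sketch above with a zero initial state is an assumption made here for illustration.

```python
import numpy as np

def topic_layer(Hs, A_star, W, b, hidden_dim):
    """Step S4: map the document onto K topic directions.

    Hs     : (n, d) semantic state matrix
    A_star : (K, n) row-normalized attention weight matrix
    W, b   : LSTM-type gate parameters shared by the K topic nodes
    """
    topic_inputs = A_star @ Hs                       # (K, d): weighted sum per topic
    h0, c0 = np.zeros(hidden_dim), np.zeros(hidden_dim)
    VTs, VTCs = [], []
    for k in range(topic_inputs.shape[0]):
        vt_k, vtc_k = seq_lstm_step(topic_inputs[k], h0, c0, W, b)
        VTs.append(vt_k)                             # topic representation vector VT_k
        VTCs.append(vtc_k)                           # its hidden state VTC_k
    return np.stack(VTs), np.stack(VTCs)             # each of shape (K, hidden_dim)
```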
S5, constraining the degree of similarity among all VT. As described above, the topic representations generated by previous models may tend to converge; for example, a corpus may contain a "military" topic, but the model decomposes it into topics such as "weapons" and "army" while other topics that should appear are forced to merge. This situation is common in corpora where the numbers of documents per topic differ greatly. In the model of the present invention, this problem manifests itself as the VT representation vectors being mathematically too close to one another, so that the information of the K topics is significantly incomplete and the model performance degrades. Therefore, in the topic layer, the present invention designs a unique topic information similarity constraint mechanism, as shown in Fig. 3. Here L is a topic label vector of length K in "one-hot" form (one position close to 1, the remaining positions close to 0). The basic principle of the constraint mechanism is to make the comparison vector v generated from each topic representation vector gradually approach its label L through training; since the L vectors are highly orthogonal to one another, the topic information representation vectors are thereby also driven apart to a large degree of difference. The specific implementation steps of the similarity constraint mechanism of the present invention are as follows:
S51, topic representation vector conversion. The dimensions of the topic representation vectors VT and VTC need not equal K, in which case they cannot be compared mathematically with the topic label L; their length is therefore first converted: each topic's representation is mapped through the parameters Ws and Bs and the sigmoid activation to a comparison vector vk, where Ws and Bs are the weight matrix parameter and bias matrix parameter of the topic information similarity constraint mechanism, σ is again the sigmoid activation function, the length of the comparison vector vk is K, and each document has K comparison vectors, each corresponding to one topic;
S52, similarity measurement. The present invention uses the cross-entropy between a comparison vector and its topic label vector as the similarity measure sk. The smaller the value of sk, the more similar the comparison vector vk and the topic label vector Lk, which in turn shows that the topic information vectors VTk and VTCk that generate vk differ more strongly from the other topic vectors;
S53, topic similarity score calculation. After the similarity scores of all topics are obtained, they are averaged to obtain the topic information similarity comprehensive score S. The smaller the value of S, the smaller the similarity among the topic information representation vectors, the smaller the topic information redundancy, and the more comprehensive the topic information in the document representation vectors generated by the present invention may be. In the training stage, the present invention minimizes S through error back-propagation of the objective function and parameter updates.
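A sketch of S51-S53 under stated assumptions: the comparison vector vk is formed here from the concatenation of VTk and VTCk, and the similarity measure is written as the standard cross-entropy between the one-hot label Lk and vk; both readings are illustrative rather than fixed by the patent text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def topic_similarity_score(VTs, VTCs, labels, W_s, b_s, eps=1e-8):
    """Steps S51-S53: topic information similarity constraint.

    VTs, VTCs : (K, d) topic representation vectors and their hidden states
    labels    : (K, K) one-hot topic label vectors L, mutually orthogonal
    W_s, b_s  : parameters mapping each topic to a K-dimensional comparison vector
    """
    scores = []
    for k in range(VTs.shape[0]):
        # S51: convert the topic representation into the comparison vector v_k
        v_k = sigmoid(W_s @ np.concatenate([VTs[k], VTCs[k]]) + b_s)
        # S52: cross-entropy between v_k and the one-hot label L_k
        scores.append(-np.sum(labels[k] * np.log(v_k + eps)))
    # S53: comprehensive similarity score S, minimized during training
    return float(np.mean(scores))
```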
S6, fusing the topic representation vectors in the representation layer into the semantic representation vector Rep of document D. Step S5 yields K topic information representation vectors; in the representation layer, the present invention uses a tree-shaped LSTM model that treats these topic representation vectors as the leaf nodes of a tree and the final document representation vector Rep as the parent node, obtained from the child nodes through LSTM-type operations, so that the semantic information converges in the parent node. The specific steps are as follows:
S61, tree-shaped LSTM gating unit state computation. First, the input gate, output gate and node-information state of the tree-shaped LSTM parent node are computed; the algorithm differs slightly from the preceding sequential part in that the gates are computed from the aggregated data of the K representation vectors, with Wtr and Btr denoting the weight matrix and bias matrix of the tree-shaped representation layer. A single set of gating units is generated after the data of the K representation vectors are aggregated, with no differentiation between topics, because all the enhanced topic information is merged into the final state vectors I, O and G;
S62, special forget-gate state computation. Unlike the remaining gating units, in the tree of the present invention the forget gate plays the role of controlling the flow of information from a child node to the parent node; therefore every child node owns its own forget gate, and the forget-gate computations of the different nodes are mutually independent (to preserve the independence between topic information). Accordingly, the forget-gate state of the k-th topic child node is computed individually from the semantic information contained in that topic's representation vector;
S63, hidden state computation. In a sequential structure the hidden state of an LSTM node stores historical information, whereas in the tree-shaped structure the hidden state of the parent node stores information coming from the child nodes; as mentioned in step S62, this child-node information reaches the parent node under the control of the respective forget gates. When the parent node computes its hidden state, it combines the regulated information of these child nodes;
S64, document representation vector generation. In this step, the hidden state of the parent node is first passed through the activation function and the output gate to obtain the node state vector, and the representation vector Rep of the current document is finally obtained through a dimension-adjustment layer. The specific calculation is as follows:
h = O·tanh(C)
Rep = σ(Wr·h + br)
where Wr and br are parameters of the deep-learning neural network. Since the required length of the document representation vector may differ from the deep-learning hidden-layer dimension, the present invention adds an additional vector-length adjustment operation.
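The following sketch assembles S61-S64. Summing the K topic vectors as the shared gate input and feeding each child's own VTk into its forget gate are assumptions of this sketch; the description above only fixes the roles of the gates and the per-child independence of the forget gates. The parent state is assumed to have the same dimension d as the topic vectors.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def representation_layer(VTs, VTCs, W_tr, b_tr, W_r, b_r):
    """Steps S61-S64: fuse K topic vectors into the document vector Rep.

    VTs, VTCs  : (K, d) topic representation vectors and their hidden states
    W_tr, b_tr : tree-LSTM gate parameters, keys 'I', 'O', 'G', 'F', each (d, d) / (d,)
    W_r, b_r   : final dimension-adjustment parameters
    """
    agg = VTs.sum(axis=0)                            # S61: aggregated topic information
    i_gate = sigmoid(W_tr['I'] @ agg + b_tr['I'])
    o_gate = sigmoid(W_tr['O'] @ agg + b_tr['O'])
    g_state = np.tanh(W_tr['G'] @ agg + b_tr['G'])
    # S62-S63: one independent forget gate per topic child; the parent hidden
    # state combines the child hidden states regulated by their forget gates
    c_parent = i_gate * g_state
    for k in range(VTs.shape[0]):
        f_k = sigmoid(W_tr['F'] @ VTs[k] + b_tr['F'])
        c_parent = c_parent + f_k * VTCs[k]
    # S64: node state through the output gate, then dimension adjustment
    h_parent = o_gate * np.tanh(c_parent)
    return sigmoid(W_r @ h_parent + b_r)             # document representation Rep
```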
S7, classifier layer and objective function. In order to train the model of the present invention, after the semantic representation vectors of the documents are obtained, these vectors are classified by a topic classifier and the classification accuracy is recorded; together with the topic similarity index this yields the overall error index of the current model on document D. The error is then propagated back through the deep-learning model, and the model parameters of the present invention are updated by gradient descent on the objective function. In the objective function of the present invention, the parameter λ balances classification precision against topic difference, g is the topic category label of document D, and p is the classification result produced by the classifier from the document representation Rep.
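As an illustration of S7, the sketch below writes the training objective as the classification cross-entropy between the label g and the softmax prediction p plus λ times the topic similarity score S from step S53; combining the two terms by simple addition is an assumption, since the text only states that λ balances classification precision against topic difference.

```python
import numpy as np

def objective(rep, g_onehot, W_cls, b_cls, topic_similarity_S, lam=0.2, eps=1e-8):
    """Step S7: topic classifier plus objective function.

    rep                : document representation vector Rep
    g_onehot           : one-hot topic label g of document D
    W_cls, b_cls       : softmax classifier parameters
    topic_similarity_S : comprehensive similarity score S from step S53
    lam                : trade-off parameter lambda (0.2 in the experiments below)
    """
    logits = W_cls @ rep + b_cls
    logits = logits - logits.max()                   # numerical stability
    p = np.exp(logits) / np.exp(logits).sum()        # classifier prediction p
    class_loss = -np.sum(g_onehot * np.log(p + eps)) # classification error on g
    return class_loss + lam * topic_similarity_S     # minimized by gradient descent
```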
Text representation vectors produced by a good representation learning method contain more, and more accurate, semantic information, so natural language processing tasks that use these vectors perform better. The present invention therefore tests the generated document representation vectors on the three most widely used tasks: text classification, topic detection and text clustering.
Fig. 4A and Fig. 4B show the experimental performance of the document representation vectors generated by the present invention in topic classification; they are the classification-precision experiment and the topic information similarity validity experiment respectively. To verify the classification performance of the representation vectors, three text corpora were used; 90% of the documents of each corpus were used for training and the rest for testing. The word-vector dimension, deep-learning hidden-layer dimension and representation-vector dimension were chosen as 50, 100 and 50 respectively; the objective-function parameter λ = 0.2, the initial learning rate was 0.1, and the learning method was Adagrad. Referring to Fig. 4A, on almost all corpora the accuracy of the present invention (TE-LSTM) is better than that of the other classical comparison algorithms, and the result with the topic information similarity constraint mechanism (with SC) is better than the result without it (without SC). This shows that the representation learning method proposed by the present invention increases the amount of semantic information in the representation vectors, and that the topic information similarity constraint mechanism clearly plays a positive role. In Fig. 4B, the abscissa represents the degree of difference between topic information (a larger value indicates lower topic information similarity), and the ordinate represents the classification accuracy of the documents within that difference interval. The curve of Fig. 4B shows that as the difference between topic information grows, the classification accuracy of the representation vectors also gradually rises, which again demonstrates the validity of the topic information similarity constraint mechanism of the present invention: it reduces the topic information redundancy of the model and improves the information representation capability of the vectors.
Fig. 5 shows the performance of the document representation vectors generated by the present invention in the topic detection task. The leftmost column of the table gives the model names: lda2vec, the present invention without the topic information constraint mechanism, and the present invention with the constraint mechanism. The second column gives the topic labels in the corpus; 4 of the 20 topics are listed. The third column gives the topic keywords detected from the corpus: for each topic these are the 5 words with the highest criticality computed by the model, where in the present invention the criticality of a word is its attention weight toward the topic. The values in the last column are the topic coherence computed by the online platform Palmetto from the 5 keywords; a higher score indicates that the semantics of these keywords are closer and that they are more likely to originate from the same topic. The analysis of the figure shows that the present invention achieves clearly better experimental results from both the qualitative and the quantitative perspective, and, as in the classification experiment, the model using the topic information similarity constraint mechanism performs better, again demonstrating that all the designs of the present invention improve the quality of the representation vectors.
Fig. 6 shows the performance of the document semantic representation vectors generated by the present invention in the text clustering task. Representation learning is the task of converting text-form data into representation vectors that can be computed directly, and the semantic information of the text can generally be reflected intuitively through such computation; for example, in word vectors, the closer the meanings of two words, the smaller the vector distance between them. Similarly, the degree of correlation between documents can be judged by computing the distance between their representation vectors: the better the quality of the representation vectors, the more consistent the correlation between documents is with the vector distance. The text clustering task is therefore used to test the performance of the vectors generated by the present invention: the more the documents belonging to the same topic are gathered into the same cluster, the better the performance of the vectors and the better the representation learning of the present invention. The values in Fig. 6 are computed as follows: for each cluster, the topic with the largest document share is identified; if that topic has no corresponding cluster yet, its document share is recorded; if the topic already has a corresponding cluster, the topic with the next-highest share is selected until a topic that has not yet been assigned is found; after all clusters have corresponding topics, the average of the recorded document shares of all these topics is taken as the text clustering score of the model. Referring to Fig. 6, the representation vectors of the present invention achieve the best clustering effect, and the model using the topic information similarity constraint mechanism obtains the highest score, proving that the present invention can generate document semantic representation vectors of better quality.
In conclusion, for a document D = {w1, w2, ..., wn} composed of n words in a corpus containing K topics, the present invention adopts the following technical solution:
The word-vector matrix D = {x1, x2, ..., xn} of the document is obtained by pre-training. In the context sequence, each word has the corresponding latent semantics hi = f1(xi, hi-1), h0 = f1(x0), where f1 is a conversion function. In this way, even the same word has different latent semantics in different contexts (i.e. the same word at different positions in the text has different representation vectors), and this difference is precisely the proof that contextual semantic information has been incorporated. In addition, f1 in the formula can be a neural-network node operation.
Regarding topic information acquisition: the corresponding attention intensity matrix A = {a1, a2, ..., an} is generated from the latent semantic matrix H = {h1, h2, ..., hn} of the document, where ai = f2(hi) is a K-dimensional vector whose dimensions represent the attention intensity (or "expression intensity") of the i-th word in the sequence toward each topic, and f2 is a conversion function. Finally, A is transposed and normalized by rows to obtain the attention weight matrix A*.
Regarding topic information enhancement: the contextual semantic information of the document and the attention weights are combined to generate the mapping matrix (VT) of the document over all topics, VT = f3(H, A*), where f3 is a conversion function; each row of VT corresponds to one topic and represents the information of that topic contained in document D. After this part, the topic information of the document has been individually enhanced.
Regarding topic information control: the label information across the whole corpus is used to constrain the document topic information obtained in the previous stage. Each topic has its fixed label vector L; for example, Li is used to constrain VTi, in a manner similar to how supervision information controls a neural-network classifier, with L as the supervision information. Since the label vectors of the different topics are highly orthogonal to one another, the topic information constrained by these labels naturally also becomes highly different.
The topic-enhanced semantic information is fused into the document representation vector. The topics, which have no connection with one another, are merged back into one representation vector, forming a typical tree. Unlike the common weighted-combination approach, the absence of weights requires that all topic vectors be merged in a more integrative way. If this fusion is f4 and the representation vector of document D is Rep, then Rep = f4(VT). During training, a classifier is placed on top of Rep and trained with the category vectors of the documents, and the model is updated by error back-propagation and gradient descent.
As a result, the text document obtains a dense, real-valued representation vector that contains both word-order contextual information and topic information, with reduced topic redundancy.
In the description of the present invention, it should be understood that terms such as "center", "longitudinal", "transverse", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise", "counterclockwise", "axial", "radial" and "circumferential" indicate orientations or positional relationships based on those shown in the drawings; they are used merely to facilitate and simplify the description of the present invention and do not indicate or imply that the devices or elements referred to must have a specific orientation or be constructed and operated in a specific orientation, and therefore shall not be construed as limiting the present invention.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and shall not be understood as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "plurality" means at least two, for example two or three, unless otherwise specifically defined.
In the present invention, unless otherwise expressly specified and limited, the terms "mounted", "connected", "coupled", "fixed" and the like shall be understood in a broad sense; for example, a connection may be a fixed connection, a detachable connection or an integral connection; it may be a mechanical connection or an electrical connection; it may be a direct connection, an indirect connection through an intermediate medium, an internal communication between two elements, or an interaction relationship between two elements, unless otherwise expressly limited. For those of ordinary skill in the art, the specific meanings of the above terms in the present invention can be understood according to the specific situation.
In the present invention, unless otherwise expressly specified and limited, a first feature being "on" or "under" a second feature may mean that the first and second features are in direct contact, or that they are in indirect contact through an intermediate medium. Moreover, a first feature being "on", "above" or "over" a second feature may mean that the first feature is directly above or obliquely above the second feature, or may merely mean that the level of the first feature is higher than that of the second feature.
In the description of this specification, references to the terms "one embodiment", "some embodiments", "example", "specific example", "some examples" and the like mean that specific features, structures, materials or characteristics described in connection with the embodiment or example are included in at least one embodiment or example of the present invention. In this specification, the schematic expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials or characteristics described may be combined in a suitable manner in any one or more embodiments or examples. In addition, those skilled in the art may combine the different embodiments or examples, and the features of the different embodiments or examples, described in this specification, provided they do not contradict one another.
The description given herein in connection with the drawings and the specific embodiments is merely intended to help understand the method and core idea of the present invention. The method of the present invention is not limited to the embodiments described in the specific implementation; other embodiments obtained by those skilled in the art according to the method and idea of the present invention also fall within the scope of technical innovation of the present invention. The content of this specification shall not be construed as limiting the present invention.

Claims (9)

1. A text document representation method based on deep-learning topic information enhancement, characterized by comprising the following steps:
S1, performing data preprocessing operations of cleaning, extraction, conversion and arrangement on a document D = {w1, w2, ..., wn} composed of n words in a corpus containing K topics, to obtain the word-vector matrix D = {x1, x2, ..., xn} of the document;
S2, constructing a text sequence layer using the sequential relation between words, and running a sequential long short-term memory model to obtain the latent semantic matrix Hs = {h1, h2, ..., hn} of the document, where hi = f1(xi, hi-1), h0 = f1(x0), and f1 is a neural-network node operation;
S3, generating the corresponding attention intensity matrix A = {a1, a2, ..., an} from the latent semantic matrix Hs = {h1, h2, ..., hn}, transposing A and normalizing it by rows to obtain the attention weight matrix A*, where ai = f2(hi) and f2 is a conversion function;
S4, fusing the latent semantic matrix Hs and the attention weight matrix A* to obtain the topic mapping matrix representation VTs of the document over all topics, VTs = f3(Hs, A*), where f3 is a conversion function;
S5, constraining the degree of similarity of the topic mapping matrix VTs using label information across documents, to obtain the topic-information-enhanced mapping matrix VTk;
S6, fusing VTk to obtain the semantic representation vector Rep of document D, where Rep = f4(VTk) and f4 is a fusion function;
S7, classifying Rep with a topic classifier, deriving an error index from the classification accuracy and the topic similarity index, and updating the model parameters of steps S1-S6 by gradient descent on the objective function.
2. The text document representation method based on deep-learning topic information enhancement according to claim 1, characterized in that S1 comprises the following steps:
S11, extracting and cleaning all text data: English data are tokenized and stemmed, Chinese data undergo Chinese word segmentation, stop words in the text data are removed, and documents with fewer than six words are deleted;
S12, converting all words in the corpus into word vectors using a Word2Vec word-vector model pre-trained on a large corpus.
3. The text document representation method based on deep-learning topic information enhancement according to claim 1, characterized in that S2 comprises the following steps:
S21, running the sequential long short-term memory model, i.e. the LSTM model, in which the input gate It, output gate Ot, forget gate Ft and node-information state Gt are computed from the previous hidden state and the current word-vector input, where σ denotes the sigmoid activation function, tanh is the hyperbolic tangent function, Wseq is the weight matrix of the deep-learning neural network, Bseq is the bias vector of the deep-learning neural network, and the subscript seq indicates that the parameters belong to the text sequence layer;
S22, calculating the hidden state Ct corresponding to the current word of the document according to the LSTM model, as follows,
Ct = It·Gt + Ft·Ct-1
S23, activating the hidden state Ct according to the LSTM model and the hidden state Ct corresponding to the current word of the document, to obtain the latent contextual semantic state corresponding to that word, as follows,
ht = Ot·tanh(Ct)
S24, recording the results of the text sequence layer: the document D = {x1, x2, ..., xn} passes through the text sequence layer and generates the corresponding semantic state matrix Hs = {h1, h2, ..., hn} and the hidden state matrix Cs = {C1, C2, ..., Cn}, these two matrices containing the contextual semantic information within document D.
4. The text document representation method based on deep-learning topic information enhancement according to claim 1, characterized in that S3 comprises the following steps:
S31, obtaining the attention intensity at from the contextual semantic information of document D, where at is a K-dimensional vector representing the attention intensity of the t-th word of the document toward the corresponding topics, and Watt and batt are respectively the weight matrix and bias-vector parameter of the attention layer;
S32, calculating the attention weight matrix: the attention intensity matrix A = {a1, a2, ..., an} obtained after step S31 is an n × K matrix, which is first transposed into a K × n matrix; this intensity distribution is then normalized into probability form by the softmax function, and the normalized attention weight matrix A* is recorded.
5. The text document representation method based on deep-learning topic information enhancement according to claim 1, characterized in that S4 comprises the following steps:
fusing the latent semantic matrix Hs and the attention weight matrix A* to obtain the mapping matrix representation of the current document D over all topics, where VTs corresponds to all K topics in the corpus and VTCs is the hidden state corresponding to VTs, both VTs and VTCs having K rows, each row vector being the information representation vector of the corresponding topic.
6. The text document representation method based on deep-learning topic information enhancement according to claim 1, characterized in that S5 comprises the following steps:
S51, topic representation vector conversion: the dimensions of the topic representation vectors VT and VTC need not equal K, so their length is first converted, each topic being mapped through the parameters Ws and Bs and the sigmoid activation σ to a comparison vector vk of length K, where Ws and Bs are the weight matrix parameter and bias matrix parameter of the topic information similarity constraint mechanism, and each document has K comparison vectors, each corresponding to one topic;
S52, similarity measurement: the cross-entropy between a comparison vector and its topic label vector is used as the similarity measure sk; the smaller the value of sk, the more similar the comparison vector vk and the topic label vector Lk, which proves that the topic information vectors VTk and VTCk generating vk differ more strongly from the other topic vectors, where L is a topic label vector of length K in one-hot form; the training corpus has K topic labels corresponding to the K topics, each topic corresponds to a unique one-hot vector used in the supervised learning process, and these label vectors, paired with their preprocessed document data, constitute the experimental data;
S53, topic similarity score calculation: after the similarity scores of all topics are obtained, they are averaged to obtain the topic information similarity comprehensive score S; the smaller the value of S, the smaller the topic information redundancy and the more comprehensive the topic information in the generated document representation vectors; in the training stage, S is minimized through error back-propagation of the objective function and parameter updates.
7. The text document representation method based on deep-learning topic information enhancement according to claim 1, characterized in that S6 comprises the following steps:
S61, tree-shaped LSTM gating unit state computation: the input gate, output gate and node-information state of the tree-shaped LSTM parent node are first computed from the aggregated data of the K representation vectors, with Wtr and Btr denoting the weight matrix and bias matrix of the tree-shaped representation layer; a single set of gating units is generated, with no differentiation between topics, because all the enhanced topic information is merged into the final state vectors I, O and G;
S62, special forget-gate state computation: unlike the remaining gating units, in the tree-shaped LSTM model structure every child node owns its own forget gate, and the forget-gate computations of the different nodes are mutually independent, the forget gate playing the role of controlling the flow of information from the child node to the parent node, and the forget-gate state of the k-th topic child node being computed individually;
S63, hidden state computation: in the tree-shaped LSTM model structure the hidden state of the parent node stores the information coming from the child nodes, and when the parent node computes its hidden state it combines the child-node information regulated by the forget gates;
S64, document representation vector generation: the hidden state of the parent node is first passed through the activation function and the output gate to obtain the node state vector, and the representation vector Rep of the current document is finally obtained through a dimension-adjustment layer, calculated as follows:
h = O·tanh(C)
Rep = σ(Wr·h + br)
where Wr and br are parameters of the deep-learning neural network.
8. The text document representation method based on deep-learning topic information enhancement according to claim 1, characterized in that S7 comprises the following steps:
setting the classifier and the objective function: the semantic representation vector Rep of the document is classified by the topic classifier and the classification result is recorded; together with the topic similarity index this yields the systematic error index of the current document D; the error is then propagated back through the deep-learning model, and the model parameters are updated by gradient descent on the objective function, in which the parameter λ balances classification precision against topic difference, g is the topic category label of document D, and p is the classification result.
9. A text document representation device based on deep-learning topic information enhancement, characterized by comprising:
a text sequence layer, configured to perform data preprocessing operations of cleaning, extraction, conversion and arrangement on a document D = {w1, w2, ..., wn} composed of n words in a corpus containing K topics, to obtain the word-vector matrix D = {x1, x2, ..., xn} of the document, and to pass the word-vector matrix D = {x1, x2, ..., xn} through a sequential long short-term memory model to obtain the latent semantic matrix Hs = {h1, h2, ..., hn} of the document, where hi = f1(xi, hi-1), h0 = f1(x0), and f1 is a neural-network node operation;
an attention layer, configured to extract and enhance the topic information in the text, connecting the two granularities of word level and topic level and realizing the function of extracting unknown information from known information: it generates the corresponding attention intensity matrix A = {a1, a2, ..., an} from the latent semantic matrix Hs = {h1, h2, ..., hn}, transposes A and normalizes it by rows to obtain the attention weight matrix A*, where ai = f2(hi) and f2 is a conversion function, and fuses the latent semantic matrix Hs with the attention weight matrix A*;
a topic layer, configured to obtain the mapping matrix representation VTs of all topics of the document, VTs = f3(Hs, A*), where f3 is a conversion function, and to constrain the degree of similarity of the topic mapping matrix VTs using label information across documents, obtaining the topic-information-enhanced mapping matrix VTk;
a representation layer, configured to fuse VTk to obtain the semantic representation vector Rep of document D, where Rep = f4(VTk) and f4 is a fusion function, to classify Rep with a topic classifier, to derive an error index from the classification accuracy and the topic similarity index, and to update the model parameters by gradient descent on the objective function.
CN201810999545.6A 2018-08-30 2018-08-30 Text document representation method and device based on deep learning topic information enhancement Active CN109241377B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810999545.6A CN109241377B (en) 2018-08-30 2018-08-30 Text document representation method and device based on deep learning topic information enhancement

Publications (2)

Publication Number Publication Date
CN109241377A true CN109241377A (en) 2019-01-18
CN109241377B CN109241377B (en) 2021-04-23

Family

ID=65069456

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810999545.6A Active CN109241377B (en) 2018-08-30 2018-08-30 Text document representation method and device based on deep learning topic information enhancement

Country Status (1)

Country Link
CN (1) CN109241377B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170192956A1 (en) * 2015-12-31 2017-07-06 Google Inc. Generating parse trees of text segments using neural networks
WO2018085722A1 (en) * 2016-11-04 2018-05-11 Salesforce.Com, Inc. Quasi-recurrent neural network
CN107092596A (en) * 2017-04-24 2017-08-25 重庆邮电大学 Text emotion analysis method based on attention CNNs and CCR
CN107168957A (en) * 2017-06-12 2017-09-15 云南大学 A kind of Chinese word cutting method
CN107368613A (en) * 2017-09-05 2017-11-21 中国科学院自动化研究所 Short text sentiment analysis method and device
CN108460013A (en) * 2018-01-30 2018-08-28 大连理工大学 A kind of sequence labelling model based on fine granularity vocabulary representation model
CN108460019A (en) * 2018-02-28 2018-08-28 福州大学 A kind of emerging much-talked-about topic detecting system based on attention mechanism
CN108446275A (en) * 2018-03-21 2018-08-24 北京理工大学 Long text emotional orientation analytical method based on attention bilayer LSTM

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
CEDRIC DE BOOM et al.: "Representation learning for very short texts using weighted word embedding aggregation", Pattern Recognition Letters *
ZENGJIAN LIU et al.: "Entity recognition from clinical texts via recurrent neural network", BMC Medical Informatics and Decision Making *
ZHUANG LIRONG et al.: "Text sentiment classification based on a CSLSTM network", Computer Systems & Applications *
PANG YUMING: "A text classification method based on deep learning and Labeled-LDA", China Master's Theses Full-text Database, Information Science and Technology Series *
ZHAO QINLU et al.: "Text feature extraction method based on LSTM-Attention neural network", Modern Electronics Technique *

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110135551B (en) * 2019-05-15 2020-07-21 西南交通大学 Robot chatting method based on word vector and recurrent neural network
CN110135551A (en) * 2019-05-15 2019-08-16 西南交通大学 A kind of robot chat method of word-based vector sum Recognition with Recurrent Neural Network
CN110298038B (en) * 2019-06-14 2022-12-06 北京奇艺世纪科技有限公司 Text scoring method and device
CN110298038A (en) * 2019-06-14 2019-10-01 北京奇艺世纪科技有限公司 A kind of text scoring method and device
CN110489563B (en) * 2019-07-22 2022-08-23 北京百度网讯科技有限公司 Method, device, equipment and computer readable storage medium for representing graph structure
CN110489563A (en) * 2019-07-22 2019-11-22 北京百度网讯科技有限公司 Representation method, device, equipment and the computer readable storage medium of graph structure
CN111339762A (en) * 2020-02-14 2020-06-26 广州大学 Topic representation model construction method and device based on hybrid intelligence
CN111339762B (en) * 2020-02-14 2023-04-07 广州大学 Topic representation model construction method and device based on hybrid intelligence
CN111339783A (en) * 2020-02-24 2020-06-26 东南大学 RNTM-based topic mining method and device
CN111339783B (en) * 2020-02-24 2022-11-25 东南大学 RNTM-based topic mining method and device
CN111639189A (en) * 2020-04-29 2020-09-08 西北工业大学 Text graph construction method based on text content features
CN111639189B (en) * 2020-04-29 2023-03-21 西北工业大学 Text graph construction method based on text content features
CN111738303A (en) * 2020-05-28 2020-10-02 华南理工大学 Long-tail distribution image identification method based on hierarchical learning
CN111738303B (en) * 2020-05-28 2023-05-23 华南理工大学 Long-tail distribution image recognition method based on hierarchical learning
CN111858931B (en) * 2020-07-08 2022-05-13 华中师范大学 Text generation method based on deep learning
CN111858931A (en) * 2020-07-08 2020-10-30 华中师范大学 Text generation method based on deep learning
CN111949790A (en) * 2020-07-20 2020-11-17 重庆邮电大学 Emotion classification method based on LDA topic model and hierarchical neural network
CN111966792A (en) * 2020-09-03 2020-11-20 网易(杭州)网络有限公司 Text processing method and device, electronic equipment and readable storage medium
CN111966792B (en) * 2020-09-03 2023-07-25 网易(杭州)网络有限公司 Text processing method and device, electronic equipment and readable storage medium
CN112861982A (en) * 2021-02-24 2021-05-28 佛山市南海区广工大数控装备协同创新研究院 Long-tail target detection method based on gradient average
CN115563284A (en) * 2022-10-24 2023-01-03 重庆理工大学 Deep multi-instance weak supervision text classification method based on semantics

Also Published As

Publication number Publication date
CN109241377B (en) 2021-04-23

Similar Documents

Publication Publication Date Title
CN109241377A (en) A kind of text document representation method and device based on the enhancing of deep learning topic information
US11868883B1 (en) Intelligent control with hierarchical stacked neural networks
Barz et al. Hierarchy-based image embeddings for semantic image retrieval
Zhao et al. Open vocabulary scene parsing
CN109918528A (en) A kind of compact Hash code learning method based on semanteme protection
CN105975573A (en) KNN-based text classification method
Saha et al. A Lightning fast approach to classify Bangla Handwritten Characters and Numerals using newly structured Deep Neural Network
CN112364638A (en) Personality identification method based on social text
CN110874410A (en) Text classification method based on long-time and short-time memory network and convolutional neural network
CN114241273A (en) Multi-modal image processing method and system based on Transformer network and hypersphere space learning
Huang et al. Siamese network-based supervised topic modeling
CN109062958B (en) Primary school composition automatic classification method based on TextRank and convolutional neural network
CN115688024A (en) Network abnormal user prediction method based on user content characteristics and behavior characteristics
CN111507093A (en) Text attack method and device based on similar dictionary and storage medium
CN111259147B (en) Sentence-level emotion prediction method and system based on self-adaptive attention mechanism
Nanehkaran et al. A pragmatic convolutional bagging ensemble learning for recognition of Farsi handwritten digits
CN109948163B (en) Natural language semantic matching method for dynamic sequence reading
Artemov et al. Informational neurobayesian approach to neural networks training. Opportunities and prospects
CN111581365A (en) Predicate extraction method
Aljaafari Ichthyoplankton classification tool using Generative Adversarial Networks and transfer learning
Deng Large scale visual recognition
Kim et al. CNN based sentence classification with semantic features using word clustering
Preetham et al. Comparative Analysis of Research Papers Categorization using LDA and NMF Approaches
Hilmiaji et al. Identifying Emotion on Indonesian Tweets using Convolutional Neural Networks
Chen Insincere Question Classification by Deep Neural Networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant