CN110232127A - Text classification method and device - Google Patents

Text classification method and device

Info

Publication number
CN110232127A
CN110232127A (application CN201910523985.9A)
Authority
CN
China
Prior art keywords
information
text
word
sequence
sequence information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910523985.9A
Other languages
Chinese (zh)
Other versions
CN110232127B (en)
Inventor
杨开平
谌立
熊永福
冯岭子
龚伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing Unisinsight Technology Co Ltd
Original Assignee
Chongqing Unisinsight Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing Unisinsight Technology Co Ltd filed Critical Chongqing Unisinsight Technology Co Ltd
Priority to CN201910523985.9A priority Critical patent/CN110232127B/en
Publication of CN110232127A publication Critical patent/CN110232127A/en
Application granted granted Critical
Publication of CN110232127B publication Critical patent/CN110232127B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/247Thesauruses; Synonyms
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a text classification method and device, relating to the field of natural language processing. The method comprises: obtaining a text to be classified; obtaining word vectors of the text to be classified according to its word information and a preset model, and obtaining a text vector of the text to be classified according to the word vectors and a preset algorithm; training the sequence information with a bidirectional long short-term memory (LSTM) network model to predict the association between the word information and the sequence information; and integrating the text vector with the association between the word information and the sequence information, and inputting the integrated text vector and association into a preset classification model to obtain the category of the text. By using the bidirectional LSTM model on the sequence information to compute the probability that the current text sequence occurs within the sequence information, and combining this with the text vector, the semantic features and sequence features of the text to be classified can be predicted precisely; inputting the integrated semantic and sequence features into the classifier yields an accurate classification of the text to be classified.

Description

Text classification method and device
Technical field
The present invention relates to the field of natural language processing technology, and in particular to a text classification method and device.
Background technique
In the field of natural language processing (Natural Language Processing, NLP), text classification helps users manage text data efficiently and provides basic support for text mining, for example, the classified typesetting of news, the classified management of archives, the information retrieval of search engines, and answer retrieval in question answering systems.
In the prior art, text classification representations are usually of three types: the vector space model (Vector Space Model, VSM), topic models, and deep learning language models. The VSM obtains the classification information of a text based on its words and is generally suitable for long text data; topic models obtain the classification information of a text by learning its shallow semantic information; and deep learning language models can automatically learn text feature representations.
However, the VSM generally loses part of the semantic information and the sequence information. Topic models can only learn shallow text semantics, and the semantics obtained are fuzzy and coarse-grained. Deep learning language models lack ease of use and scalability in text classification; that is, a model suitable for short text classification is unsuitable for long text classification, and a model suitable for long text classification is unsuitable for short text classification.
Summary of the invention
In view of the deficiencies of the prior art, an object of the present invention is to provide a text classification method and device, so as to solve the problems in the prior art that text classification models are not easy to use and classify inaccurately.
To achieve the above object, the technical solutions adopted in the embodiments of the present invention are as follows:
In a first aspect, an embodiment of the present invention provides a text classification method, comprising: obtaining a text to be classified, wherein the text to be classified includes word information and sequence information, and multiple pieces of word information constitute the sequence information; obtaining word vectors of the text to be classified according to the word information of the text to be classified and a preset model, and obtaining a text vector of the text to be classified according to the word vectors and a preset algorithm; training the sequence information with a bidirectional long short-term memory (LSTM) network model to predict the association between the word information and the sequence information, wherein the bidirectional LSTM model includes a forward LSTM model and a backward LSTM model; and integrating the text vector with the association between the word information and the sequence information, and inputting the integrated text vector and association into a preset classification model to obtain the category of the text.
Optionally, before training the sequence information with the bidirectional LSTM network model to predict the association between the word information and the sequence information, the method further includes: if the number of pieces of word information in the sequence information is less than a preset length, padding with a default value to obtain a padded sequence, the number of pieces of word information of the padded sequence being the preset length; and training the padded sequence with the bidirectional LSTM model to obtain the association between the word information and the padded sequence.
Optionally, training the sequence information with the bidirectional LSTM network model to predict the association between the word information and the sequence information includes: if the number of pieces of word information in the sequence information is greater than the preset length, deleting part of the word information to obtain a truncated sequence, the number of pieces of word information of the truncated sequence being the preset length; and training the truncated sequence with the bidirectional LSTM model to obtain the association between the word information and the truncated sequence.
Optionally, training the sequence information with the bidirectional LSTM network model to obtain the association between the word information and the sequence information includes:

Training the sequence information with the forward LSTM model, the association between the word information and the sequence information being

$$p(t_1, t_2, \ldots, t_n) = \prod_{k=1}^{n} p(t_k \mid t_1, t_2, \ldots, t_{k-1})$$

wherein $n$ denotes that the text to be classified contains a word group of $n$ pieces of word information $(t_1, t_2, \ldots, t_{n-1}, t_n)$, and $p(t_1, t_2, \ldots, t_{k-1}, t_k)$ denotes the probability that a sentence with the sequence structure $(t_1, t_2, \ldots, t_{k-1}, t_k)$ occurs;

Training the sequence information with the backward LSTM model, the association between the word information and the sequence information being

$$p(t_1, t_2, \ldots, t_n) = \prod_{k=1}^{n} p(t_k \mid t_{k+1}, t_{k+2}, \ldots, t_n)$$

wherein $p(t_{k+1}, t_{k+2}, \ldots, t_{n-1}, t_n)$ denotes the probability that a sentence with the sequence structure $(t_{k+1}, t_{k+2}, \ldots, t_{n-1}, t_n)$ occurs;

Combining the forward and backward language information into a single-layer bidirectional LSTM model, the log-likelihood function being

$$\sum_{k=1}^{n} \left( \log p(t_k \mid t_1, \ldots, t_{k-1}) + \log p(t_k \mid t_{k+1}, \ldots, t_n) \right)$$

Predicting, according to the above formula, the association between the current word information in the sequence information and its forward word information and backward word information.
Optionally, before obtaining the word vectors of the text to be classified according to the word information of the text to be classified and the preset model, the method further includes: performing word segmentation on the text to be classified.
In a second aspect, an embodiment of the present invention further provides a text classification device, comprising:
an obtaining module, configured to obtain a text to be classified, wherein the text to be classified includes word information and sequence information, and multiple pieces of word information constitute the sequence information, and specifically configured to obtain word vectors of the text to be classified according to the word information of the text to be classified and a preset model, and to obtain a text vector of the text to be classified according to the word vectors and a preset algorithm; a prediction module, configured to train the sequence information with a bidirectional long short-term memory (LSTM) network model to predict the association between the word information and the sequence information, wherein the bidirectional LSTM model includes a forward LSTM model and a backward LSTM model; and an integration module, configured to integrate the text vector with the association between the word information and the sequence information, and to input the integrated text vector and association into a preset classification model to obtain the category of the text.
Optionally, the device further includes a padding module, configured to: if the number of pieces of word information in the sequence information is less than a preset length, pad with a default value to obtain a padded sequence whose number of pieces of word information is the preset length; and train the padded sequence with the bidirectional LSTM model to obtain the association between the word information and the padded sequence.
Optionally, the padding module is specifically configured to: if the number of pieces of word information in the sequence information is greater than the preset length, delete part of the word information to obtain a truncated sequence whose number of pieces of word information is the preset length; and train the truncated sequence with the bidirectional LSTM model to obtain the association between the word information and the truncated sequence.
Optionally, the prediction module is specifically configured to: train the sequence information with the forward LSTM model, the association between the word information and the sequence information being

$$p(t_1, t_2, \ldots, t_n) = \prod_{k=1}^{n} p(t_k \mid t_1, t_2, \ldots, t_{k-1})$$

wherein $n$ denotes that the text to be classified contains a word group of $n$ pieces of word information $(t_1, t_2, \ldots, t_{n-1}, t_n)$, and $p(t_1, t_2, \ldots, t_{k-1}, t_k)$ denotes the probability that a sentence with the sequence structure $(t_1, t_2, \ldots, t_{k-1}, t_k)$ occurs; train the sequence information with the backward LSTM model, the association between the word information and the sequence information being

$$p(t_1, t_2, \ldots, t_n) = \prod_{k=1}^{n} p(t_k \mid t_{k+1}, t_{k+2}, \ldots, t_n)$$

wherein $p(t_{k+1}, t_{k+2}, \ldots, t_{n-1}, t_n)$ denotes the probability that a sentence with the sequence structure $(t_{k+1}, t_{k+2}, \ldots, t_{n-1}, t_n)$ occurs; combine the forward and backward language information into a single-layer bidirectional LSTM model, the log-likelihood function being

$$\sum_{k=1}^{n} \left( \log p(t_k \mid t_1, \ldots, t_{k-1}) + \log p(t_k \mid t_{k+1}, \ldots, t_n) \right)$$

and predict, according to the above formula, the association between the current word information in the sequence information and its forward word information and backward word information.
Optionally, the device further includes a processing module, configured to perform word segmentation on the text to be classified.
The beneficial effects of the present invention are as follows:
The present invention provides a text classification method and device. The text classification method includes: obtaining a text to be classified, wherein the text to be classified includes word information and sequence information, and multiple pieces of word information constitute the sequence information; obtaining word vectors of the text to be classified according to the word information of the text to be classified and a preset model, and obtaining a text vector of the text to be classified according to the word vectors and a preset algorithm; training the sequence information with a bidirectional long short-term memory (LSTM) network model to predict the association between the word information and the sequence information, wherein the bidirectional LSTM model includes a forward LSTM model and a backward LSTM model; and integrating the text vector with the association between the word information and the sequence information, and inputting the integrated text vector and association into a preset classification model to obtain the category of the text. By using the bidirectional LSTM model on the sequence information to compute the probability that the current text sequence occurs within the sequence information, and combining this with the text vector, the present invention can precisely predict the semantic features and sequence features of the text to be classified; inputting the integrated semantic and sequence features into the classifier yields an accurate classification of the text to be classified.
Detailed description of the invention
To illustrate the technical solutions of the embodiments of the present invention more clearly, the accompanying drawings needed in the embodiments are briefly described below. It should be understood that the following drawings illustrate only certain embodiments of the present invention and are therefore not to be construed as limiting its scope; those of ordinary skill in the art may derive other relevant drawings from these drawings without creative effort.
Fig. 1 is a schematic flowchart of a text classification method provided by the present invention;
Fig. 2 is a schematic diagram of a text classification method provided by the present invention;
Fig. 3 is a schematic flowchart of a text classification method provided by an embodiment of the present invention;
Fig. 4 is a schematic flowchart of a text classification method provided by a further embodiment of the present invention;
Fig. 5 is a schematic flowchart of a text classification method provided by another embodiment of the present invention;
Fig. 6 is a schematic structural diagram of a text classification device provided by the present invention;
Fig. 7 is a schematic structural diagram of a text classification device provided by an embodiment of the present invention;
Fig. 8 is a schematic structural diagram of a text classification device provided by a further embodiment of the present invention;
Fig. 9 is a schematic structural diagram of a text classification device provided by another embodiment of the present invention.
Specific embodiment
To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them.
Explanation of terms:
Word2vec (word to vector): a group of related models used to generate word vectors. They can predict surrounding words from the current word, predict the current word from surrounding words, and thereby produce word vectors. After training is complete, a word2vec model can be used to map each word to a vector, which can be used to represent word-to-word relationships.
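As an illustration of how such word vectors can be obtained in practice, below is a minimal sketch using the open-source gensim library; the toy corpus and the hyperparameters (vector_size, window, sg) are placeholder assumptions, not values prescribed by the patent.

```python
from gensim.models import Word2Vec

# Toy, already-segmented corpus (placeholder; the patent assumes segmented text).
corpus = [
    ["today", "sunshine", "bright"],
    ["tomorrow", "weather", "fine"],
]

# Train skip-gram word2vec: predict surrounding words from the current word.
model = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1, sg=1)

vec = model.wv["sunshine"]                 # map a word to its vector
similar = model.wv.most_similar("today")   # words close to "today" in the space
```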
ELMo model: a novel deep contextualized word representation that models both complex word characteristics (such as syntax and semantics) and how a word's use varies across linguistic contexts (i.e., it models polysemy).
Long short-term memory network (Long Short-Term Memory, LSTM): an LSTM cell contains three gates, called the input gate, the forget gate, and the output gate. When a piece of information enters the LSTM network, rules determine whether it is useful: only information that passes the algorithm's check is retained, while information that does not is discarded through the forget gate.
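For reference, the gating this paragraph paraphrases is conventionally written as follows (the standard LSTM equations in the usual notation, not symbols defined by the patent):

```latex
i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i) \quad \text{(input gate)}
f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f) \quad \text{(forget gate)}
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o) \quad \text{(output gate)}
\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
h_t = o_t \odot \tanh(c_t)
```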
Fig. 1 is a schematic flowchart of a text classification method provided by the present invention. The method can be executed by a device with processing capability, such as a computer or a server. As shown in Fig. 1, the method comprises:
S110: obtain a text to be classified, wherein the text to be classified includes word information and sequence information, and multiple pieces of word information constitute the sequence information.
The text to be classified can be a short sentence or phrase, or a document such as a news report or a literary work. It can be a short text or a long text; no restriction is imposed here.
Whether the text to be classified is a short text or a long text, its content is composed of sentence information, and each piece of sentence information is composed of multiple pieces of word information. The state of multiple pieces of word information before they are composed into a sentence is called the sequence information.
For example, a text contains the sentence information "the sunshine today is bright", in which "today", "sunshine", and "bright" are the word information referred to above. For this sentence information, multiple sequences can be formed from its word information, such as: today/sunshine/bright, today/bright/sunshine, sunshine/bright/today, sunshine/today/bright, bright/sunshine/today, bright/today/sunshine. Obviously, only the sequence today/sunshine/bright is the order required to compose the sentence in the text; therefore, the state of multiple pieces of word information before they form the sentence is taken as the sequence information.
S120: obtain the word vectors of the text to be classified according to the word information of the text to be classified and a preset model, and obtain the text vector of the text to be classified according to the word vectors and a preset algorithm.
Since a lexical database contains a very large amount of information, words with similar meanings in the database are grouped together statistically; therefore, each piece of word information is mapped into a spatial dimension and represented by a word vector (word embedding).
In obtaining the word vectors, because the word information of the text to be classified is limited, a preset model is combined so that the trained word vectors can represent the word information precisely and high-quality word vectors are obtained.
The above preset model can be a Word2vec model or an ELMo model; the present invention imposes no restriction.
In the spatial dimension, word information with similar meanings lies closer together. For example, the model sentence "the sunshine today is bright" contains three pieces of word information: "today", "sunshine", and "bright". Words semantically similar to "today" in this sentence information, such as "tomorrow", "the day after tomorrow", and "yesterday", lie close to "today" in the spatial dimension. This indicates that when such similar word information is composed into sentences with the other word information of the sentence, the meanings expressed are similar and the classification categories are close.
It should be noted that the text classification method provided by the present invention can also use an already-trained word vector model, such as the open-source dataset of 8,000,000 word vectors released in 2018.
Further, after all the word information in the text to be classified has been represented by word vectors, the word information in the text is combined by the preset algorithm to obtain the text vector of the text to be classified, so that the text information can be expressed by one specific vector.
The preset algorithm sums the word vectors corresponding to the word information in the text to be classified and takes their average. If the word group of $n$ words is $(t_1, t_2, \ldots, t_{n-1}, t_n)$, the calculation is:

$$v_{\text{text}} = \frac{1}{n} \sum_{i=1}^{n} v(t_i)$$

In the formula, $v(t)$ denotes the space-vector representation of the word $t$.
With the above formula, the text vector of the text to be classified can be obtained.
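A minimal sketch of this averaging step (assuming the word vectors are available as numpy arrays; the names are illustrative):

```python
import numpy as np

def text_vector(word_vectors: list) -> np.ndarray:
    """Average a text's word vectors to obtain its text vector."""
    return np.mean(np.stack(word_vectors), axis=0)

# Placeholder word vectors standing in for lookups from the trained model.
v_text = text_vector([np.random.rand(100) for _ in range(3)])
```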
S130: train the sequence information with the bidirectional long short-term memory (LSTM) network model to predict the association between the word information and the sequence information, wherein the bidirectional LSTM model includes a forward LSTM model and a backward LSTM model.
A traditional LSTM model memorizes only the language information preceding a sequence: it can learn the estimation of later word information in sentence information from earlier word information, but it cannot learn the estimation of earlier word information from later word information. Therefore, the text classification method provided by the present invention trains a bidirectional LSTM model on the sequence information, which improves the accuracy with which the model learns the sequence information.
Fig. 2 is a schematic diagram of a text classification method provided by the present invention. As shown in Fig. 2, the semantic features in the text to be classified are represented by word information; the forward LSTM model and the backward LSTM model are trained on the forward words and backward words of each piece of word information respectively, and finally output the forward words and backward words in the sequence information that are associated with the current word information.
It should be noted that the above association indicates the forward words and backward words that can be combined with the current word information in the sequence information to form sentence information.
In one embodiment, using the example sentence "the sunshine today is bright": when the forward LSTM model is trained on the sequence information, it can learn that "sunshine" may be followed by "bright" or "brilliant", or possibly by "very good", while the probability of "today" following "sunshine" is small. The purpose of the forward LSTM model is to learn, through training, the probabilities of the backward word information that can be associated with "sunshine".
When the backward LSTM model is trained on the sequence information, it can learn that "sunshine" may be preceded by "today", or by "tomorrow" or "here", thereby obtaining the probabilities of the forward words that can be associated with "sunshine".
Further, through the above embodiment, the sentence information "the sunshine today is bright" can be extended to obtain semantically similar sentence information such as "today is sunny", "the sunshine today is very good", and "tomorrow is sunny".
It should be noted that the text classification method provided by the present invention trains on the sequence information with the forward LSTM model and the backward LSTM model to obtain the forward words and backward words associated with the current word information; it is not limited to the above embodiments.
Optionally, the association between the word information and the sequence information predicted by training the bidirectional LSTM network model on the sequence information can be represented by a vector that memorizes the preceding sequence information; that is, the results output by the forward LSTM model and the backward LSTM model can be vectors that memorize the preceding sequence information.
Optionally, since the forward LSTM model outputs a vector that memorizes the forward sequence information and the backward LSTM model outputs a vector that memorizes the backward sequence information, the above vector memorizing the sequence information can be the vector average obtained by calculating over the vectors output by the forward LSTM model and the backward LSTM model. For example, all the vectors can be added and then averaged. It should be noted that if the vectors are multi-dimensional, the vector memorizing the sequence information is the multi-dimensional vector obtained by averaging each dimension of each vector separately.
As described above, characterizing the predicted association between the word information and the sequence information by the average of the sequence-information vectors facilitates macroscopic estimation of the sequence information, and also helps keep the dimension consistent with the vectors carrying semantic information.
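A sketch of this elementwise averaging of the two directions' outputs (shapes and names are assumptions for illustration):

```python
import numpy as np

def combine_directions(forward_vec: np.ndarray, backward_vec: np.ndarray) -> np.ndarray:
    """Average the forward-LSTM and backward-LSTM output vectors dimension by
    dimension, giving one vector that memorizes the sequence information while
    keeping the same dimensionality as the semantic vectors."""
    return (forward_vec + backward_vec) / 2.0
```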
S140: integrate the text vector with the association between the word information and the sequence information, and input the integrated text vector and association into a preset classification model to obtain the category of the text.
Since a text contains multiple kinds of sentence information, the text vector obtained in step S120 is integrated with the forward word information and backward word information of the current word predicted from the sequence information of the text in step S130, and multiple extended texts of the current text to be classified can be obtained. The extended texts may include the semantics carried by the aforementioned text vector and the vectors of the preceding sequence information.
The multiple extended texts are input into the preset classification model, and the preset classification model can automatically derive the category of the text.
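One plausible reading of this integration step is a simple concatenation of the two feature vectors; the sketch below reflects that reading, not an operation mandated by the patent:

```python
import numpy as np

def integrate(text_vec: np.ndarray, seq_vec: np.ndarray) -> np.ndarray:
    """Join the semantic feature (text vector) and the sequence feature
    (bidirectional LSTM output) into a single classifier input."""
    return np.concatenate([text_vec, seq_vec])
```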
It should be noted that the preset classification model can be a softmax classification model, but the present invention is not limited to the softmax classification model.
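A minimal sketch of such a softmax classification head in PyTorch (the layer sizes and the choice of PyTorch are assumptions; the patent only names softmax as one admissible classifier):

```python
import torch
import torch.nn as nn

class SoftmaxClassifier(nn.Module):
    """Map the integrated feature vector to a probability per category."""
    def __init__(self, feature_dim: int, num_classes: int):
        super().__init__()
        self.linear = nn.Linear(feature_dim, num_classes)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return torch.softmax(self.linear(features), dim=-1)

clf = SoftmaxClassifier(feature_dim=200, num_classes=10)
probs = clf(torch.randn(1, 200))   # probabilities over the 10 categories
```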
In the text classification method provided by the present invention, the bidirectional LSTM model is used on the sequence information to compute the probability that the current text sequence occurs within the sequence information; combined with the text vector, the semantic features and sequence features of the text to be classified can be predicted precisely, and inputting the integrated semantic and sequence features into the classifier yields an accurate classification of the text to be classified.
Fig. 3 is a schematic flowchart of a text classification method provided by an embodiment of the present invention. As shown in Fig. 3, before step S130 is executed, the word information in the sequence information also needs to be preprocessed. The preprocessing includes:
S210: if the number of pieces of word information in the sequence information is less than a preset length, pad with a default value to obtain a padded sequence, wherein the number of pieces of word information of the padded sequence is the preset length.
Since the LSTM model can only handle fixed-length input data, the word information of the sequence information needs to be preprocessed before being input. For example, the average number of words contained in the sequence information of the text to be classified is taken as the preset length, calculated as:

$$\bar{l} = \frac{1}{N} \sum_{i=1}^{N} |s_i|$$

where $\bar{l}$ denotes the average number of pieces of word information contained in the sequence information, $s_i$ denotes the $i$-th sequence in the text data, $i$ is the sequence index, and $N$ is the total number of sequences.
If the number of pieces of word information in a sequence is less than the preset length, the default value is padded up to the average length; here the default value is 0.
In another embodiment, if the average length of the word information in the sequence information is 8 and the sentence is "today the weather is nice and the sunshine is bright", then the first input of the LSTM model is v("today"), the second input is v("weather"), ..., the seventh input is 0, and the eighth input is 0. Here v(x) denotes the vector value of the word information x.
S220: train the padded sequence with the bidirectional LSTM model to obtain the association between the word information and the padded sequence.
Fig. 4 is a schematic flowchart of a text classification method provided by an embodiment of the present invention. As shown in Fig. 4, the method for preprocessing the word information in the sequence information further includes:
S310: if the number of pieces of word information in the sequence information is greater than the preset length, delete part of the word information to obtain a truncated sequence, wherein the number of pieces of word information of the truncated sequence is the preset length.
That is, when the number of words contained in a sequence exceeds the hyperparameter average length, the excess words are deleted. The deletion removes the word vectors beyond the average length.
In another embodiment, if the average length of the word information in the sequence information is 3 and the sentence is "today the weather is nice and the sunshine is bright", then the first input of the LSTM model is v("today"), the second input is v("weather"), and the third input is v("nice"); the word vectors of the subsequent word information are deleted. Here v(x) denotes the vector value of the word information x.
It should be noted that the average length of the word information in the sequence information is not limited to the above embodiments.
S320: train the truncated sequence with the bidirectional LSTM model to obtain the association between the word information and the truncated sequence.
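A combined sketch of the padding (S210) and truncation (S310) preprocessing; the vector dimension and names are illustrative assumptions:

```python
import numpy as np

def pad_or_truncate(word_vectors: list, preset_len: int, dim: int) -> np.ndarray:
    """Drop word vectors beyond the preset length and pad short sequences
    with zero vectors, so every LSTM input has the same fixed length."""
    seq = list(word_vectors[:preset_len])    # truncate overlong sequences
    while len(seq) < preset_len:
        seq.append(np.zeros(dim))            # pad short sequences with 0
    return np.stack(seq)

preset_len = 8   # e.g. the corpus-average sequence length computed above
inputs = pad_or_truncate([np.random.rand(100) for _ in range(3)], preset_len, 100)
```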
Further, step S130 includes:
Training the sequence information with the forward LSTM model to obtain the association between the word information and the sequence information, calculated as:

$$p(t_1, t_2, \ldots, t_n) = \prod_{k=1}^{n} p(t_k \mid t_1, t_2, \ldots, t_{k-1})$$

In the formula, $n$ denotes that the word group in the text to be classified contains $n$ pieces of word information $(t_1, t_2, \ldots, t_{n-1}, t_n)$, $t_k$ denotes the $k$-th piece of word information, and $p(t_1, t_2, \ldots, t_k)$ denotes the probability that a sentence with the sequence structure $(t_1, t_2, \ldots, t_{k-1}, t_k)$ occurs.
As known from the foregoing embodiment, the sentence information "the sunshine today is bright" has many possible compositions, but only one target composition; $p(t_1, t_2, \ldots, t_k)$ denotes the probability that the target sequence-structure sentence information occurs.
Training the sequence information with the backward LSTM model to obtain the association between the word information and the sequence information, calculated as:

$$p(t_1, t_2, \ldots, t_n) = \prod_{k=1}^{n} p(t_k \mid t_{k+1}, t_{k+2}, \ldots, t_n)$$

where $p(t_{k+1}, t_{k+2}, \ldots, t_{n-1}, t_n)$ denotes the probability that a sentence with the sequence structure $(t_{k+1}, t_{k+2}, \ldots, t_{n-1}, t_n)$ occurs.
Combining the forward and backward language information into a single-layer bidirectional LSTM model, the log-likelihood function is:

$$\sum_{k=1}^{n} \left( \log p(t_k \mid t_1, \ldots, t_{k-1}) + \log p(t_k \mid t_{k+1}, \ldots, t_n) \right)$$

According to the above formula, the association between the current word information in the sequence information and its forward word information and backward word information can be predicted.
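A compact PyTorch sketch of training toward this log-likelihood (next-word prediction in both directions); the vocabulary size, dimensions, and the use of PyTorch are illustrative assumptions rather than the patent's exact network:

```python
import torch
import torch.nn as nn

class BiLM(nn.Module):
    """Single-layer forward + backward LSTM language model: the forward half
    maximizes log p(t_k | t_1..t_{k-1}), the backward half log p(t_k | t_{k+1}..t_n)."""
    def __init__(self, vocab: int = 5000, emb: int = 100, hid: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.fwd = nn.LSTM(emb, hid, batch_first=True)
        self.bwd = nn.LSTM(emb, hid, batch_first=True)
        self.out = nn.Linear(hid, vocab)     # shared output (softmax) layer
        self.nll = nn.CrossEntropyLoss()

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        x = self.embed(tokens)                     # (batch, n, emb)
        h_f, _ = self.fwd(x[:, :-1])               # reads t_1 .. t_{n-1}
        h_b, _ = self.bwd(x.flip(1)[:, :-1])       # reads t_n .. t_2, reversed
        loss_f = self.nll(self.out(h_f).transpose(1, 2), tokens[:, 1:])
        loss_b = self.nll(self.out(h_b).transpose(1, 2), tokens.flip(1)[:, 1:])
        return loss_f + loss_b   # minimizing this maximizes the bidirectional log-likelihood

model = BiLM()
loss = model(torch.randint(0, 5000, (4, 8)))   # batch of 4 fixed-length sequences
loss.backward()
```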
Fig. 5 is a schematic flowchart of a text classification method provided by another embodiment of the present invention. As shown in Fig. 5, before step S120 is executed, the text to be classified also needs to be preprocessed. The preprocessing includes:
S111: perform word segmentation on the text to be classified.
Before learning the text vector representation, word segmentation must be performed on the text. Segmentation is the process of splitting a continuous sentence sequence back into a word sequence according to certain specifications. In Chinese text, the words within a sentence are all connected together, whereas in English text words are separated by spaces. Chinese can be simply delimited at the level of characters, sentences, and paragraphs by obvious separators; only words have no formal delimiter, as word after word links together into one complete written expression. Therefore, to obtain accurate word information, word segmentation is required.
For example, segmenting the sentence "this cartoon is deeply loved by children" may yield the word units this / cartoon / deeply loved / children / like; only on the basis of accurate segmentation can accurate associations between sequence information and word information be obtained.
When performing word segmentation, the text classification method provided by the present invention can adopt segmentation methods based on string matching, statistics-based segmentation methods, mechanical segmentation algorithms based on machine learning, and so on; no restriction is imposed here on the segmentation algorithm applied.
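For Chinese text, a minimal segmentation sketch using the open-source jieba library (one common choice; the patent does not prescribe a particular tool, and the sample sentence is illustrative):

```python
import jieba

sentence = "这部动画片深得孩子们喜欢"   # "this cartoon is deeply loved by children"
words = list(jieba.cut(sentence))      # split the continuous sentence into words
print("/".join(words))                 # e.g. 这部/动画片/深得/孩子们/喜欢
```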
Further, the training process of the classification model used by the present invention for text classification is as follows:
Obtain text samples of different categories, wherein the text samples carry classification labels. This process means obtaining a large number of texts of known category, for example a large number of texts of the "sport" category; each known-category text is then segmented to obtain the multiple words it contains, and the word information after segmentation of each text is further labeled "sport". For example, the classification model can classify a given document, where the document can be a text collection composed of multiple articles, and each text in the collection contains text data composed of articles, sentences, words, and so on.
Further, the feature information describing the words of the "sport" category in each text is extracted separately; the classifier learns from a large number of "sport" texts and thereby obtains the common features of the word information contained in "sport" texts.
When any text is input into the classification model, the classifier can automatically determine whether the text belongs to the "sport" category.
It should be noted that the text classification method provided by the embodiments of the present invention is not limited to the "sport" category and may also include other categories.
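A sketch of this supervised training process as a cross-entropy loop over labeled samples (all names, sizes, and the synthetic data are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Stand-ins for the integrated text/sequence feature vectors of labeled
# sample texts and their category ids (e.g. 0 = "sport").
features = torch.randn(64, 200)
labels = torch.randint(0, 10, (64,))

clf = nn.Linear(200, 10)               # linear + softmax classification head
opt = torch.optim.Adam(clf.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()        # applies log-softmax internally

for epoch in range(10):
    opt.zero_grad()
    loss = loss_fn(clf(features), labels)
    loss.backward()
    opt.step()
```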
In one embodiment, the text classification method provided by the present invention can also be based on sentence information: the text to be classified contains multiple pieces of sentence information; the sentence information is converted into corresponding sentence vectors, and the text vector of the text to be classified with respect to the sentence information vectors is further obtained.
Further, through the bidirectional LSTM model, the associations of the preceding and following sentence information appearing around the current sentence information are trained separately, and the associations are combined with the text vector to obtain the text classification method.
It should be noted that, based on sentence information, the long text information contained in a text can be classified precisely, improving the accuracy and ease of use of the classification algorithm.
Fig. 6 is a schematic structural diagram of a text classification device provided by the present invention. As shown in Fig. 6, the device specifically includes: an obtaining module 601, a prediction module 602, and an integration module 603. Wherein:
The obtaining module 601 is configured to obtain a text to be classified, wherein the text to be classified includes word information and sequence information, and multiple pieces of word information constitute the sequence information.
The obtaining module 601 is specifically configured to obtain the word vectors of the text to be classified according to the word information of the text to be classified and a preset model, and to obtain the text vector of the text to be classified according to the word vectors and a preset algorithm.
The prediction module 602 is configured to train the sequence information with the bidirectional long short-term memory (LSTM) network model to predict the association between the word information and the sequence information, wherein the bidirectional LSTM model includes a forward LSTM model and a backward LSTM model.
The integration module 603 is configured to integrate the text vector with the association between the word information and the sequence information, and to input the integrated text vector and association into a preset classification model to obtain the category of the text.
Fig. 7 is a schematic structural diagram of a text classification device provided by an embodiment of the present invention. As shown in Fig. 7, the device provided by the present invention further includes a padding module 604, wherein:
The padding module 604 is configured to: if the number of pieces of word information in the sequence information is less than a preset length, pad with a default value to obtain a padded sequence whose number of pieces of word information is the preset length; and train the padded sequence with the bidirectional LSTM model to obtain the association between the word information and the padded sequence.
Further, the padding module 604 is specifically configured to: if the number of pieces of word information in the sequence information is greater than the preset length, delete part of the word information to obtain a truncated sequence whose number of pieces of word information is the preset length; and train the truncated sequence with the bidirectional LSTM model to obtain the association between the word information and the truncated sequence.
Optionally, the prediction module 602 is specifically configured to: train the sequence information with the forward LSTM model, the association between the word information and the sequence information being

$$p(t_1, t_2, \ldots, t_n) = \prod_{k=1}^{n} p(t_k \mid t_1, t_2, \ldots, t_{k-1})$$

where $n$ denotes that the text to be classified contains a word group of $n$ pieces of word information $(t_1, t_2, \ldots, t_{n-1}, t_n)$, and $p(t_1, t_2, \ldots, t_{k-1}, t_k)$ denotes the probability that a sentence with the sequence structure $(t_1, t_2, \ldots, t_{k-1}, t_k)$ occurs; train the sequence information with the backward LSTM model, the association between the word information and the sequence information being

$$p(t_1, t_2, \ldots, t_n) = \prod_{k=1}^{n} p(t_k \mid t_{k+1}, t_{k+2}, \ldots, t_n)$$

where $p(t_{k+1}, t_{k+2}, \ldots, t_{n-1}, t_n)$ denotes the probability that a sentence with the sequence structure $(t_{k+1}, t_{k+2}, \ldots, t_{n-1}, t_n)$ occurs; combine the forward and backward language information into a single-layer bidirectional LSTM model, the log-likelihood function being

$$\sum_{k=1}^{n} \left( \log p(t_k \mid t_1, \ldots, t_{k-1}) + \log p(t_k \mid t_{k+1}, \ldots, t_n) \right)$$

and predict, according to the above formula, the association between the current word information in the sequence information and its forward word information and backward word information.
Fig. 8 is a schematic structural diagram of a text classification device provided by a further embodiment of the present invention. As shown in Fig. 8, the device provided by the present invention further includes a processing module 605, wherein:
The processing module 605 is configured to perform word segmentation on the text to be classified.
The above device is used to execute the methods provided by the foregoing embodiments; its implementation principles and technical effects are similar and are not repeated here.
The above modules may be one or more integrated circuits configured to implement the above methods, for example: one or more application-specific integrated circuits (Application Specific Integrated Circuit, ASIC), one or more microprocessors (digital signal processor, DSP), or one or more field-programmable gate arrays (Field Programmable Gate Array, FPGA), and so on. As another example, when one of the above modules is realized in the form of a processing element scheduling program code, the processing element can be a general-purpose processor, such as a central processing unit (Central Processing Unit, CPU) or another processor that can call program code. As yet another example, these modules can be integrated together and realized in the form of a system-on-a-chip (SOC).
Fig. 9 is a schematic structural diagram of a text classification device provided by a further embodiment of the present invention. The device can be integrated into a terminal device or a chip of a terminal device, where the terminal can be a computing device with classification processing capability.
As shown in Fig. 9, the device includes a memory 901 and a processor 902.
The memory 901 is configured to store a program; the processor 902 calls the program stored in the memory 901 to execute the above method embodiments. The specific implementation and technical effects are similar and are not repeated here.
Optionally, the present invention also provides a program product, such as a computer-readable storage medium, including a program which, when executed by a processor, is used to execute the above method embodiments.
In the several embodiments provided by the present invention, it should be understood that the disclosed device and method can be realized in other ways. For example, the device embodiments described above are merely illustrative; for instance, the division of the units is only a division by logical function, and there may be other divisions in actual implementation, for example multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual coupling, direct coupling, or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the various embodiments of the present invention may be integrated into one processing unit, or each unit may physically exist alone, or two or more units may be integrated into one unit. The above integrated unit may be realized in the form of hardware, or in the form of hardware plus a software functional unit.
The above integrated unit realized in the form of a software functional unit may be stored in a computer-readable storage medium. The above software functional unit is stored in a storage medium and includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute part of the steps of the methods of the various embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

Claims (10)

1. A text classification method, characterized by comprising:
obtaining a text to be classified, wherein the text to be classified includes word information and sequence information, and multiple pieces of the word information constitute the sequence information;
obtaining word vectors of the text to be classified according to the word information of the text to be classified and a preset model, and obtaining a text vector of the text to be classified according to the word vectors and a preset algorithm;
training the sequence information with a bidirectional long short-term memory (LSTM) network model to predict the association between the word information and the sequence information, wherein the bidirectional LSTM model includes a forward LSTM model and a backward LSTM model; and
integrating the text vector with the association between the word information and the sequence information, and inputting the integrated text vector and association into a preset classification model to obtain the category of the text.
2. The text classification method according to claim 1, wherein before training the sequence information with the bidirectional long short-term memory (LSTM) network model to predict the association between the word information and the sequence information, the method further comprises:
if the number of pieces of word information in the sequence information is less than a preset length, padding with a default value to obtain a padded sequence, the number of pieces of word information of the padded sequence being the preset length; and
training the padded sequence with the bidirectional LSTM model to obtain the association between the word information and the padded sequence.
3. The text classification method according to claim 2, wherein training the sequence information with the bidirectional long short-term memory (LSTM) network model to predict the association between the word information and the sequence information comprises:
if the number of pieces of word information in the sequence information is greater than the preset length, deleting part of the word information to obtain a truncated sequence, the number of pieces of word information of the truncated sequence being the preset length; and
training the truncated sequence with the bidirectional LSTM model to obtain the association between the word information and the truncated sequence.
4. The text classification method according to any one of claims 1 to 3, wherein training the sequence information with the bidirectional long short-term memory (LSTM) network model to obtain the association between the word information and the sequence information comprises:
training the sequence information with the forward LSTM model, the association between the word information and the sequence information being

$$p(t_1, t_2, \ldots, t_n) = \prod_{k=1}^{n} p(t_k \mid t_1, t_2, \ldots, t_{k-1})$$

wherein $n$ denotes that the text to be classified contains a word group of $n$ pieces of word information $(t_1, t_2, \ldots, t_{n-1}, t_n)$, and $p(t_1, t_2, \ldots, t_{k-1}, t_k)$ denotes the probability that a sentence with the sequence structure $(t_1, t_2, \ldots, t_{k-1}, t_k)$ occurs;
training the sequence information with the backward LSTM model, the association between the word information and the sequence information being

$$p(t_1, t_2, \ldots, t_n) = \prod_{k=1}^{n} p(t_k \mid t_{k+1}, t_{k+2}, \ldots, t_n)$$

wherein $p(t_{k+1}, t_{k+2}, \ldots, t_{n-1}, t_n)$ denotes the probability that a sentence with the sequence structure $(t_{k+1}, t_{k+2}, \ldots, t_{n-1}, t_n)$ occurs;
combining the forward and backward language information into a single-layer bidirectional LSTM model, the log-likelihood function being

$$\sum_{k=1}^{n} \left( \log p(t_k \mid t_1, \ldots, t_{k-1}) + \log p(t_k \mid t_{k+1}, \ldots, t_n) \right)$$

and predicting, according to the above formula, the association between the current word information in the sequence information and its forward word information and backward word information.
5. The text classification method according to claim 1, wherein before obtaining the word vectors of the text to be classified according to the word information of the text to be classified and the preset model, the method further comprises:
performing word segmentation on the text to be classified.
6. A text classification device, characterized by comprising:
an obtaining module, configured to obtain a text to be classified, wherein the text to be classified includes word information and sequence information, and multiple pieces of the word information constitute the sequence information;
the obtaining module being specifically configured to obtain word vectors of the text to be classified according to the word information of the text to be classified and a preset model, and to obtain a text vector of the text to be classified according to the word vectors and a preset algorithm;
a prediction module, configured to train the sequence information with a bidirectional long short-term memory (LSTM) network model to predict the association between the word information and the sequence information, wherein the bidirectional LSTM model includes a forward LSTM model and a backward LSTM model; and
an integration module, configured to integrate the text vector with the association between the word information and the sequence information, and to input the integrated text vector and association into a preset classification model to obtain the category of the text.
7. The text classification device according to claim 6, further comprising a padding module configured to: if the number of pieces of word information in the sequence information is less than a preset length, pad with a default value to obtain a padded sequence, the number of pieces of word information of the padded sequence being the preset length; and train the padded sequence with the bidirectional LSTM model to obtain the association between the word information and the padded sequence.
8. The text classification device according to claim 7, wherein the padding module is specifically configured to: if the number of pieces of word information in the sequence information is greater than the preset length, delete part of the word information to obtain a truncated sequence, the number of pieces of word information of the truncated sequence being the preset length; and train the truncated sequence with the bidirectional LSTM model to obtain the association between the word information and the truncated sequence.
9. The text classification device according to any one of claims 6 to 8, wherein the prediction module is specifically configured to: train the sequence information with the forward LSTM model, the association between the word information and the sequence information being

$$p(t_1, t_2, \ldots, t_n) = \prod_{k=1}^{n} p(t_k \mid t_1, t_2, \ldots, t_{k-1})$$

wherein $n$ denotes that the text to be classified contains a word group of $n$ pieces of word information $(t_1, t_2, \ldots, t_{n-1}, t_n)$, and $p(t_1, t_2, \ldots, t_{k-1}, t_k)$ denotes the probability that a sentence with the sequence structure $(t_1, t_2, \ldots, t_{k-1}, t_k)$ occurs; train the sequence information with the backward LSTM model, the association between the word information and the sequence information being

$$p(t_1, t_2, \ldots, t_n) = \prod_{k=1}^{n} p(t_k \mid t_{k+1}, t_{k+2}, \ldots, t_n)$$

wherein $p(t_{k+1}, t_{k+2}, \ldots, t_{n-1}, t_n)$ denotes the probability that a sentence with the sequence structure $(t_{k+1}, t_{k+2}, \ldots, t_{n-1}, t_n)$ occurs;
combine the forward and backward language information into a single-layer bidirectional LSTM model, the log-likelihood function being

$$\sum_{k=1}^{n} \left( \log p(t_k \mid t_1, \ldots, t_{k-1}) + \log p(t_k \mid t_{k+1}, \ldots, t_n) \right)$$

and predict, according to the above formula, the association between the current word information in the sequence information and its forward word information and backward word information.
10. The text classification device according to claim 6, further comprising a processing module configured to perform word segmentation on the text to be classified.
CN201910523985.9A 2019-06-17 2019-06-17 Text classification method and device Active CN110232127B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910523985.9A CN110232127B (en) 2019-06-17 2019-06-17 Text classification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910523985.9A CN110232127B (en) 2019-06-17 2019-06-17 Text classification method and device

Publications (2)

Publication Number Publication Date
CN110232127A true CN110232127A (en) 2019-09-13
CN110232127B CN110232127B (en) 2021-11-16

Family

ID=67860025

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910523985.9A Active CN110232127B (en) 2019-06-17 2019-06-17 Text classification method and device

Country Status (1)

Country Link
CN (1) CN110232127B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111180025A (en) * 2019-12-18 2020-05-19 东北大学 Method and device for representing medical record text vector and inquiry system
CN111461904A (en) * 2020-04-17 2020-07-28 支付宝(杭州)信息技术有限公司 Object class analysis method and device
CN111930938A (en) * 2020-07-06 2020-11-13 武汉卓尔数字传媒科技有限公司 Text classification method and device, electronic equipment and storage medium
CN111930942A (en) * 2020-08-07 2020-11-13 腾讯云计算(长沙)有限责任公司 Text classification method, language model training method, device and equipment
CN113342933A (en) * 2021-05-31 2021-09-03 淮阴工学院 Multi-feature interactive network recruitment text classification method similar to double-tower model

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180240012A1 (en) * 2017-02-17 2018-08-23 Wipro Limited Method and system for determining classification of text
CN109243616A * 2018-06-29 2019-01-18 东华大学 Joint relation extraction and structuring system for breast electronic health records based on deep learning
CN109726268A (en) * 2018-08-29 2019-05-07 中国人民解放军国防科技大学 Text representation method and device based on hierarchical neural network
CN109740148A * 2018-12-16 2019-05-10 北京工业大学 Text sentiment analysis method based on BiLSTM combined with an attention mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
马建红 et al.: "Patent Classification Method Based on Deep Learning" (基于深度学习的专利分类方法), Computer Engineering (《计算机工程》) *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111180025A (en) * 2019-12-18 2020-05-19 东北大学 Method and device for representing medical record text vector and inquiry system
CN111461904A (en) * 2020-04-17 2020-07-28 支付宝(杭州)信息技术有限公司 Object class analysis method and device
CN111461904B (en) * 2020-04-17 2022-06-21 支付宝(杭州)信息技术有限公司 Object class analysis method and device
CN111930938A (en) * 2020-07-06 2020-11-13 武汉卓尔数字传媒科技有限公司 Text classification method and device, electronic equipment and storage medium
CN111930942A (en) * 2020-08-07 2020-11-13 腾讯云计算(长沙)有限责任公司 Text classification method, language model training method, device and equipment
CN111930942B (en) * 2020-08-07 2023-08-15 腾讯云计算(长沙)有限责任公司 Text classification method, language model training method, device and equipment
CN113342933A (en) * 2021-05-31 2021-09-03 淮阴工学院 Multi-feature interactive network recruitment text classification method similar to double-tower model
CN113342933B (en) * 2021-05-31 2022-11-08 淮阴工学院 Multi-feature interactive network recruitment text classification method similar to double-tower model

Also Published As

Publication number Publication date
CN110232127B (en) 2021-11-16

Similar Documents

Publication Publication Date Title
CN110866117B (en) Short text classification method based on semantic enhancement and multi-level label embedding
CN105868184B (en) A kind of Chinese personal name recognition method based on Recognition with Recurrent Neural Network
CN110232127A (en) File classification method and device
US11314939B2 (en) Method and apparatus for performing hierarchiacal entity classification
CN111143576A (en) Event-oriented dynamic knowledge graph construction method and device
CN108363816A (en) Open entity relation extraction method based on sentence justice structural model
CN110083700A (en) A kind of enterprise's public sentiment sensibility classification method and system based on convolutional neural networks
CN108388651A (en) A kind of file classification method based on the kernel of graph and convolutional neural networks
KR20190063978A (en) Automatic classification method of unstructured data
CN109871955A (en) A kind of aviation safety accident causality abstracting method
Santos et al. Assessing the impact of contextual embeddings for Portuguese named entity recognition
CN108920482B (en) Microblog short text classification method based on lexical chain feature extension and LDA (latent Dirichlet Allocation) model
CN111222338A (en) Biomedical relation extraction method based on pre-training model and self-attention mechanism
CN109815400A (en) Personage's interest extracting method based on long text
Sartakhti et al. Persian language model based on BiLSTM model on COVID-19 corpus
CN115952292B (en) Multi-label classification method, apparatus and computer readable medium
CN113051887A (en) Method, system and device for extracting announcement information elements
Stewart et al. Seq2kg: an end-to-end neural model for domain agnostic knowledge graph (not text graph) construction from text
CN107357895A (en) A kind of processing method of the text representation based on bag of words
CN107526721A (en) A kind of disambiguation method and device to electric business product review vocabulary
Xu et al. Chinese event detection based on multi-feature fusion and BiLSTM
CN111858933A (en) Character-based hierarchical text emotion analysis method and system
CN111428502A (en) Named entity labeling method for military corpus
CN112528653B (en) Short text entity recognition method and system
Campbell et al. Content+ context networks for user classification in twitter

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant