CN110232127A - File classification method and device - Google Patents
- Publication number
- CN110232127A (application CN201910523985.9A)
- Authority
- CN
- China
- Prior art keywords
- information
- text
- word
- sequence
- sequence information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/247—Thesauruses; Synonyms
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Data Mining & Analysis (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention provides a text classification method and device, relating to the field of natural language processing. The method comprises: obtaining a text to be classified; obtaining word vectors of the text according to its word information and a preset model, and obtaining a text vector of the text according to the word vectors and a preset algorithm; training on the sequence information with a bidirectional long short-term memory (LSTM) model to predict the association between the word information and the sequence information; and integrating the text vector with that association and inputting the integrated result into a preset classification model to obtain the category of the text. By using a bidirectional LSTM model on the sequence information to compute the probability that the current word sequence occurs, combined with the text vector, the semantic features and sequence features of the text to be classified can be predicted accurately; inputting the integrated semantic and sequence features into a classifier yields an accurate classification of the text.
Description
Technical field
The present invention relates to the field of natural language processing, and in particular to a text classification method and device.
Background technique
In the field of natural language processing (Natural Language Processing, NLP), text classification helps users manage text data efficiently and provides basic support for text mining, for example in news categorization and typesetting, archive classification management, search-engine information retrieval, and answer retrieval in question-answering systems.
In the prior art, text classification methods fall into three types: the vector space model (Vector Space Model, VSM), topic models, and deep-learning language models. The VSM obtains the classification information of a text from word statistics and is generally suited to long texts; topic models obtain classification information by learning shallow semantic information of the text; deep-learning language models can learn text feature representations automatically.
However, the VSM typically loses part of the semantic and sequence information. Topic models learn only shallow semantics, yielding fuzzy and coarse-grained meanings. Deep-learning language models lack ease of use and scalability in text classification: a model suited to short-text classification is unsuitable for long texts, and one suited to long-text classification is unsuitable for short texts.
Summary of the invention
The object of the present invention is to address the above deficiencies of the prior art by providing a text classification method and device, so as to solve the problems of poor ease of use and inaccurate classification in prior-art text classification models.
To achieve the above object, the technical solutions adopted in the embodiments of the present invention are as follows:
In a first aspect, an embodiment of the present invention provides a text classification method, comprising: obtaining a text to be classified, wherein the text to be classified includes word information and sequence information, and multiple pieces of word information constitute the sequence information; obtaining word vectors of the text according to the word information and a preset model, and obtaining a text vector of the text according to the word vectors and a preset algorithm; training on the sequence information with a bidirectional long short-term memory (LSTM) model to predict the association between the word information and the sequence information, wherein the bidirectional LSTM model includes a forward LSTM model and a backward LSTM model; and integrating the text vector with the association between word information and sequence information, and inputting the integrated result into a preset classification model to obtain the category of the text.
Optionally, before training on the sequence information with the bidirectional LSTM model to predict the association between word information and sequence information, the method further includes: if the number of words in the sequence information is less than a preset length, padding with a default value to obtain a padded sequence whose word count equals the preset length; and training on the padded sequence with the bidirectional LSTM model to obtain the association between the word information and the padded sequence.
Optionally, training on the sequence information with the bidirectional LSTM model to predict the association between the word information and the sequence information comprises: if the number of words in the sequence information is greater than the preset length, deleting part of the word information to obtain a truncated sequence whose word count equals the preset length; and training on the truncated sequence with the bidirectional LSTM model to obtain the association between the word information and the truncated sequence.
Optionally, training on the sequence information with the bidirectional LSTM model to obtain the association between the word information and the sequence information comprises:
Training on the sequence information with the forward LSTM model, where the association between word information and sequence information is
p(t_1, t_2, \dots, t_n) = \prod_{k=1}^{n} p(t_k \mid t_1, \dots, t_{k-1}),
where n denotes that the text to be classified contains n words, the phrase is (t_1, t_2, \dots, t_{n-1}, t_n), and p(t_1, t_2, \dots, t_{k-1}, t_k) denotes the probability of the sentence with the sequence structure (t_1, t_2, \dots, t_{k-1}, t_k) occurring;
Training on the sequence information with the backward LSTM model, where the association between word information and sequence information is
p(t_1, t_2, \dots, t_n) = \prod_{k=1}^{n} p(t_k \mid t_{k+1}, \dots, t_n),
where p(t_{k+1}, t_{k+2}, \dots, t_{n-1}, t_n) denotes the probability of the sentence with the sequence structure (t_{k+1}, t_{k+2}, \dots, t_{n-1}, t_n) occurring;
Combining the forward and backward language information into a single-layer bidirectional LSTM model, whose log-likelihood function is:
\sum_{k=1}^{n} \bigl( \log p(t_k \mid t_1, \dots, t_{k-1}) + \log p(t_k \mid t_{k+1}, \dots, t_n) \bigr);
and predicting, according to the above formula, the association between the current word and its preceding and following words in the sequence information.
Optionally, before obtaining the word vectors of the text to be classified according to its word information and the preset model, the method further includes: performing word segmentation on the text to be classified.
In a second aspect, an embodiment of the present invention further provides a text classification device, comprising:
An obtaining module for obtaining a text to be classified, wherein the text to be classified includes word information and sequence information, and multiple pieces of word information constitute the sequence information; the obtaining module being specifically configured to obtain word vectors of the text according to the word information and a preset model, and to obtain a text vector of the text according to the word vectors and a preset algorithm; a prediction module for training on the sequence information with a bidirectional long short-term memory (LSTM) model to predict the association between word information and sequence information, wherein the bidirectional LSTM model includes a forward LSTM model and a backward LSTM model; and an integration module for integrating the text vector with the association between word information and sequence information, and inputting the integrated result into a preset classification model to obtain the category of the text.
Optionally, the device further includes a padding module configured to: if the number of words in the sequence information is less than a preset length, pad with a default value to obtain a padded sequence whose word count equals the preset length; and train on the padded sequence with the bidirectional LSTM model to obtain the association between the word information and the padded sequence.
Optionally, the padding module is further configured to: if the number of words in the sequence information is greater than the preset length, delete part of the word information to obtain a truncated sequence whose word count equals the preset length; and train on the truncated sequence with the bidirectional LSTM model to obtain the association between the word information and the truncated sequence.
Optionally, the prediction module is specifically configured to: train on the sequence information with the forward LSTM model, where the association between word information and sequence information is
p(t_1, t_2, \dots, t_n) = \prod_{k=1}^{n} p(t_k \mid t_1, \dots, t_{k-1}),
where n denotes that the text to be classified contains a phrase of n words (t_1, t_2, \dots, t_{n-1}, t_n), and p(t_1, t_2, \dots, t_{k-1}, t_k) denotes the probability of the sentence with the sequence structure (t_1, t_2, \dots, t_{k-1}, t_k) occurring; train on the sequence information with the backward LSTM model, where the association between word information and sequence information is
p(t_1, t_2, \dots, t_n) = \prod_{k=1}^{n} p(t_k \mid t_{k+1}, \dots, t_n),
where p(t_{k+1}, t_{k+2}, \dots, t_{n-1}, t_n) denotes the probability of the sentence with the sequence structure (t_{k+1}, t_{k+2}, \dots, t_{n-1}, t_n) occurring; combine the forward and backward language information into a single-layer bidirectional LSTM model, whose log-likelihood function is
\sum_{k=1}^{n} \bigl( \log p(t_k \mid t_1, \dots, t_{k-1}) + \log p(t_k \mid t_{k+1}, \dots, t_n) \bigr);
and predict, according to the above formula, the association between the current word and its preceding and following words in the sequence information.
Optionally, the device further includes a processing module for performing word segmentation on the text to be classified.
The beneficial effects of the present invention are:
The present invention provides a text classification method and device. The text classification method comprises: obtaining a text to be classified, wherein the text includes word information and sequence information, and multiple pieces of word information constitute the sequence information; obtaining word vectors of the text according to the word information and a preset model, and obtaining a text vector of the text according to the word vectors and a preset algorithm; training on the sequence information with a bidirectional long short-term memory (LSTM) model to predict the association between the word information and the sequence information, wherein the bidirectional LSTM model includes a forward LSTM model and a backward LSTM model; and integrating the text vector with the association and inputting the integrated result into a preset classification model to obtain the category of the text. By using the bidirectional LSTM model on the sequence information to compute the probability that the current word sequence occurs, combined with the text vector, the present invention can accurately predict the semantic features and sequence features of the text to be classified; inputting the integrated semantic and sequence features into a classifier yields an accurate classification of the text.
Detailed description of the invention
In order to explain the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the embodiments are briefly described below. It should be understood that the following drawings illustrate only certain embodiments of the present invention and should not be regarded as limiting its scope; those of ordinary skill in the art can obtain other related drawings from these drawings without creative effort.
Fig. 1 is a flow diagram of a text classification method provided by the present invention;
Fig. 2 is a schematic diagram of a text classification method provided by the present invention;
Fig. 3 is a flow diagram of a text classification method provided by an embodiment of the present invention;
Fig. 4 is a flow diagram of a text classification method provided by a further embodiment of the present invention;
Fig. 5 is a flow diagram of a text classification method provided by another embodiment of the present invention;
Fig. 6 is a structural schematic diagram of a text classification device provided by the present invention;
Fig. 7 is a structural schematic diagram of a text classification device provided by an embodiment of the present invention;
Fig. 8 is a structural schematic diagram of a text classification device provided by a further embodiment of the present invention;
Fig. 9 is a structural schematic diagram of a text classification device provided by another embodiment of the present invention.
Specific embodiment
To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them.
Explanation of nouns:
Word2vec (word to vector): a group of related models used to generate word vectors. These models predict the surrounding words from the current word, and the current word from its surrounding words, thereby producing word vectors. After training, a word2vec model can map each word to a vector that represents word-to-word relationships.
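As a sketch of the "predict surrounding words" idea in the word2vec entry above, the training pairs a skip-gram-style model sees can be generated as follows; the helper name and window size are illustrative, not part of the patent:

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) training pairs: a skip-gram model
    learns to predict each context word from its center word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the center word itself
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs(["the", "sun", "is", "bright", "today"], window=1)
```

With a window of 1, each interior word contributes two pairs (one per neighbor) and each boundary word one, so the five-token sentence yields eight training pairs.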
ELMo model: a novel deep contextualized word representation that models both complex word characteristics (such as syntax and semantics) and how word usage varies across linguistic contexts (i.e., it models polysemy).
Long short-term memory network (Long Short-Term Memory, LSTM): an LSTM cell contains three gates, called the input gate, the forget gate, and the output gate. When a piece of information enters the LSTM network, rules determine whether it is useful: only information that passes the gating is retained, while information that does not is forgotten through the forget gate.
Fig. 1 is a flow diagram of a text classification method provided by the present invention. The method can be executed by a device with processing capability, such as a computer or a server. As shown in Fig. 1, the method comprises:
S110: obtain a text to be classified, wherein the text to be classified includes word information and sequence information, and multiple pieces of word information constitute the sequence information.
The text to be classified can be a brief sentence or phrase, or a document such as a news report or a literary work. It can be a short text or a long text; no restriction is imposed here.
Whether the text to be classified is short or long, its content consists of sentences, and each sentence consists of multiple words; the state of multiple words before they are composed into a sentence is referred to as sequence information.
For example, a text contains the sentence "The sun is bright today", in which "today", "sunshine", and "bright" are the word information. For this sentence there are multiple sequences that can be formed from these words, such as: today/sunshine/bright, today/bright/sunshine, sunshine/bright/today, sunshine/today/bright, bright/sunshine/today, bright/today/sunshine. Clearly, only the sequence today/sunshine/bright is the one required to compose the sentence in the text; hence, the state of multiple words before they form a sentence is the sequence information.
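The six candidate orderings of the three example words can be enumerated mechanically; a minimal Python sketch, using English stand-ins for the Chinese words of the example:

```python
from itertools import permutations

# English stand-ins for the three words of the example sentence.
words = ["today", "sunshine", "bright"]

# All 3! = 6 candidate sequences of the word information.
orderings = list(permutations(words))

# Only one ordering matches the sequence required by the sentence.
target = ("today", "sunshine", "bright")
```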
S120: obtain the word vectors of the text to be classified according to its word information and a preset model, and obtain the text vector of the text according to the word vectors and a preset algorithm.
Since a lexical database contains a very large amount of information, words with similar meanings need to be treated statistically; therefore, each word is mapped into a vector space and represented by a word vector (word embedding).
In obtaining the word vectors, because the word information of a single text to be classified is limited, a preset model is combined in training so that the resulting word vectors can represent the word information accurately and with high quality.
The preset model can be a Word2vec model or an ELMo model; the present invention imposes no restriction.
In the vector space, words with similar meanings lie closer to each other. For example, the sentence "The sun is bright today" contains three words: "today", "sunshine", and "bright". Words semantically similar to "today", such as "tomorrow", "the day after tomorrow", and "yesterday", lie close to "today" in the vector space. When such similar words replace the original word in the sentence, the resulting sentences express similar meanings, and their classification categories are close.
It should be noted that the text classification method provided by the present invention can also use a pretrained word-vector model, such as the 8-million word-vector dataset open-sourced in 2018.
Further, after all the word information in the text to be classified has been represented by word vectors, the word vectors of the text are combined by a preset algorithm to obtain the text vector of the text, so that a single specific vector can express the text information.
The preset algorithm sums the word vectors corresponding to the words in the text to be classified and takes the average. If the phrase of n words is (t_1, t_2, \dots, t_{n-1}, t_n), the calculation is:
v(\text{text}) = \frac{1}{n} \sum_{i=1}^{n} v(t_i),
where v(t) denotes the representation of word t as a space vector.
Through the above formula, the text vector of the text to be classified can be obtained.
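The averaging step can be sketched in a few lines; the 4-dimensional vectors here are made-up stand-ins for embeddings that a pretrained model such as word2vec would supply:

```python
import numpy as np

# Toy word vectors; real embeddings would come from a pretrained model.
word_vectors = {
    "today":    np.array([0.1, 0.3, 0.0, 0.2]),
    "sunshine": np.array([0.4, 0.1, 0.2, 0.0]),
    "bright":   np.array([0.1, 0.2, 0.4, 0.1]),
}

def text_vector(tokens, vectors):
    """v(text) = (1/n) * sum_i v(t_i): sum the word vectors, take the mean."""
    return np.mean([vectors[t] for t in tokens], axis=0)

v = text_vector(["today", "sunshine", "bright"], word_vectors)
```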
S130: train on the sequence information with a bidirectional long short-term memory (LSTM) model to predict the association between word information and sequence information, wherein the bidirectional LSTM model includes a forward LSTM model and a backward LSTM model.
A traditional LSTM model memorizes only the language information that precedes a position in the sequence: it can learn how the earlier words in a sentence predict the later words, but not how the later words predict the earlier ones. Therefore, the text classification method provided by the present invention trains a bidirectional LSTM model on the sequence information, which improves the accuracy with which the model learns the sequence information.
Fig. 2 is a schematic diagram of a text classification method provided by the present invention. As shown in Fig. 2, the semantic features of the text to be classified are represented by word information, and the forward LSTM model and the backward LSTM model are trained on the preceding and following words of each word respectively, finally outputting the preceding and following words in the sequence information that are associated with the current word.
It should be noted that the above association indicates the preceding and following words that can combine with the current word to form sentence information.
In one embodiment, using the example sentence "The sun is bright today", when the forward LSTM model is trained on the sequence information, it learns that "sunshine" may be followed by "bright", and possibly also by "brilliant" or "very good", while the probability that "today" follows "sunshine" is small. The purpose of the forward LSTM model is to learn, through training, the probabilities of the words that can follow "sunshine".
When the backward LSTM model is trained on the sequence information, it learns that "sunshine" may be preceded by "today", or by "tomorrow" or "here", thereby obtaining the probabilities of the words that can precede "sunshine".
Further, through the above embodiment, the sentence "The sun is bright today" can be extended to semantically similar sentences such as "The sun is brilliant today", "The sun is very good today", and "The sun is bright tomorrow".
It should be noted that the text classification method provided by the present invention, which trains forward and backward LSTM models on the sequence information to obtain the preceding and following words associated with the current word, is not limited to the above embodiment.
Optionally, the association between word information and sequence information predicted by training the bidirectional LSTM model on the sequence information can be represented by a vector that memorizes the foregoing sequence information; that is, the result output by the forward LSTM model and the backward LSTM model can be a vector that memorizes the sequence information.
Optionally, since the forward LSTM model outputs a vector memorizing the forward sequence information and the backward LSTM model outputs a vector memorizing the backward sequence information, the vector memorizing the sequence information can be obtained by averaging the vectors output by the forward and backward LSTM models. For example, all the vectors can be added and their average taken. It should be noted that if the vectors are multi-dimensional, the resulting vector is a multi-dimensional vector obtained by averaging each dimension of the vectors separately.
As described above, characterizing the predicted association between word information and sequence information by the average of the sequence-information vectors is conducive to estimating the sequence information macroscopically, and also helps keep the dimensionality consistent with that of the semantic vector.
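The dimension-wise averaging described above amounts to the following; the vectors are illustrative stand-ins for the forward and backward LSTM outputs at the current position:

```python
import numpy as np

# Hypothetical outputs of the forward and backward LSTM models; in the
# method they memorize the preceding and following sequence information.
h_forward  = np.array([0.2, 0.6, 0.4])
h_backward = np.array([0.4, 0.2, 0.0])

# Add the vectors and take the average, dimension by dimension, so the
# result keeps the same dimensionality as the semantic vectors.
h_context = (h_forward + h_backward) / 2.0
```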
S140: integrate the text vector with the association between word information and sequence information, and input the integrated result into a preset classification model to obtain the category of the text.
Since a text contains multiple sentences, the text vector obtained in step S120 is integrated with the preceding and following words of the current word predicted from the sequence information in step S130, yielding multiple extended texts of the current text to be classified. Each extended text may include the semantics carried by the aforementioned text vector and the vector of the foregoing sequence information.
The multiple extended texts are input into a preset classification model, which automatically derives the category of the text.
It should be noted that the preset classification model can be a softmax classification model, but the present invention is not limited to the softmax classification model.
The text classification method provided by the present invention computes, with a bidirectional LSTM model, the probability that the current word sequence occurs in the sequence information; combined with the text vector, the semantic features and sequence features of the text to be classified can be predicted accurately, and inputting the integrated semantic and sequence features into a classifier yields an accurate classification of the text.
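A minimal sketch of the integration-plus-classifier step, under the assumption that "integration" is a simple concatenation and the classifier is a softmax layer; the weights are random stand-ins for a trained model, not part of the patent:

```python
import numpy as np

rng = np.random.default_rng(0)

text_vec    = np.array([0.2, 0.2, 0.1])  # text vector from S120 (toy values)
context_vec = np.array([0.3, 0.4, 0.2])  # sequence vector from S130 (toy values)

# Integration step, assumed here to be concatenation.
features = np.concatenate([text_vec, context_vec])

# Hypothetical trained softmax layer over 4 classes.
W = rng.normal(size=(4, features.size))

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

probs = softmax(W @ features)
predicted_class = int(np.argmax(probs))
```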
Fig. 3 is a flow diagram of a text classification method provided by an embodiment of the present invention. As shown in Fig. 3, before step S130 is executed, the word information in the sequence information needs to be preprocessed. The preprocessing includes:
S210: if the number of words in the sequence information is less than a preset length, pad with a default value to obtain a padded sequence whose word count equals the preset length.
Since the data an LSTM model can process as input must be of fixed length, the word information of the sequence to be input must be preprocessed. For example, the average number of words contained in the sequences of the text to be classified is taken as the preset length, calculated as:
\bar{l} = \frac{1}{N} \sum_{i=1}^{N} |s_i|,
where \bar{l} denotes the average number of words contained in a sequence, s_i denotes the i-th sequence in the text, i is the sequence index, and N is the total number of sequences.
If the number of words in the sequence information is less than the preset length, default values are appended up to the average length; the default value here is 0.
In another embodiment, if the average word length of the sequence information is 8 and the sentence is "The weather is nice and the sun is bright today", then the first input of the LSTM model is v("today"), the second input is v("weather"), ..., the seventh input is 0, and the n-th input is 0. Here v(x) denotes the vector value of word x.
S220: train on the padded sequence with the bidirectional LSTM model to obtain the association between the word information and the padded sequence.
Fig. 4 is a flow diagram of a text classification method provided by a further embodiment of the present invention. As shown in Fig. 4, the method of preprocessing the word information in the sequence information further includes:
S310: if the number of words in the sequence information is greater than the preset length, delete part of the word information to obtain a truncated sequence whose word count equals the preset length.
That is, when the number of words contained in the sequence information exceeds the hyperparameter average length, the excess words are deleted; the deletion removes the word vectors beyond the average length.
In another embodiment, if the average word length of the sequence information is 3 and the sentence is "The weather is nice and the sun is bright today", then the first input of the LSTM model is v("today"), the second input is v("weather"), and the third input is v("nice"); the word vectors of the subsequent words are deleted. Here v(x) denotes the vector value of word x.
It should be noted that the average word length of the sequence information is not limited to the above embodiments.
S320: train on the truncated sequence with the bidirectional LSTM model to obtain the association between the word information and the truncated sequence.
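The two preprocessing branches (padding in S210, truncation in S310) can be combined in one helper; the function name is illustrative, and the pad value follows the text's default of 0:

```python
def pad_or_truncate(seq, preset_length, pad_value=0):
    """Pad a too-short sequence with the default value, or delete the
    entries beyond the preset length of a too-long one."""
    if len(seq) < preset_length:
        return seq + [pad_value] * (preset_length - len(seq))
    return seq[:preset_length]

# Stand-ins for word-vector inputs v("today"), v("weather"), ...
padded    = pad_or_truncate([5, 7, 9], 5)        # padded up to length 5
truncated = pad_or_truncate([1, 2, 3, 4, 5], 3)  # excess entries deleted
```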
Further, step is 130, comprising:
Training the sequence information with the forward LSTM model yields the association between the word information and the sequence information. The calculation formula is:

p(t1, t2, ..., tn) = ∏k=1..n p(tk | t1, ..., tk-1)

where n indicates that the phrase in the text to be classified containing n pieces of word information is (t1, t2, ..., tn-1, tn), tk denotes the k-th piece of word information, and p(t1, t2, ..., tk) denotes the probability that a sentence with the sequence structure (t1, t2, ..., tk-1, tk) occurs.
From the above embodiment it can be seen that the sentence "The sun is bright today" has many possible compositions, but only one target composition; p(t1, t2, ..., tk) denotes the probability that the sentence with the target sequence structure occurs.
After training the sequence information with the backward LSTM model, the association between the word information and the sequence information is obtained. The calculation formula is:

p(t1, t2, ..., tn) = ∏k=1..n p(tk | tk+1, ..., tn)

where p(tk+1, tk+2, ..., tn-1, tn) denotes the probability that a sentence with the sequence structure (tk+1, tk+2, ..., tn-1, tn) occurs.
Combining the forward language information and the backward language information in a single-layer bidirectional LSTM model, the log-likelihood function is as follows:

∑k=1..n ( log p(tk | t1, ..., tk-1) + log p(tk | tk+1, ..., tn) )
According to the above formula, the association between the current word information in the sequence information and the forward and backward word information can be predicted.
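The forward and backward chain-rule decompositions and their combined log-likelihood can be illustrated numerically (a toy sketch only: the hand-specified conditional probability tables `fwd_table` and `bwd_table` are invented for illustration and stand in for the outputs of the trained LSTMs):

```python
import math

def forward_log_prob(tokens, cond_prob):
    # log p(t1..tn) = sum_k log p(tk | t1..tk-1)  (forward decomposition)
    return sum(math.log(cond_prob[(tuple(tokens[:k]), tokens[k])])
               for k in range(len(tokens)))

def backward_log_prob(tokens, cond_prob):
    # log p(t1..tn) = sum_k log p(tk | tk+1..tn)  (backward decomposition)
    return sum(math.log(cond_prob[(tuple(tokens[k + 1:]), tokens[k])])
               for k in range(len(tokens)))

def bidirectional_log_likelihood(tokens, fwd, bwd):
    # Sum of the two directional log-likelihoods, as in the objective above.
    return forward_log_prob(tokens, fwd) + backward_log_prob(tokens, bwd)

# Toy conditional probabilities keyed by (context words, current word).
fwd_table = {((), "today"): 0.5, (("today",), "sunny"): 0.8}
bwd_table = {(("sunny",), "today"): 0.7, ((), "sunny"): 0.5}
```

For the two-word sequence ["today", "sunny"], the forward decomposition multiplies 0.5 by 0.8 and the backward decomposition multiplies 0.7 by 0.5; the bidirectional objective sums the two log-probabilities.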
Fig. 5 is a flow diagram of the text classification method provided by another embodiment of the invention. As shown in Fig. 5, before step S120 is executed, the text to be classified also needs to be preprocessed. The preprocessing includes:
S111: perform word segmentation on the text to be classified.
Before learning the text vector representation, the text needs to be segmented. Word segmentation is the process of splitting a continuous sentence sequence back into a sequence of words according to certain conventions. In Chinese text, the characters of a sentence all run together, whereas in English text words are delimited by spaces. Chinese can only be coarsely delimited at the level of characters, sentences, and paragraphs by obvious separator symbols; at the word level there is no separator, and characters join together to form a complete written expression. Therefore, to obtain accurate word information, word segmentation must be performed.
For example, segmenting the sentence "This cartoon is deeply loved by children" yields the words this / cartoon / deeply / loved by / children. Only on the basis of accurate segmentation can the accurate association between the sequence information and the word information be obtained.
In the text classification method provided by the invention, the word segmentation may use string-matching-based algorithms, statistics-based segmentation methods, machine-learning-based mechanical segmentation algorithms, and so on; the segmentation algorithm applied to the word segmentation is not restricted here.
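As one concrete instance of the string-matching family mentioned above, forward maximum matching greedily takes the longest dictionary word at each position (a minimal sketch only; the dictionary below is a hypothetical toy vocabulary, not one prescribed by the patent):

```python
def forward_max_match(sentence, dictionary, max_word_len=4):
    # Greedy string matching: at each position take the longest dictionary
    # word, falling back to a single character when nothing matches.
    words, i = [], 0
    while i < len(sentence):
        for length in range(min(max_word_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words
```

For example, with the dictionary {"今天", "天气", "不错"}, the sentence "今天天气不错" ("the weather is nice today") segments into 今天 / 天气 / 不错.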
Further, the training process of the classification model used by the invention for text classification is as follows: obtain text samples of different categories, where each text sample carries a class label. This process means obtaining a large number of texts of known categories, for example a large number of "sports" texts. Each known-category text is then segmented, yielding the words contained in each text, and the word information after segmentation is further given the "sports" class label. For example, the classification model can classify a given document; the document may be a text collection composed of multiple articles, and each text in the collection contains text data such as articles, sentences, and words.
Further, the feature information of the words describing the "sports" class in each text is extracted; the classifier learns from a large number of "sports" texts and thereby obtains the common features of the word information contained in "sports" texts.
When any text is input to the classification model, the classifier can automatically determine whether the text belongs to the "sports" class.
It should be noted that the text classification method provided by the embodiments of the invention is not limited to the "sports" class and may also include other categories.
In one embodiment, the text classification method provided by the invention may also be based on sentence information: the text to be classified contains multiple pieces of sentence information, each piece of sentence information is converted into a corresponding sentence vector, and a text vector over the sentence vectors of the text to be classified is further obtained.
Further, through the bidirectional LSTM model, the associations between the current sentence information and the preceding and following sentence information are trained separately, and the association information is combined with the text vector to carry out the text classification method.
It should be noted that classifying on the basis of sentence information allows the long text contained in a document to be classified accurately, improving the accuracy and usability of the classification algorithm.
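Where a text vector over sentence vectors is needed, one simple aggregation (an illustrative choice; the patent does not fix the aggregation algorithm) is component-wise averaging of the sentence vectors:

```python
def text_vector_from_sentences(sentence_vectors):
    # Average the sentence vectors component-wise into one text-level vector.
    dim = len(sentence_vectors[0])
    n = len(sentence_vectors)
    return [sum(vec[i] for vec in sentence_vectors) / n for i in range(dim)]
```

The resulting vector has the same dimensionality as each sentence vector and summarizes the whole text.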
Fig. 6 is a structural diagram of the text classification device provided by the invention. As shown in Fig. 6, the device specifically includes: an acquisition module 601, a prediction module 602, and an integration module 603. Wherein:
The acquisition module 601 is used to obtain the text to be classified, wherein the text to be classified includes word information and sequence information, and multiple pieces of word information constitute the sequence information.
The acquisition module 601 is specifically used to obtain the word vectors of the text to be classified according to the word information of the text to be classified and a preset model, and to obtain the text vector of the text to be classified according to the word vectors and a preset algorithm.
The prediction module 602 is used to train the sequence information with the bidirectional long short-term memory network (LSTM) model and predict the association between the word information and the sequence information, wherein the bidirectional LSTM model includes a forward LSTM model and a backward LSTM model.
The integration module 603 is used to integrate the text vector with the association of the sequence information, and to input the integrated text vector and association into a preset classification model to obtain the category of the text.
Fig. 7 is a structural diagram of the text classification device provided by one embodiment of the invention. As shown in Fig. 7, the device provided by the invention further includes a supplement module 604, in which:
The supplement module 604 is used to supplement default values if the number of pieces of word information in the sequence information is less than the preset length, obtaining supplemented sequence information whose word count equals the preset length, and to train the supplemented sequence information with the bidirectional LSTM model to obtain the association between the word information and the supplemented sequence information.
Further, the supplement module 604 is specifically used to delete part of the word information if the number of pieces of word information in the sequence information is greater than the preset length, obtaining truncated sequence information whose word count equals the preset length, and to train the truncated sequence information with the bidirectional LSTM model to obtain the association between the word information and the truncated sequence information.
Optionally, the prediction module 602 is specifically used to train the sequence information with the forward LSTM model, obtaining the association between the word information and the sequence information as

p(t1, t2, ..., tn) = ∏k=1..n p(tk | t1, ..., tk-1)

wherein n indicates that the phrase in the text to be classified containing n pieces of word information is (t1, t2, ..., tn-1, tn), and p(t1, t2, ..., tk-1, tk) denotes the probability that a sentence with the sequence structure (t1, t2, ..., tk-1, tk) occurs; to train the sequence information with the backward LSTM model, obtaining the association between the word information and the sequence information as

p(t1, t2, ..., tn) = ∏k=1..n p(tk | tk+1, ..., tn)

wherein p(tk+1, tk+2, ..., tn-1, tn) denotes the probability that a sentence with the sequence structure (tk+1, tk+2, ..., tn-1, tn) occurs; to combine the forward and backward language information in a single-layer bidirectional LSTM model, with the log-likelihood function

∑k=1..n ( log p(tk | t1, ..., tk-1) + log p(tk | tk+1, ..., tn) )

and to predict, according to the above formulas, the association between the current word information in the sequence information and the forward and backward word information.
Fig. 8 is a structural diagram of the text classification device provided by a further embodiment of the invention. As shown in Fig. 8, the device provided by the invention further includes a processing module 605, in which:
The processing module 605 is used to perform word segmentation on the text to be classified.
The above device is used to execute the method provided by the preceding embodiments; its implementation principle and technical effect are similar and are not repeated here.
The above modules may be one or more integrated circuits configured to implement the above method, for example: one or more application-specific integrated circuits (ASIC), one or more digital signal processors (DSP), or one or more field-programmable gate arrays (FPGA), etc. As another example, when one of the above modules is realized in the form of a processing element scheduling program code, the processing element may be a general-purpose processor, such as a central processing unit (CPU) or another processor that can call program code. As another example, these modules may be integrated together and realized in the form of a system-on-a-chip (SoC).
Fig. 9 is a structural diagram of the text classification device provided by a further embodiment of the invention. The device may be integrated in a terminal device or in a chip of a terminal device, and the terminal may be a computing device with classification processing capability.
As shown in Fig. 9, the device includes a memory 901 and a processor 902.
The memory 901 is used to store a program; the processor 902 calls the program stored in the memory 901 to execute the above method embodiments. The specific implementation and technical effect are similar and are not repeated here.
Optionally, the invention also provides a program product, for example a computer-readable storage medium, including a program which, when executed by a processor, is used to execute the above method embodiments.
In the several embodiments provided by the present invention, it should be understood that the disclosed device and method may be realized in other ways. For example, the device embodiments described above are merely illustrative; the division into units is only a logical functional division, and there may be other division manners in actual implementation, for example multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment's scheme.
In addition, the functional units in the various embodiments of the present invention may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The above integrated unit may be realized in the form of hardware, or in the form of hardware plus a software functional unit.
The above integrated unit realized in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to execute part of the steps of the methods of the various embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
Claims (10)
1. A text classification method, characterized by comprising:
obtaining a text to be classified, wherein the text to be classified includes word information and sequence information, and multiple pieces of word information constitute the sequence information;
obtaining word vectors of the text to be classified according to the word information of the text to be classified and a preset model, and obtaining a text vector of the text to be classified according to the word vectors and a preset algorithm;
training the sequence information with a bidirectional long short-term memory network (LSTM) model to predict the association between the word information and the sequence information, wherein the bidirectional LSTM model includes a forward LSTM model and a backward LSTM model;
integrating the text vector with the association of the sequence information, and inputting the integrated text vector and association into a preset classification model to obtain the category of the text.
2. The text classification method according to claim 1, characterized in that, before training the sequence information with the bidirectional long short-term memory network LSTM model and predicting the association between the word information and the sequence information, the method further comprises:
if the number of pieces of word information in the sequence information is less than a preset length, supplementing default values to obtain supplemented sequence information, the number of pieces of word information in the supplemented sequence information being the preset length;
training the supplemented sequence information with the bidirectional LSTM model to obtain the association between the word information and the supplemented sequence information.
3. The text classification method according to claim 2, characterized in that training the sequence information with the bidirectional long short-term memory network LSTM model and predicting the association between the word information and the sequence information comprises:
if the number of pieces of word information in the sequence information is greater than the preset length, deleting part of the word information to obtain truncated sequence information, the number of pieces of word information in the truncated sequence information being the preset length;
training the truncated sequence information with the bidirectional LSTM model to obtain the association between the word information and the truncated sequence information.
4. The text classification method according to any one of claims 1-3, characterized in that training the sequence information with the bidirectional long short-term memory network LSTM model to obtain the association between the word information and the sequence information comprises:
training the sequence information with the forward LSTM model to obtain the association between the word information and the sequence information as
p(t1, t2, ..., tn) = ∏k=1..n p(tk | t1, ..., tk-1),
wherein n indicates that the phrase in the text to be classified containing n pieces of word information is (t1, t2, ..., tn-1, tn), and p(t1, t2, ..., tk-1, tk) denotes the probability that a sentence with the sequence structure (t1, t2, ..., tk-1, tk) occurs;
training the sequence information with the backward LSTM model to obtain the association between the word information and the sequence information as
p(t1, t2, ..., tn) = ∏k=1..n p(tk | tk+1, ..., tn),
wherein p(tk+1, tk+2, ..., tn-1, tn) denotes the probability that a sentence with the sequence structure (tk+1, tk+2, ..., tn-1, tn) occurs;
combining the forward and backward language information in a single-layer bidirectional LSTM model, with the log-likelihood function
∑k=1..n ( log p(tk | t1, ..., tk-1) + log p(tk | tk+1, ..., tn) );
predicting, according to the above formulas, the association between the current word information in the sequence information and the forward and backward word information.
5. The text classification method according to claim 1, characterized in that, before obtaining the word vectors of the text to be classified according to the word information of the text to be classified and the preset model, the method further comprises:
performing word segmentation on the text to be classified.
6. A text classification device, characterized by comprising:
an acquisition module for obtaining a text to be classified, wherein the text to be classified includes word information and sequence information, and multiple pieces of word information constitute the sequence information;
the acquisition module being specifically used to obtain word vectors of the text to be classified according to the word information of the text to be classified and a preset model, and to obtain a text vector of the text to be classified according to the word vectors and a preset algorithm;
a prediction module for training the sequence information with a bidirectional long short-term memory network (LSTM) model to predict the association between the word information and the sequence information, wherein the bidirectional LSTM model includes a forward LSTM model and a backward LSTM model;
an integration module for integrating the text vector with the association of the sequence information, and inputting the integrated text vector and association into a preset classification model to obtain the category of the text.
7. The text classification device according to claim 6, characterized by further comprising a supplement module for: if the number of pieces of word information in the sequence information is less than a preset length, supplementing default values to obtain supplemented sequence information, the number of pieces of word information in the supplemented sequence information being the preset length; and training the supplemented sequence information with the bidirectional LSTM model to obtain the association between the word information and the supplemented sequence information.
8. The text classification device according to claim 7, characterized in that the supplement module is specifically used to: if the number of pieces of word information in the sequence information is greater than the preset length, delete part of the word information to obtain truncated sequence information, the number of pieces of word information in the truncated sequence information being the preset length; and train the truncated sequence information with the bidirectional LSTM model to obtain the association between the word information and the truncated sequence information.
9. The text classification device according to any one of claims 6-8, characterized in that the prediction module is specifically used to: train the sequence information with the forward LSTM model, obtaining the association between the word information and the sequence information as
p(t1, t2, ..., tn) = ∏k=1..n p(tk | t1, ..., tk-1),
wherein n indicates that the phrase in the text to be classified containing n pieces of word information is (t1, t2, ..., tn-1, tn), and p(t1, t2, ..., tk-1, tk) denotes the probability that a sentence with the sequence structure (t1, t2, ..., tk-1, tk) occurs; train the sequence information with the backward LSTM model, obtaining the association between the word information and the sequence information as
p(t1, t2, ..., tn) = ∏k=1..n p(tk | tk+1, ..., tn),
wherein p(tk+1, tk+2, ..., tn-1, tn) denotes the probability that a sentence with the sequence structure (tk+1, tk+2, ..., tn-1, tn) occurs;
combine the forward and backward language information in a single-layer bidirectional LSTM model, with the log-likelihood function
∑k=1..n ( log p(tk | t1, ..., tk-1) + log p(tk | tk+1, ..., tn) );
and predict, according to the above formulas, the association between the current word information in the sequence information and the forward and backward word information.
10. The text classification device according to claim 6, characterized by further comprising a processing module for performing word segmentation on the text to be classified.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910523985.9A CN110232127B (en) | 2019-06-17 | 2019-06-17 | Text classification method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110232127A true CN110232127A (en) | 2019-09-13 |
CN110232127B CN110232127B (en) | 2021-11-16 |
Family
ID=67860025
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910523985.9A Active CN110232127B (en) | 2019-06-17 | 2019-06-17 | Text classification method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110232127B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180240012A1 (en) * | 2017-02-17 | 2018-08-23 | Wipro Limited | Method and system for determining classification of text |
CN109243616A (en) * | 2018-06-29 | 2019-01-18 | 东华大学 | Mammary gland electronic health record joint Relation extraction and architectural system based on deep learning |
CN109726268A (en) * | 2018-08-29 | 2019-05-07 | 中国人民解放军国防科技大学 | Text representation method and device based on hierarchical neural network |
CN109740148A (en) * | 2018-12-16 | 2019-05-10 | 北京工业大学 | A kind of text emotion analysis method of BiLSTM combination Attention mechanism |
Non-Patent Citations (1)
Title |
---|
马建红 (Ma Jianhong) et al.: "A Patent Classification Method Based on Deep Learning", Computer Engineering (《计算机工程》) *
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111180025A (en) * | 2019-12-18 | 2020-05-19 | 东北大学 | Method and device for representing medical record text vector and inquiry system |
CN111461904A (en) * | 2020-04-17 | 2020-07-28 | 支付宝(杭州)信息技术有限公司 | Object class analysis method and device |
CN111461904B (en) * | 2020-04-17 | 2022-06-21 | 支付宝(杭州)信息技术有限公司 | Object class analysis method and device |
CN111930938A (en) * | 2020-07-06 | 2020-11-13 | 武汉卓尔数字传媒科技有限公司 | Text classification method and device, electronic equipment and storage medium |
CN111930942A (en) * | 2020-08-07 | 2020-11-13 | 腾讯云计算(长沙)有限责任公司 | Text classification method, language model training method, device and equipment |
CN111930942B (en) * | 2020-08-07 | 2023-08-15 | 腾讯云计算(长沙)有限责任公司 | Text classification method, language model training method, device and equipment |
CN113342933A (en) * | 2021-05-31 | 2021-09-03 | 淮阴工学院 | Multi-feature interactive network recruitment text classification method similar to double-tower model |
CN113342933B (en) * | 2021-05-31 | 2022-11-08 | 淮阴工学院 | Multi-feature interactive network recruitment text classification method similar to double-tower model |
Also Published As
Publication number | Publication date |
---|---|
CN110232127B (en) | 2021-11-16 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |