CN106528776A - Text classification method and device - Google Patents
Text classification method and device
- Publication number
- CN106528776A CN106528776A CN201610976847.2A CN201610976847A CN106528776A CN 106528776 A CN106528776 A CN 106528776A CN 201610976847 A CN201610976847 A CN 201610976847A CN 106528776 A CN106528776 A CN 106528776A
- Authority
- CN
- China
- Prior art keywords
- text
- sentence
- tag value
- corresponding tag
- classification
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Abstract
The invention provides a text classification method comprising the following steps: performing word segmentation on each sentence of a text to be classified; indexing each word in the text to obtain corresponding word indexes; combining the word indexes of each sentence into an index vector that serves as the sentence index of that sentence; feeding the sentence indexes into a deep learning classification model to obtain, for each sentence, the tag value corresponding to its deep learning classification information; and counting the occurrences of the tag values and taking the most frequent tag value as the tag value corresponding to the deep learning classification information of the text, which serves as the tag value of the text's classification information. The invention further provides a device corresponding to the method.
Description
Technical field
The present invention relates to a text classification method and apparatus, and more particularly to a method and apparatus for classifying text using deep learning technology.
Background technology
In natural language processing, text classification is particularly important as the foundation of applications such as content classification, sentiment analysis, and topic identification.
Text classification first segments the text into words according to its content and converts the words into vector representations. The prior art includes SVM (support vector machine), logistic regression, Random Forest, Bayes (Bayesian), and KNN (k-nearest neighbors) classifiers. SVM, logistic regression, and Random Forest are high-dimensional discriminative models based on word vectors and depend heavily on the chosen features. Bayes and KNN are based on statistical Bayesian models. The main problem of the high-dimensional discriminative models is that their vector representations cannot fully characterize the semantic information of the text, while determining the decision boundary of a Bayesian model is extremely difficult.
Content of the invention
It is an object of the invention to provide a text classification method and apparatus that, using deep learning technology, can obtain a better feature representation of the segmented text than conventional techniques, and thereby better classification results.
In accordance with the above object, the present invention provides a text classification method, the method comprising: performing word segmentation on each sentence of the text to be classified; indexing each word in the text to obtain corresponding word indexes; combining the word indexes of each sentence into an index vector that serves as the sentence index of that sentence; feeding the sentence indexes into a deep learning classification model to obtain the tag value corresponding to the deep learning classification information of each sentence; and counting the occurrences of the tag values and taking the most frequent tag value as the tag value corresponding to the deep learning classification information of the text, which serves as the tag value of the text's classification information.
In one embodiment, the training method of the deep learning classification model includes: performing word segmentation on each sentence of a historical text; indexing each word in the historical text to obtain corresponding historical word indexes; forming the historical word indexes of each sentence into a vector that serves as the historical sentence index of that sentence; and, according to the classification information of the historical text, training on the historical sentence indexes with a deep learning algorithm to obtain the deep learning classification model.
In one embodiment, the sentences are obtained by splitting the text at question marks, exclamation marks, or full stops.
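The splitting rule of this embodiment can be sketched as follows; this is an illustration rather than the patent's implementation, and the punctuation set and function name are our assumptions:

```python
import re

def split_sentences(text):
    """Split text into sentences at question marks, exclamation marks,
    or full stops (Chinese and Western forms), dropping empty fragments."""
    parts = re.split(r"[？！。?!.]", text)
    return [p.strip() for p in parts if p.strip()]

print(split_sentences("今天有球赛。你去看吗？太好了！"))  # ['今天有球赛', '你去看吗', '太好了']
```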
In one embodiment, the method further includes: feeding the sentence indexes into an SVM classifier and a Bayes classifier respectively to classify the text, obtaining the tag value corresponding to the SVM classification information of the text and the tag value corresponding to the Bayes classification information of the text; assigning respective weights to the tag value corresponding to the deep learning classification information of the text, the tag value corresponding to the SVM classification information, and the tag value corresponding to the Bayes classification information; and deriving the final tag value of the text's classification information according to the weights.
In one embodiment, assigning the respective weights includes: making the weight of the tag value corresponding to the deep learning classification information of the text greater than the weight of the tag value corresponding to the SVM classification information, and making the weight of the tag value corresponding to the SVM classification information greater than the weight of the tag value corresponding to the Bayes classification information.
In one embodiment, assigning the respective weights includes: giving the tag value corresponding to the deep learning classification information of the text, the tag value corresponding to the SVM classification information, and the tag value corresponding to the Bayes classification information weights of 0.5, 0.3, and 0.2 respectively.
In one embodiment, performing word segmentation on each sentence of the text to be classified includes: segmenting with the reverse maximum matching algorithm or the Viterbi algorithm.
In one embodiment, before the word segmentation, the method further includes: removing the stop words from the text.
The present invention also provides a text classification device, the device including: a word segmentation module for performing word segmentation on each sentence of the text to be classified; an indexation module for indexing each word in the text to obtain corresponding word indexes and for combining the word indexes of each sentence into an index vector that serves as the sentence index of that sentence; a classification module for feeding the sentence indexes into a deep learning classification model and outputting the tag value corresponding to the deep learning classification information of each sentence; and a statistics module for counting the occurrences of the tag values and taking the most frequent tag value as the tag value corresponding to the deep learning classification information of the text, which serves as the tag value of the text's classification information.
In one embodiment, the word segmentation module is further operable to perform word segmentation on each sentence of a historical text; the indexation module is further operable to index each word in the historical text to obtain corresponding historical word indexes and to form the historical word indexes of each sentence into a vector that serves as the historical sentence index of that sentence; and the device further includes a training module for training on the historical sentence indexes with a deep learning algorithm, according to the classification information of the historical text, to obtain the deep learning classification model.
In one embodiment, the device further includes a sentence-splitting module for splitting the text into sentences at question marks, exclamation marks, or full stops.
In one embodiment, the classification module further includes: an SVM classification unit for feeding the sentence indexes into an SVM classifier and obtaining the tag value corresponding to the SVM classification information of the text; and a Bayes classification unit for feeding the sentence indexes into a Bayes classifier and obtaining the tag value corresponding to the Bayes classification information of the text. The device further includes a weighting module for assigning respective weights to the tag value corresponding to the deep learning classification information of the text, the tag value corresponding to the SVM classification information, and the tag value corresponding to the Bayes classification information, and for deriving the final tag value of the text's classification information according to the weights.
In one embodiment, the weighting module makes the weight of the tag value corresponding to the deep learning classification information of the text greater than the weight of the tag value corresponding to the SVM classification information, and the weight of the tag value corresponding to the SVM classification information greater than the weight of the tag value corresponding to the Bayes classification information.
In one embodiment, the weighting module is further operable to give the tag value corresponding to the deep learning classification information of the text, the tag value corresponding to the SVM classification information, and the tag value corresponding to the Bayes classification information weights of 0.5, 0.3, and 0.2 respectively, and to derive the final tag value of the text's classification information according to the weights.
In one embodiment, the word segmentation module further includes one or both of the following units: a reverse maximum matching unit that segments with the reverse maximum matching algorithm, and a Viterbi unit that segments with the Viterbi algorithm.
In one embodiment, the device further includes a stop word removal module for removing the stop words from the text.
As described above, the present invention segments the text into words, indexes the words, and from these obtains sentence indexes. The numerical representation given by the sentence indexes provides the basis for applying a deep learning classification model, so that the text can be classified with the deep learning classification model provided by the present invention. Owing to the superiority of deep learning technology in representing the features of text content, the classification results of the present invention are significantly better than those of conventional techniques.
Description of the drawings
Fig. 1 is a flow chart of one embodiment of the text classification method of the present invention;
Fig. 2 is a flow chart of the training of the deep learning classification model;
Fig. 3 is a schematic diagram of one embodiment of the text classification device of the present invention.
Specific embodiments
The present invention first segments and indexes the sentences, thereby deriving the index of each sentence. On the basis of these sentence indexes, classification can be performed with a deep learning classification model as well as with prior-art classifiers such as an SVM classifier and a Bayes classifier. On text features, the shallow semantic model of deep learning iteratively revises the feature representation through iteration and the BPTT (backpropagation through time) algorithm, thereby arriving at an optimal feature representation. By repeatedly modifying the neural network weights with the BPTT algorithm, a classification model built on the best feature representation is established, and a model superior to traditional methods can thus be obtained. Meanwhile, by setting weights, the present invention also integrates existing classification methods with the deep learning classification method, further improving the fault tolerance of the classification.
Referring to Fig. 1, which is a flow chart of the text classification method of the present invention, the method includes:
101: performing word segmentation on each sentence of the text to be classified;
102: indexing each word in the text to obtain corresponding word indexes;
103: combining the word indexes of each sentence into an index vector that serves as the sentence index of that sentence;
104: feeding the sentence indexes into a deep learning classification model to obtain the tag value corresponding to the deep learning classification information of each sentence;
105: counting the occurrences of the tag values, and taking the most frequent tag value as the tag value corresponding to the deep learning classification information of the text, which serves as the tag value of the text's classification information.
A text is made up of sentences, and the most basic building blocks of a sentence are words. Step 101 is therefore performed first: each sentence of the text to be classified undergoes word segmentation, and each resulting word reflects, to some extent, the classification attributes of the text.
The prior art includes various word segmentation methods, chiefly the maximum matching method and the Viterbi algorithm.

Maximum matching uses a dictionary as its basis: the length of the longest word in the dictionary determines the initial length of the scan string, which is looked up in the dictionary (to improve scanning efficiency, multiple dictionaries can be built according to word length, and each lookup then goes to the dictionary of the matching length). For example, if the longest word in the dictionary is "People's Republic of China", seven Chinese characters in total, then maximum matching starts with a string of seven characters and shortens it character by character, looking each candidate up in the corresponding dictionary.
According to the scanning direction, string matching segmentation can be divided into forward matching and reverse matching; according to which length is matched first, into maximum (longest) matching and minimum (shortest) matching; and according to whether it is combined with part-of-speech tagging, into pure segmentation methods and integrated methods combining segmentation with tagging. Several common mechanical segmentation methods are:

1) forward maximum matching (scanning left to right);
2) reverse maximum matching (scanning right to left);
3) minimum segmentation (minimizing the number of words cut from each sentence).

These methods can also be combined with one another; for example, forward maximum matching and reverse maximum matching can be combined into a bidirectional matching method.
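Reverse maximum matching (method 2 above) can be sketched as follows. This is a minimal illustration assuming a simple set-based dictionary; a single character is accepted whenever no longer dictionary word ends at the current position:

```python
def reverse_max_match(sentence, dictionary, max_len=7):
    """Reverse maximum matching: scan right to left, greedily taking the
    longest dictionary word that ends at the current position."""
    words = []
    i = len(sentence)
    while i > 0:
        for n in range(min(max_len, i), 0, -1):
            candidate = sentence[i - n:i]
            if n == 1 or candidate in dictionary:  # fall back to a single character
                words.append(candidate)
                i -= n
                break
    words.reverse()  # words were collected from the end of the sentence
    return words

print(reverse_max_match("研究生命的起源", {"研究", "研究生", "生命", "的", "起源"}, max_len=3))
# ['研究', '生命', '的', '起源'] — forward matching would wrongly take '研究生' first
```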
The Viterbi algorithm, in turn, solves the classic problem of selecting the optimal state sequence in an HMM (hidden Markov model). Mapping the part-of-speech tagging problem onto an HMM can be expressed as follows: the number of states (parts of speech) in the model is the number N of part-of-speech symbols; the number of distinct symbols (words) that may be emitted from each state is the size M of the vocabulary. It is assumed that, statistically, the probability distribution of each part of speech depends only on the part of speech of the preceding word (a bigram over parts of speech), and that the probability distribution of each word depends only on its part of speech.
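The HMM formulation above can be illustrated with a compact Viterbi decoder; the states, probabilities, and words below are toy assumptions for the sketch, not values from the patent:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely state (part-of-speech) sequence for the observed words,
    using bigram transition and per-state emission probabilities."""
    # V[t][s] = (best probability of a path ending in state s at step t, that path)
    V = [{s: (start_p[s] * emit_p[s].get(obs[0], 0.0), [s]) for s in states}]
    for t in range(1, len(obs)):
        V.append({})
        for s in states:
            prob, path = max(
                (V[t - 1][p][0] * trans_p[p].get(s, 0.0) * emit_p[s].get(obs[t], 0.0),
                 V[t - 1][p][1] + [s])
                for p in states)
            V[t][s] = (prob, path)
    return max(V[-1].values())[1]

states = ("N", "V")  # noun, verb
start_p = {"N": 0.6, "V": 0.4}
trans_p = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}}
emit_p = {"N": {"time": 0.7, "flies": 0.3}, "V": {"time": 0.1, "flies": 0.9}}
print(viterbi(["time", "flies"], states, start_p, trans_p, emit_p))  # ['N', 'V']
```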
In a preferred embodiment, the sentences are obtained by splitting the text at question marks, exclamation marks, or full stops.
To use a deep learning algorithm, the text with its linguistic expressions, sentences, and words must be represented numerically, so that the closeness of the numerical values can directly represent the closeness of the classes expressed by the language.
In step 102, each word in the text is indexed to obtain the corresponding word indexes. Here, the words are given gradually increasing index numbers starting from 1, in their order of appearance.

It should be noted that the index numbers may start from some other number, and arbitrary symbols other than numerals may also be used; this does not limit the scope of the invention.
Since each sentence is made up of words, once the indexes of all the words of the text are obtained, step 103 is performed: the word indexes of each sentence are combined into an index vector that serves as the sentence index of that sentence.

For example, suppose some sentence in a text contains, after word segmentation, (A, B, C, D, E, F, G), where each letter represents a word; a word is of course not limited to a letter and may be a Chinese phrase, an English word, and so on. A-G are assigned 1-7 respectively as indexes; the indexes could also be assigned at random rather than incrementally from 1. The sentence index of the sentence containing the words (A, B, C, D, E, F, G) is then the vector [1 2 3 4 5 6 7].
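Steps 102 and 103 can be sketched as follows (an illustrative toy with function names of our own choosing); indexes are assigned from 1 in order of first appearance, as in the example above:

```python
def build_word_index(sentences):
    """Step 102: assign each distinct word a gradually increasing index
    starting from 1, in order of first appearance across the text."""
    index = {}
    for sentence in sentences:
        for word in sentence:
            if word not in index:
                index[word] = len(index) + 1
    return index

def sentence_index(sentence, index):
    """Step 103: combine the word indexes of a sentence into its index vector."""
    return [index[w] for w in sentence]

text = [["A", "B", "C", "D", "E", "F", "G"], ["F", "G", "U", "P"]]
index = build_word_index(text)
print(sentence_index(text[0], index))  # [1, 2, 3, 4, 5, 6, 7]
print(sentence_index(text[1], index))  # [6, 7, 8, 9]
```

In the patent's later example, sentence b maps to [6 7 17 18] because the words U and P first appear only after many other words of the full text; in this two-sentence toy they receive 8 and 9.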
Once the sentence indexes are obtained, step 104 can be performed: the sentence indexes are fed into the deep learning classification model to obtain the tag value corresponding to the deep learning classification information of each sentence.

Consider first the tag values. The tag values are preset when the deep learning classification model is trained; for example, "1" denotes a sports text, "2" a science and technology text, "3" a literature and art text, "4" a military text, and so on. The historical texts used for training are labeled according to their respective classification information; for example, all sports history texts are uniformly labeled "1", all science and technology history texts are uniformly labeled "2", and so on. The deep learning model trained in this way thus maps text information to tag values.
With the tag values defined, continuing the example above: suppose the text contains sentence a: (A, B, C, D, E, F, G). Generally a text contains many other sentences as well, for example sentence b: (F, G, U, P) and sentence c: (M, H, I, U, T, S). The sentence index corresponding to sentence b might be [6 7 17 18], and the sentence index corresponding to sentence c might be [504 505 506 17 508 509]. Step 104 feeds these three sentence indexes, [1 2 3 4 5 6 7], [6 7 17 18], and [504 505 506 17 508 509], into the deep learning classification model, which correspondingly gives the tag values of the three sentences' classification information. For example, sentence a is given the tag value "1", sentence b the tag value "2", and sentence c the tag value "2"; that is, sentence a is a sports sentence, while sentences b and c are science and technology sentences.
In step 105, the tag values corresponding to the classification information of the text's sentences are counted, and the most frequent tag value is taken as the tag value corresponding to the deep learning classification information of the text, which serves as the tag value of the text's classification information.

Continuing the previous example, the text has three sentences a, b, and c in total, among which the tag value "1" occurs once and the tag value "2" occurs twice; the final tag value of the text's classification information is therefore "2", i.e. the text is a science and technology text.
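The vote in step 105 is a simple majority count over the per-sentence tags; a sketch (the Counter-based form is our own choice):

```python
from collections import Counter

def text_tag_value(sentence_tags):
    """Step 105: the most frequent per-sentence tag value becomes
    the tag value of the whole text."""
    return Counter(sentence_tags).most_common(1)[0][0]

# Sentences a, b, c were tagged "1", "2", "2" -> the text is class "2".
print(text_tag_value(["1", "2", "2"]))  # 2
```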
The deep learning classification model is obtained by training on historical texts. The training process, shown in Fig. 2, proceeds with the classes of the historical texts already known, and includes:
201: performing word segmentation on each sentence of the historical text;
202: indexing each word in the historical text to obtain corresponding historical word indexes;
203: forming the historical word indexes of each sentence into a vector that serves as the historical sentence index of that sentence;
204: according to the classification information of the historical text, training on the historical sentence indexes with a deep learning algorithm to obtain the deep learning classification model.
When the classification model is trained with deep learning technology, the sentence indexes first undergo word embedding: each numerical value in each index vector is mapped into a multi-dimensional vector space, yielding a multi-dimensional representation of every value in the vector and hence a multi-dimensional representation of the index vector.

As an example of word embedding, take a sentence containing the words "A B C D E F G", with corresponding sentence index vector [1 2 3 4 5 6 7]; each numerical element of this vector is to be given a vector representation. Applying word embedding to each element of [1 2 3 4 5 6 7] might finally yield, say, the vector [0.1 0.6 -0.5] for element "1", the vector [-0.2 0.9 0.7] for element "2", and so on. The reason for turning each index value into a vector is again computational convenience: for instance, "find a synonym of word A" can be accomplished as "find the vector most similar, under cosine distance, to the multi-dimensional vector corresponding to word A".

In a preferred embodiment, each index value is expanded to 4 dimensions, 128 dimensions, or the like by word embedding.
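A toy illustration of the embedding lookup and the cosine-similarity synonym search mentioned above; the table here is random, whereas in the real model the vectors are learned during training:

```python
import random

def build_embedding_table(vocab_size, dim, seed=42):
    """One dim-dimensional vector per index value 1..vocab_size (toy: random)."""
    rng = random.Random(seed)
    return {i: [rng.uniform(-1, 1) for _ in range(dim)] for i in range(1, vocab_size + 1)}

def embed(index_vector, table):
    """Replace each index in a sentence-index vector with its embedding."""
    return [table[i] for i in index_vector]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / ((sum(a * a for a in u) ** 0.5) * (sum(b * b for b in v) ** 0.5))

table = build_embedding_table(vocab_size=7, dim=3)
embedded = embed([1, 2, 3, 4, 5, 6, 7], table)
print(len(embedded), len(embedded[0]))  # 7 3 -> seven 3-dimensional vectors
# "Synonym of word 1": the other index whose vector is closest under cosine similarity.
print(max((i for i in table if i != 1), key=lambda i: cosine(table[1], table[i])))
```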
Step 105 in Fig. 1 is a voting process. Specifically, although the classification information of each sentence in the text characterizes the classification information of the whole text to some extent, the emphasis of each sentence differs, and no single sentence can completely characterize the classification information of the whole text. After the tag value corresponding to the deep learning classification information of each sentence is obtained, step 105 counts the occurrences of the tag values and takes the most frequent tag value as the tag value corresponding to the deep learning classification information of the text, which serves as the tag value of the text's classification information.
In another preferred embodiment of the present invention, the text is simultaneously classified with a prior-art SVM classifier and Bayes classifier, and the deep learning classification result, the SVM classification result, and the Bayes classification result are finally given respective weights to obtain the final classification result. This embodiment fuses prior-art classification methods with the classification method of the present invention, making full use of the advantages of the different methods.

Since deep learning classification has advantages of its own, in a preferred embodiment the weight of the deep learning classification result is set higher than the weights of the SVM and Bayes classifications. On this basis, the weight of the SVM classification result may also be set higher than that of the Bayes classification result.

In another embodiment, the deep learning, SVM, and Bayes classification results are given weights of 0.5, 0.3, and 0.2 respectively, so that the deep learning classification result plays the decisive role.
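The 0.5 / 0.3 / 0.2 weighting can be read as a weighted vote over the three tag values; a sketch under that reading (the function name and tie handling are our assumptions, not the patent's):

```python
def ensemble_tag(dl_tag, svm_tag, bayes_tag, weights=(0.5, 0.3, 0.2)):
    """Sum each classifier's weight onto the tag value it voted for
    and return the tag with the largest total."""
    scores = {}
    for tag, w in zip((dl_tag, svm_tag, bayes_tag), weights):
        scores[tag] = scores.get(tag, 0.0) + w
    return max(scores, key=scores.get)

print(ensemble_tag("2", "2", "3"))  # 2  (0.5 + 0.3 beats 0.2)
print(ensemble_tag("1", "2", "3"))  # 1  (on full disagreement the 0.5 weight decides)
```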
It should be noted that in the present invention the text may be classified with the SVM classifier alone, or with the Bayes classifier alone, or by other existing means, or by two or more other existing means; the deep learning classification result and the prior-art classification results are then each given a weight to obtain the final classification result.
Multiple prior-art techniques can be applied to the word segmentation in step 101; in one embodiment, the reverse maximum matching algorithm and/or the Viterbi algorithm is used.

In one embodiment, the stop words in the text are removed before word segmentation. Stop words are words that occur very frequently in a text but carry little practical meaning, chiefly adverbs, function words, modal particles, and the like, such as "is" and "but". Many stop word dictionaries already exist in the prior art; the stop words in the text can be removed simply by comparison against such a dictionary.
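Stop-word removal by dictionary comparison can be sketched as follows; the stop list here is a toy, whereas real stop-word dictionaries contain hundreds of entries:

```python
STOP_WORDS = {"是", "的", "了", "吗", "但是"}  # toy stop list

def remove_stop_words(words, stop_words=STOP_WORDS):
    """Drop high-frequency, low-content words (adverbs, function words,
    modal particles) before the text is segmented and indexed."""
    return [w for w in words if w not in stop_words]

print(remove_stop_words(["这", "是", "一个", "例子"]))  # ['这', '一个', '例子']
```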
A specific example of the method is as follows:
a) the model is trained on the Sogou corpus;
b) the ten categories of the Sogou corpus are segmented into words, 780,000 words in total;
c) word indexing: in this example, word frequency statistics are computed over these 780,000 words and sorted; the sorted results and their ranks are obtained, and each word's rank is used as its index number. Words with identical frequency are assigned consecutive index numbers at random among themselves. Index 780001 is reserved;
d) sentence indexing: every 150 consecutive words are treated as one sentence, the last sentence being cut off if its words exceed 150; the sentences are then indexed by the preceding method, converting the text information into numerical information;
e) deep learning classification model training: the obtained sentence indexes are expanded with word embedding technology, and the expanded high-dimensional representations are fed into an LSTM model for training, yielding the deep learning classification model;
f) indexing of the text to be classified: indexes can be assigned directly at random; alternatively, after segmenting the text to be classified, each of its words is semantically matched against the words of the Sogou corpus, and the index of the matching corpus word is used as the index of the word in the text to be classified; when no corpus word matches, the word's index is set to 780001;
g) the sentence indexes of the text to be classified are obtained by the preceding method and fed into the deep learning classification model to obtain the classification information index of each sentence; the voting mechanism then selects the most frequent sentence classification information index, which is the classification information of the whole text.
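Two of the steps above lend themselves to short sketches: the 150-word pseudo-sentences of step d) and the reserved-index fallback of step f). Function names, and the choice to keep the unpadded remainder as the last chunk, are our assumptions:

```python
OOV_INDEX = 780001  # the reserved index of step c) for words absent from the corpus

def chunk_into_sentences(words, size=150):
    """Step d): treat every 150 consecutive words as one sentence;
    the final chunk keeps whatever remainder is left."""
    return [words[i:i + size] for i in range(0, len(words), size)]

def index_word(word, corpus_index):
    """Step f): a word's index comes from the corpus index;
    unmatched words map to the reserved index 780001."""
    return corpus_index.get(word, OOV_INDEX)

chunks = chunk_into_sentences(list(range(380)), size=150)
print([len(c) for c in chunks])  # [150, 150, 80]
corpus_index = {"足球": 1, "比赛": 2}
print([index_word(w, corpus_index) for w in ["足球", "量子"]])  # [1, 780001]
```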
The method provided by the present invention obtains an index vector through the word segmentation and word indexing operations; this numerical expression of the sentence index makes the use of deep learning technology possible. Meanwhile, the use of deep learning technology optimizes the feature representation of the text information, so that the classification results for the text are also better than those of conventional techniques. Furthermore, in order to exploit the existing SVM and Bayes classification methods, the present invention also classifies simultaneously with the SVM and Bayes classification methods, and finally weights the classification results obtained by the three classes of methods to derive the classification information of the final text.
Corresponding to the above text classification method, the present invention also provides a text classification device. In one embodiment, referring to Fig. 3, each sentence of the text to be classified is first segmented with word segmentation module 301, which may include a reverse maximum matching unit and/or a Viterbi unit performing segmentation with the reverse maximum matching technique and/or the Viterbi technique respectively.

In one embodiment, the device also includes a sentence-splitting module that splits the text into sentences at question marks, exclamation marks, or full stops.
The segmented words are then indexed with indexation module 302; as before, the words are given gradually increasing index numbers starting from 1 in their order of appearance. Indexation module 302 then combines the word indexes of each sentence into an index vector that serves as the sentence index of that sentence. All the sentence indexes are subsequently fed into classification module 303 to obtain the classification information of each sentence, after which statistics module 304 counts the sentences' tag values and takes the most frequent tag value as the deep learning classification result of the text. Meanwhile, all the sentence indexes are also fed into SVM classification module 305 and Bayes classification module 306 for text classification.
The principle of Bayes classification is to compute, from the prior probability of an object, its posterior probability by means of the Bayes formula, i.e. the probability that the object belongs to a given class, and to select the class with the maximum posterior probability as the class of the object. In other words, a Bayes classifier is optimal in the sense of minimum error rate. Four kinds of Bayes classifiers are currently the most studied, namely Naive Bayes, TAN (tree-augmented Naive Bayes), BAN (Bayesian-network-augmented Naive Bayes), and GBN (general Bayesian network). Any one of them may be used in the present invention.
The main idea of the SVM can be summarized in two points: (1) it is analysed for the linearly separable case; for the linearly inseparable case, a non-linear mapping is used to transform the linearly inseparable samples of the low-dimensional input space into a high-dimensional feature space in which they become linearly separable, so that a linear analysis of the non-linear characteristics of the samples can be carried out in the high-dimensional feature space using a linear algorithm; (2) based on structural risk minimization theory, it constructs the optimal separating hyperplane in the feature space, so that the learner is globally optimized and the expected risk over the whole sample space satisfies a certain upper bound with a certain probability. Packaged SVM classification modules already exist in the prior art and can be called directly.
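The two ideas above (a separating hyperplane, found by minimizing a regularized hinge loss) can be illustrated with a tiny linear SVM trained by stochastic subgradient descent. This is a toy sketch on 2-D points, not the packaged SVM module the device would call, and it omits the kernel mapping used for linearly inseparable data:

```python
import random

def train_linear_svm(points, labels, epochs=200, lr=0.1, lam=0.01):
    """Tiny linear SVM via stochastic subgradient descent on the hinge
    loss; finds a separating hyperplane w·x + b = 0."""
    random.seed(0)  # deterministic for illustration
    w = [0.0, 0.0]
    b = 0.0
    data = list(zip(points, labels))
    for _ in range(epochs):
        random.shuffle(data)
        for (x1, x2), y in data:
            margin = y * (w[0] * x1 + w[1] * x2 + b)
            if margin < 1:  # inside the margin: hinge loss is active
                w[0] += lr * (y * x1 - lam * w[0])
                w[1] += lr * (y * x2 - lam * w[1])
                b += lr * y
            else:           # outside: only the regularizer contributes
                w[0] -= lr * lam * w[0]
                w[1] -= lr * lam * w[1]
    return w, b

def classify(w, b, x):
    return 1 if w[0] * x[0] + w[1] * x[1] + b >= 0 else -1

# Linearly separable toy data: class +1 upper-right, -1 lower-left.
pts = [(2, 2), (3, 3), (-2, -2), (-3, -1)]
ys = [1, 1, -1, -1]
w, b = train_linear_svm(pts, ys)
print(classify(w, b, (2.5, 2.5)), classify(w, b, (-2.5, -2.5)))
```

In practice a packaged implementation with kernel support would be used, as the description itself notes.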
The statistics module 304, the SVM classification module 305, and the Bayes classification module 306 each compute their own classification result. The classification result of the statistics module 304 is then given weight 1, the classification result of the SVM classification module 305 weight 2, and the classification result of the Bayes classification module 306 weight 3, from which the final text classification result is obtained. In one embodiment, the weight 1 given by the weight module is greater than weight 2 and also greater than weight 3, and weight 2 is greater than weight 3.
Weight 1, weight 2, and weight 3 here may be adjusted appropriately according to actual usage. In one embodiment, weight 1, weight 2, and weight 3 are 0.5, 0.3, and 0.2, respectively.
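The weighted combination of the three classifiers' outputs can be sketched as a weighted vote; the `weighted_vote` helper and category labels below are illustrative:

```python
def weighted_vote(labels, weights):
    """Combine classifier outputs: sum the weight behind each predicted
    label and return the label with the highest total."""
    scores = {}
    for label, weight in zip(labels, weights):
        scores[label] = scores.get(label, 0.0) + weight
    return max(scores, key=scores.get)

# Deep-learning, SVM, and Bayes predictions with weights 0.5, 0.3, 0.2:
print(weighted_vote(["sports", "finance", "sports"], [0.5, 0.3, 0.2]))
# sports  (0.5 + 0.2 = 0.7 beats 0.3)
```

With the 0.5/0.3/0.2 weighting, the deep-learning result prevails unless the SVM and Bayes classifiers agree against it, which is consistent with the ordering of weights given in the embodiment.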
The classification module 303 actually classifies the text using the deep-learning classification model it contains, and the deep-learning classification model is obtained by the training module through training on history texts. Corresponding to the foregoing method, the word segmentation module 301 is further configured to perform word segmentation on each sentence in the history texts; the indexing module 302 is further configured to index each word in the history texts to obtain the corresponding history word indices, and to form the history word indices of each sentence into a vector as the history sentence index of that sentence. The history sentence indices are fed into the training module for deep-learning-based training, which yields the deep-learning classification model.
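The preparation of training data from history texts described above can be sketched as follows. For brevity this sketch tokenizes on whitespace as a stand-in for the dictionary segmenter, and it stops at producing (sentence index vector, label) pairs; the deep-learning training step itself, which the patent does not pin to a particular network, is not shown:

```python
import re

def build_training_pairs(history_texts, labels):
    """Turn labelled history texts into (sentence index vector, label)
    pairs: split each text into sentences, tokenize, and index words
    from 1 upward in order of first appearance, sharing one word index
    across all history texts."""
    index = {}
    pairs = []
    for text, label in zip(history_texts, labels):
        for sentence in re.split(r"[?!.？！。]+", text):
            words = sentence.split()
            if not words:
                continue
            vector = []
            for w in words:
                if w not in index:
                    index[w] = len(index) + 1
                vector.append(index[w])
            pairs.append((vector, label))
    return pairs, index

pairs, index = build_training_pairs(
    ["the match was great. the team won.", "rates rose again."],
    ["sports", "finance"])
print(pairs)
# [([1, 2, 3, 4], 'sports'), ([1, 5, 6], 'sports'), ([7, 8, 9], 'finance')]
```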
In one embodiment, the device further includes a stop-word removal module containing a number of existing stop-word dictionaries; before the text is segmented, the stop words in the text are removed according to the stop-word dictionaries.
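Stop-word removal against such a dictionary reduces to a simple filter; the helper and the tiny stop-word set below are illustrative:

```python
def remove_stopwords(words, stopwords):
    """Drop words that appear in the stop-word dictionary, so they are
    never indexed or classified."""
    return [w for w in words if w not in stopwords]

# Toy stop-word dictionary; real modules would load full stop-word lists.
stopwords = {"的", "了", "the", "a"}
print(remove_stopwords(["the", "quick", "fox"], stopwords))
# ['quick', 'fox']
```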
As described above, the present invention segments the text and indexes the words to obtain sentence indices, i.e. digitized sentence index values, so that the sentence indices can be fed into the deep-learning classification model to obtain the deep-learning classification information. The deep-learning classification model itself is obtained by training on history text information through the same process.
To take advantage of existing SVM and Bayes classification methods, classification is also performed with the SVM classification method and the Bayes classification method, and the classification results obtained by the three classification methods are given weights to obtain the final classification information of the text, which increases the fault tolerance of the classification method.
The preceding description of the disclosure is provided to enable any person skilled in the art to make or use the disclosure. Various modifications of the disclosure will be apparent to those skilled in the art, and the generic principles defined herein may be applied to other variants without departing from the spirit or scope of the disclosure. Thus, the disclosure is not intended to be limited to the examples and designs described herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (16)
1. A text classification method, characterized in that the method comprises:
performing word segmentation on each sentence in a text to be classified;
indexing each word in the text to obtain corresponding word indices;
combining the word indices of each sentence into an index vector, as the sentence index of that sentence;
feeding the sentence indices into a deep-learning classification model to obtain the tag value corresponding to the deep-learning classification information of each sentence;
counting the occurrences of the tag values, and taking the most frequent tag value as the tag value corresponding to the deep-learning classification information of the text, the tag value corresponding to the deep-learning classification information serving as the tag value of the classification information of the text.
2. The method of claim 1, characterized in that the training method of the deep-learning classification model comprises:
performing word segmentation on each sentence in history texts;
indexing each word in the history texts to obtain corresponding history word indices;
forming the history word indices in each sentence into a vector, as the history sentence index of that sentence;
performing training based on a deep-learning algorithm on the history sentence indices according to the classification information of the history texts, to obtain the deep-learning classification model.
3. The method of claim 1, characterized in that the sentences are formed by splitting the text at the question marks, exclamation marks, or full stops in the text.
4. The method of claim 1, characterized in that the method further comprises:
feeding the sentence indices into an SVM classifier and a Bayes classifier, respectively, to classify the text, and obtaining the tag value corresponding to the SVM classification information of the text and the tag value corresponding to the Bayes classification information of the text;
giving weights, respectively, to the tag value corresponding to the classification information of the text, the tag value corresponding to the SVM classification information of the text, and the tag value corresponding to the Bayes classification information of the text, and deriving the tag value corresponding to the final classification information of the text according to the weights.
5. The method of claim 4, characterized in that giving weights, respectively, to the tag value corresponding to the classification information of the text, the tag value corresponding to the SVM classification information of the text, and the tag value corresponding to the Bayes classification information of the text comprises:
making the weight given to the tag value corresponding to the classification information of the text greater than the weight of the tag value corresponding to the SVM classification information of the text, and the weight of the tag value corresponding to the SVM classification information of the text greater than the weight of the tag value corresponding to the Bayes classification information of the text.
6. The method of claim 5, characterized in that giving weights, respectively, to the tag value corresponding to the classification information of the text, the tag value corresponding to the SVM classification information of the text, and the tag value corresponding to the Bayes classification information of the text comprises:
giving weights of 0.5, 0.3, and 0.2, respectively, to the tag value corresponding to the classification information of the text, the tag value corresponding to the SVM classification information of the text, and the tag value corresponding to the Bayes classification information of the text.
7. The method of claim 1, characterized in that performing word segmentation on each sentence in the text to be classified comprises:
performing word segmentation using a reverse maximum matching algorithm or a Viterbi algorithm.
8. The method of claim 1, characterized in that before the word segmentation, the method further comprises:
removing stop words from the text.
9. A text classification device, characterized in that the device comprises:
a word segmentation module, configured to perform word segmentation on each sentence in a text to be classified;
an indexing module, configured to index each word in the text to obtain corresponding word indices, and to combine the word indices of each sentence into an index vector as the sentence index of that sentence;
a classification module, configured to feed the sentence indices into a deep-learning classification model and output the tag value corresponding to the deep-learning classification information of each sentence output by the deep-learning classification model;
a statistics module, configured to count the occurrences of the tag values and take the most frequent tag value as the tag value corresponding to the deep-learning classification information of the text, the tag value corresponding to the deep-learning classification information serving as the tag value of the classification information of the text.
10. The device of claim 9, characterized in that:
the word segmentation module is further configured to perform word segmentation on each sentence in history texts;
the indexing module is further configured to index each word in the history texts to obtain corresponding history word indices, and to form the history word indices in each sentence into a vector as the history sentence index of that sentence;
the device further comprises:
a training module, configured to perform training based on a deep-learning algorithm on the history sentence indices according to the classification information of the history texts, to obtain the deep-learning classification model.
11. The device of claim 9, characterized in that the device further comprises:
a sentence-splitting module, configured to split the text into sentences according to the question marks, exclamation marks, or full stops in the text.
12. The device of claim 9, characterized in that the classification module further comprises:
an SVM classification unit, configured to feed the sentence indices into an SVM classifier and obtain the tag value corresponding to the SVM classification information of the text;
a Bayes classification unit, configured to feed the sentence indices into a Bayes classifier and obtain the tag value corresponding to the Bayes classification information of the text;
the device further comprising:
a weight module, configured to give weights, respectively, to the tag value corresponding to the deep-learning classification information of the text, the tag value corresponding to the SVM classification information of the text, and the tag value corresponding to the Bayes classification information of the text, and to derive the tag value corresponding to the final classification information of the text according to the weights.
13. The device of claim 12, characterized in that the weight given by the weight module to the tag value corresponding to the classification information of the text is greater than the weight of the tag value corresponding to the SVM classification information of the text, and the weight of the tag value corresponding to the SVM classification information of the text is greater than the weight of the tag value corresponding to the Bayes classification information of the text.
14. The device of claim 13, characterized in that the weight module is further configured to:
give weights of 0.5, 0.3, and 0.2, respectively, to the tag value corresponding to the deep-learning classification information of the text, the tag value corresponding to the SVM classification information of the text, and the tag value corresponding to the Bayes classification information of the text, and derive the tag value corresponding to the final classification information of the text according to the weights.
15. The device of claim 9, characterized in that the word segmentation module further comprises one or both of the following units:
a reverse maximum matching segmentation unit, which performs word segmentation using reverse maximum matching;
a Viterbi segmentation unit, which performs word segmentation using the Viterbi algorithm.
16. The device of claim 9, characterized in that the device further comprises:
a stop-word removal module, configured to remove stop words from the text.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610976847.2A CN106528776A (en) | 2016-11-07 | 2016-11-07 | Text classification method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106528776A true CN106528776A (en) | 2017-03-22 |
Family
ID=58350249
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610976847.2A Pending CN106528776A (en) | 2016-11-07 | 2016-11-07 | Text classification method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106528776A (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107038152A (en) * | 2017-03-27 | 2017-08-11 | 成都优译信息技术股份有限公司 | Text punctuate method and system for drawing typesetting |
CN107403375A (en) * | 2017-04-19 | 2017-11-28 | 北京文因互联科技有限公司 | A kind of listed company's bulletin classification and abstraction generating method based on deep learning |
CN107657313A (en) * | 2017-09-26 | 2018-02-02 | 上海数眼科技发展有限公司 | The transfer learning system and method for the natural language processing task adapted to based on field |
CN108376287A (en) * | 2018-03-02 | 2018-08-07 | 复旦大学 | Multi-valued attribute segmenting device based on CN-DBpedia and method |
CN108829818A (en) * | 2018-06-12 | 2018-11-16 | 中国科学院计算技术研究所 | A kind of file classification method |
CN108875024A (en) * | 2018-06-20 | 2018-11-23 | 清华大学深圳研究生院 | File classification method, system, readable storage medium storing program for executing and electronic equipment |
CN109522407A (en) * | 2018-10-26 | 2019-03-26 | 平安科技(深圳)有限公司 | Business connection prediction technique, device, computer equipment and storage medium |
CN109858006A (en) * | 2017-11-30 | 2019-06-07 | 亿度慧达教育科技(北京)有限公司 | Subject recognition training method, apparatus |
CN110309251A (en) * | 2018-03-12 | 2019-10-08 | 北京京东尚科信息技术有限公司 | Processing method, device and the computer readable storage medium of text data |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103902570A (en) * | 2012-12-27 | 2014-07-02 | 腾讯科技(深圳)有限公司 | Text classification feature extraction method, classification method and device |
CN105243130A (en) * | 2015-09-29 | 2016-01-13 | 中国电子科技集团公司第三十二研究所 | Text processing system and method for data mining |
CN105930503A (en) * | 2016-05-09 | 2016-09-07 | 清华大学 | Combination feature vector and deep learning based sentiment classification method and device |
Legal Events
Date | Code | Title | Description
---|---|---|---
| C06 | Publication | |
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20170322 |