CN106528776A - Text classification method and device - Google Patents

Text classification method and device

Info

Publication number
CN106528776A
CN106528776A (application CN201610976847.2A)
Authority
CN
China
Prior art keywords
text
sentence
tag value
corresponding tag
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610976847.2A
Other languages
Chinese (zh)
Inventor
贾祯
白杨
朱频频
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Zhizhen Intelligent Network Technology Co Ltd
Original Assignee
Shanghai Zhizhen Intelligent Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Zhizhen Intelligent Network Technology Co Ltd filed Critical Shanghai Zhizhen Intelligent Network Technology Co Ltd
Priority to CN201610976847.2A priority Critical patent/CN106528776A/en
Publication of CN106528776A publication Critical patent/CN106528776A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval of unstructured textual data
    • G06F 16/31 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a text classification method. The method comprises the following steps: performing word segmentation on each sentence in a text to be classified; indexing each word in the text to obtain corresponding word indexes; combining the word indexes of each sentence into an index vector that serves as the sentence index of that sentence; inputting the sentence indexes into a deep learning classification model to obtain the tag value corresponding to the deep learning classification information of each sentence; and counting the occurrences of the tag values and taking the most frequent tag value as the tag value corresponding to the deep learning classification information of the text, which serves as the tag value of the text classification information. The invention further provides a device corresponding to the method.

Description

A method and apparatus for text classification
Technical field
The present invention relates to a method and apparatus for text classification, and more particularly to a method and apparatus for classifying text using deep learning technology.
Background technology
In natural language processing, text classification is particularly important as the basis of applications such as content classification, sentiment analysis and topic identification.
Text classification first requires segmenting the text into words according to its content and converting the words into vector representations. The prior art includes SVM (support vector machine), Logistic (logistic regression), Random Forest, Bayes (Bayesian) and KNN (k-nearest neighbor) methods. SVM, logistic regression and random forest are high-dimensional discriminative models based on word vectors and depend heavily on features; the Bayes and KNN models are based on statistics. The main problem of the high-dimensional discriminative models is that the vector representation cannot fully characterize the semantic information of the text, while determining the decision boundary of a Bayesian model is extremely difficult.
Summary of the invention
It is an object of the present invention to provide a method and apparatus for text classification that, using deep learning technology, can obtain a better feature representation of the segmented text than the conventional art and thereby achieve better classification results.
To this end, the present invention provides a text classification method, the method comprising: performing word segmentation on each sentence in a text to be classified; indexing each word in the text to obtain corresponding word indexes; combining the word indexes of each sentence into an index vector serving as the sentence index of that sentence; inputting the sentence indexes into a deep learning classification model to obtain the tag value corresponding to the deep learning classification information of each sentence; and counting the occurrences of the tag values and taking the most frequent tag value as the tag value corresponding to the deep learning classification information of the text, which serves as the tag value of the text classification information.
In one embodiment, the training method of the deep learning classification model includes: performing word segmentation on each sentence in a historical text; indexing each word in the historical text to obtain corresponding historical word indexes; forming the historical word indexes of each sentence into a vector serving as the historical sentence index of that sentence; and training on the historical sentence indexes with a deep learning algorithm according to the historical text classification information to obtain the deep learning classification model.
In one embodiment, the sentences are obtained by splitting the text at question marks, exclamation marks or full stops.
In one embodiment, the method further includes: inputting the sentence indexes respectively into an SVM classifier and a Bayes classifier to classify the text, obtaining the tag value corresponding to the SVM classification information of the text and the tag value corresponding to the Bayes classification information of the text; assigning weights respectively to the tag value corresponding to the deep learning classification information of the text, the tag value corresponding to the SVM classification information of the text and the tag value corresponding to the Bayes classification information of the text; and deriving the tag value corresponding to the final classification information of the text according to the weights.
In one embodiment, assigning the weights includes: giving the tag value corresponding to the deep learning classification information of the text a weight greater than that of the tag value corresponding to the SVM classification information of the text, and giving the tag value corresponding to the SVM classification information of the text a weight greater than that of the tag value corresponding to the Bayes classification information of the text.
In one embodiment, assigning the weights includes: giving the tag value corresponding to the deep learning classification information of the text, the tag value corresponding to the SVM classification information of the text and the tag value corresponding to the Bayes classification information of the text weights of 0.5, 0.3 and 0.2 respectively.
In one embodiment, performing word segmentation on each sentence in the text to be classified includes: segmenting words with a reverse maximum matching algorithm or a Viterbi algorithm.
In one embodiment, before the word segmentation, the method further includes: removing the stop words in the text.
The present invention also provides a text classification device, the device comprising: a word segmentation module for performing word segmentation on each sentence in a text to be classified; an indexing module for indexing each word in the text to obtain corresponding word indexes and for combining the word indexes of each sentence into an index vector serving as the sentence index of that sentence; a classification module for inputting the sentence indexes into a deep learning classification model and outputting the tag value corresponding to the deep learning classification information of each sentence; and a statistics module for counting the occurrences of the tag values and taking the most frequent tag value as the tag value corresponding to the deep learning classification information of the text, which serves as the tag value of the text classification information.
In one embodiment, the word segmentation module is further used to perform word segmentation on each sentence in a historical text; the indexing module is further used to index each word in the historical text to obtain corresponding historical word indexes and to form the historical word indexes of each sentence into a vector serving as the historical sentence index of that sentence; and the device further includes a training module for training on the historical sentence indexes with a deep learning algorithm according to the historical text classification information to obtain the deep learning classification model.
In one embodiment, the device further includes a sentence splitting module for dividing the text into sentences at question marks, exclamation marks or full stops.
In one embodiment, the classification module further includes: an SVM classification unit for inputting the sentence indexes into an SVM classifier and obtaining the tag value corresponding to the SVM classification information of the text; and a Bayes classification unit for inputting the sentence indexes into a Bayes classifier and obtaining the tag value corresponding to the Bayes classification information of the text. The device further includes a weighting module for assigning weights respectively to the tag value corresponding to the deep learning classification information of the text, the tag value corresponding to the SVM classification information of the text and the tag value corresponding to the Bayes classification information of the text, and for deriving the tag value corresponding to the final classification information of the text according to the weights.
In one embodiment, the weighting module gives the tag value corresponding to the deep learning classification information of the text a weight greater than that of the tag value corresponding to the SVM classification information of the text, and gives the tag value corresponding to the SVM classification information of the text a weight greater than that of the tag value corresponding to the Bayes classification information of the text.
In one embodiment, the weighting module is further used to give the tag value corresponding to the deep learning classification information of the text, the tag value corresponding to the SVM classification information of the text and the tag value corresponding to the Bayes classification information of the text weights of 0.5, 0.3 and 0.2 respectively, and to derive the tag value corresponding to the final classification information of the text according to the weights.
In one embodiment, the word segmentation module further includes one or both of the following units: a reverse maximum matching unit that segments words with reverse maximum matching; and a Viterbi unit that segments words with a Viterbi algorithm.
In one embodiment, the device further includes a stop-word removal module for removing the stop words in the text.
As described above, by segmenting the text, indexing the words and then deriving sentence indexes, the present invention produces a numerical representation of each sentence that forms the basis for applying a deep learning classification model, so that classification can be performed with the deep learning classification model provided by the invention. The superiority of deep learning technology in representing features of text content makes the classification results of the present invention clearly better than those of the conventional art.
Description of the drawings
Fig. 1 is a flow chart of one embodiment of the text classification method of the present invention;
Fig. 2 is a flow chart of the training of the deep learning classification model;
Fig. 3 is a schematic diagram of one embodiment of the text classification device of the present invention.
Specific embodiment
The present invention first segments and indexes the words of each sentence, thereby deriving an index for each sentence. Based on these numerical sentence indexes, classification can be performed with a deep learning classification model as well as with prior art classifiers such as an SVM classifier and a Bayes classifier. The shallow semantic model of deep learning iteratively revises the feature representation of the text through iteration and the BPTT (backpropagation through time) algorithm, thereby reaching an optimal feature representation. By repeatedly modifying the neural network weights with the BPTT algorithm, a classification model with the best feature-based expression is established, so that a model superior to traditional methods can be obtained. Meanwhile, by setting weights, the present invention also combines existing classification methods with the deep learning classification method, further improving the fault tolerance of the classification.
Referring to Fig. 1, which is a flow chart of the text classification method of the present invention, the method includes:
101: performing word segmentation on each sentence in the text to be classified;
102: indexing each word in the text to obtain corresponding word indexes;
103: combining the word indexes of each sentence into an index vector serving as the sentence index of that sentence;
104: inputting the sentence indexes into a deep learning classification model to obtain the tag value corresponding to the deep learning classification information of each sentence;
105: counting the occurrences of the tag values and taking the most frequent tag value as the tag value corresponding to the deep learning classification information of the text, which serves as the tag value of the text classification information.
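Steps 101-105 can be sketched end to end in Python. The tokenizer, vocabulary and sentence-level classifier below are toy stand-ins (a real implementation would use a Chinese word segmenter and the trained deep learning model of step 104):

```python
from collections import Counter

def tokenize(sentence):
    # Toy whitespace tokenizer standing in for a real Chinese segmenter.
    return sentence.split()

def classify_text(sentences, vocab_index, sentence_classifier):
    """Steps 101-105: segment, index, classify per sentence, then vote."""
    votes = []
    for s in sentences:
        words = tokenize(s)                           # step 101
        idx = [vocab_index.get(w, 0) for w in words]  # steps 102-103
        votes.append(sentence_classifier(idx))        # step 104
    # Step 105: the most frequent sentence label becomes the text label.
    return Counter(votes).most_common(1)[0][0]

vocab = {"goal": 1, "match": 2, "chip": 3, "robot": 4}

def toy_model(idx):
    # Stand-in for the deep learning classification model.
    return "tech" if 3 in idx or 4 in idx else "sports"

label = classify_text(["goal match", "robot chip", "chip match"], vocab, toy_model)
# label == "tech": two of the three sentences vote "tech"
```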
A text is composed of sentences, and the most basic units of a sentence are words. Step 101 is therefore performed first: each sentence in the text to be classified is segmented into words, and each resulting word reflects, to some extent, the classification attributes of the text.
The prior art includes various word segmentation methods, chiefly maximum matching and the Viterbi algorithm.
Maximum matching takes a dictionary as its basis: the length of the longest word in the dictionary determines the length of the string first taken from the text and looked up in the dictionary (to improve lookup efficiency, several dictionaries can be built according to word length, each string being looked up in the dictionary of the corresponding length). For example, if the longest word in the dictionary is "中华人民共和国" (People's Republic of China) with 7 Chinese characters, then maximum matching starts with strings of 7 characters, decreasing character by character and looking each candidate up in the corresponding dictionary.
According to the scanning direction, string-matching segmentation can be divided into forward matching and reverse matching; according to which length is matched first, into maximum (longest) matching and minimum (shortest) matching; and according to whether it is combined with part-of-speech tagging, into plain segmentation methods and integrated methods combining segmentation with tagging. Several common mechanical segmentation methods are as follows:
1) forward maximum matching (scanning left to right);
2) reverse maximum matching (scanning right to left);
3) minimum segmentation (minimizing the number of words cut out of each sentence).
Furthermore, the above methods can be combined with one another; for example, forward maximum matching and reverse maximum matching can be combined to form a bidirectional matching method.
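Reverse maximum matching, one of the methods listed above, can be sketched as follows. The dictionary and the maximum word length of 7 are illustrative (matching the example in the text); a production segmenter would use a full lexicon:

```python
def reverse_max_match(text, dictionary, max_len=7):
    """Reverse maximum matching: scan right to left, greedily taking the
    longest dictionary word that ends at the current position; unmatched
    single characters fall through as one-character words."""
    words, end = [], len(text)
    while end > 0:
        for length in range(min(max_len, end), 0, -1):
            cand = text[end - length:end]
            if length == 1 or cand in dictionary:
                words.append(cand)
                end -= length
                break
    return list(reversed(words))

d = {"中华人民共和国", "人民", "共和国", "成立"}
print(reverse_max_match("中华人民共和国成立", d))
# → ['中华人民共和国', '成立']
```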
The Viterbi algorithm, in turn, solves the classical problem of selecting the optimal state sequence in a HMM (hidden Markov model). Mapped to an HMM, the part-of-speech tagging problem can be expressed as follows: the number of states (parts of speech) in the model is the number N of part-of-speech symbols, and the number of distinct symbols (words) that may be emitted from each state is the vocabulary size M. It is assumed that, statistically, the probability distribution of each part of speech depends only on the part of speech of the preceding word (a bigram over parts of speech), and that the probability distribution of each word depends only on its own part of speech.
In a preferred embodiment, the sentences are obtained by splitting the text at question marks, exclamation marks or full stops.
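Such punctuation-based sentence splitting can be sketched with a regular expression; the exact punctuation set below, covering both full-width and ASCII marks, is an assumption:

```python
import re

def split_sentences(text):
    """Split a text into sentences at question marks, exclamation marks
    and full stops (full-width and ASCII), dropping empty fragments."""
    parts = re.split(r"[？！。?!.]", text)
    return [p.strip() for p in parts if p.strip()]

print(split_sentences("今天天气好。出去玩吗？好！"))
# → ['今天天气好', '出去玩吗', '好']
```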
To use a deep learning algorithm, the text, sentences and words expressed in natural language must be represented numerically, so that the closeness of the categories expressed by the language can be measured directly by the closeness of the numerical values.
In step 102, each word in the text is indexed to obtain a corresponding word index. Here, index numbers are assigned in the order in which the words appear, increasing gradually from 1.
It should be noted that the index numbers may start from numerals other than 1, and arbitrary symbols other than numerals may also be used; this does not limit the scope of the invention.
Since each sentence is composed of words, after the indexes of all the words of the text are obtained, step 103 is performed: the word indexes of each sentence are combined into an index vector serving as the sentence index of that sentence.
For example, suppose a sentence in a text contains (A, B, C, D, E, F, G) after word segmentation, where each letter represents a word; a word is of course not limited to a letter and may be a Chinese phrase, an English word, etc. The indexes 1-7 are assigned to A-G respectively; indexes may also be assigned randomly rather than incrementally from 1. The sentence index of the sentence containing the words (A, B, C, D, E, F, G) is then the vector [1 2 3 4 5 6 7].
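The indexing of steps 102-103 can be sketched as follows, assigning indexes in order of first appearance starting from 1, as the example above describes:

```python
def build_index(words, start=1):
    """Step 102: assign increasing index numbers in order of first appearance."""
    index = {}
    for w in words:
        if w not in index:
            index[w] = start + len(index)
    return index

tokens = ["A", "B", "C", "D", "E", "F", "G"]
idx = build_index(tokens)
sentence_index = [idx[w] for w in tokens]  # step 103: the sentence index vector
# sentence_index == [1, 2, 3, 4, 5, 6, 7]
```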
After the sentence indexes are obtained, step 104 can be performed: the sentence indexes are input into the deep learning classification model to obtain the tag value corresponding to the deep learning classification information of each sentence.
Consider the tag values first. They are preset when the deep learning classification model is trained; for example, "1" represents sports texts, "2" science and technology texts, "3" literature and art texts, "4" military texts, and so on. The historical texts used for training are labeled according to their respective classification information; for example, all historical sports texts are uniformly labeled "1", all historical science and technology texts are uniformly labeled "2", etc. The deep learning model trained in this way maps text information to tag values.
With the tag values defined, continue the example above. The text contains sentence a: (A, B, C, D, E, F, G); in general it also contains many other sentences, for example sentence b: (F, G, U, P) and sentence c: (M, H, I, U, T, S). The sentence index of sentence b may be [6 7 17 18] and that of sentence c may be [504 505 506 17 508 509]. Step 104 inputs the three sentence indexes [1 2 3 4 5 6 7], [6 7 17 18] and [504 505 506 17 508 509] into the deep learning classification model, which correspondingly outputs the tag values of the three sentence classifications. For example, sentence a is given the tag value "1", sentence b the tag value "2" and sentence c the tag value "2"; that is, sentence a is a sports sentence while sentences b and c are science and technology sentences.
In step 105, the tag values corresponding to the sentence classification information of the text are counted, and the most frequent tag value is taken as the tag value corresponding to the deep learning classification information of the text, which serves as the tag value of the text classification information.
Continuing the previous example, the text has three sentences a, b and c in total, among which the tag value "1" occurs once and the tag value "2" occurs twice; the tag value corresponding to the final classification information of the text is therefore "2", i.e. the text is a science and technology text.
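The vote of step 105 reduces to picking the most common sentence tag value; a minimal sketch:

```python
from collections import Counter

def vote(sentence_tags):
    """Step 105: the tag value assigned to the most sentences labels the text."""
    return Counter(sentence_tags).most_common(1)[0][0]

# Sentences a, b, c received tag values 1 (sports), 2 (tech), 2 (tech):
print(vote([1, 2, 2]))  # → 2
```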
The deep learning classification model is obtained by training on historical texts. The training process, shown in Fig. 2, is carried out with the classifications of the historical texts already known, and includes:
201: performing word segmentation on each sentence in the historical texts;
202: indexing each word in the historical texts to obtain corresponding historical word indexes;
203: forming the historical word indexes of each sentence into a vector serving as the historical sentence index of that sentence;
204: training on the historical sentence indexes with a deep learning algorithm according to the historical text classification information, obtaining the deep learning classification model.
When training the classification model with deep learning technology, the sentence indexes are first processed with Word Embedding: each numerical value in each index vector is mapped into a multi-dimensional vector space, yielding a multi-dimensional representation of every value in the vector and hence a multi-dimensional representation of the index vector.
As an example of Word Embedding technology, suppose a sentence contains the words "A B C D E F G" with the corresponding sentence index vector [1 2 3 4 5 6 7], and each numerical element of that vector is to be given a vector representation. Applying Word Embedding to each element of [1 2 3 4 5 6 7] might finally yield: the vector corresponding to element "1" is [0.1 0.6 -0.5], the vector corresponding to element "2" is [-0.2 0.9 0.7], and so on. The reason for turning each index value into a vector is computational convenience: for instance, "find a synonym of word A" can be accomplished by "finding the vector most similar, under cosine distance, to the multi-dimensional vector corresponding to word A".
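A toy version of this embedding lookup can be sketched as follows. Real embedding vectors are learned during training; this sketch merely fills the table with fixed pseudo-random values to illustrate the index-to-vector lookup structure:

```python
import random

def embed(sentence_index, dim=3, seed=0):
    """Map each index value to a dim-dimensional vector via a lookup table.
    Real systems learn these vectors (Word Embedding) during training; here
    the table holds fixed pseudo-random values for illustration only."""
    rng = random.Random(seed)
    table = {}
    for i in sorted(set(sentence_index)):
        table[i] = [round(rng.uniform(-1, 1), 2) for _ in range(dim)]
    return [table[i] for i in sentence_index]

vecs = embed([1, 2, 3, 4, 5, 6, 7])
# vecs holds one 3-dimensional vector per index value
```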
In a preferred embodiment, each index value is expanded by Word Embedding technology to 4 dimensions, 128 dimensions, etc.
Step 105 in Fig. 1 is a voting process. Specifically, although the classification information of each sentence in the text characterizes the classification information of the whole text to some extent, the emphasis of each sentence differs, and no single sentence can completely characterize the classification information of the whole text. After the tag value corresponding to the deep learning classification information of each sentence is obtained, step 105 counts the occurrences of the tag values and takes the most frequent tag value as the tag value corresponding to the deep learning classification information of the text, which serves as the tag value of the text classification information.
In another preferred embodiment of the present invention, the text is classified simultaneously with a prior art SVM classifier and a Bayes classifier, and weights are finally assigned to the deep learning classification result, the SVM classification result and the Bayes classification result respectively to obtain the final classification result. This embodiment merges prior art classification methods with the classification method of the present invention, making full use of the advantages of the different classification methods.
Because deep learning classification has its own advantages, in a preferred embodiment the weight of the deep learning classification result is set higher than the weights of the SVM and Bayes classifications. On this basis, the weight of the SVM classification result can also be set higher than the weight of the Bayes classification result.
In another embodiment, the deep learning classification result, the SVM classification result and the Bayes classification result are given weights of 0.5, 0.3 and 0.2 respectively, so that the deep learning classification result plays the decisive role.
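The 0.5/0.3/0.2 weighting can be sketched as a weighted vote over the three classifier outputs; the numeric label encoding below is illustrative:

```python
from collections import defaultdict

def weighted_vote(predictions, weights=(0.5, 0.3, 0.2)):
    """Combine the (deep learning, SVM, Bayes) tag values with weights
    0.5/0.3/0.2; the tag with the highest total weight wins."""
    score = defaultdict(float)
    for label, w in zip(predictions, weights):
        score[label] += w
    return max(score, key=score.get)

# All three disagree: the deep learning result (weight 0.5) prevails.
print(weighted_vote([1, 2, 3]))  # → 1
# SVM agrees with the deep model: 0.5 + 0.3 outweighs Bayes's 0.2.
print(weighted_vote([2, 2, 1]))  # → 2
```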
It should be noted that in the present invention the text may be classified with the SVM classifier alone, or with the Bayes classifier alone, or with another existing method, or with two or more other existing methods; the deep learning classification result and the prior art classification results are then each given a weight to obtain the final classification result.
The prior art offers several techniques applicable to the word segmentation of step 101; in one embodiment, a reverse maximum matching algorithm and/or the Viterbi algorithm is used.
In one embodiment, the stop words in the text are removed before word segmentation. Stop words are words that occur very frequently in a text but carry little practical meaning, chiefly adverbs, function words, modal particles, etc., such as "是" (to be) and "但是" (but). Many ready-made stop-word dictionaries exist in the prior art; the stop words in a text can be removed simply by comparing against such a dictionary.
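Stop-word removal against such a dictionary amounts to a set lookup; the stop list below is a tiny illustrative subset, not a real stop-word dictionary:

```python
STOP_WORDS = {"的", "是", "但是", "了", "吗"}  # tiny illustrative stop list

def remove_stop_words(tokens):
    """Drop high-frequency, low-content words before indexing."""
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stop_words(["天气", "是", "好", "的"]))
# → ['天气', '好']
```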
A specific example of the method is as follows:
a) model training is performed on the Sogou corpus;
b) the data of 10 categories of the Sogou corpus are segmented into words, 780,000 words in total;
c) word indexing: in this example, word frequency statistics are computed over the 780,000 words and the words are sorted by frequency; the rank in the sorted result is used as the index number of each word, words of identical frequency being assigned consecutive index numbers in random order; the index 780001 is reserved.
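The frequency-rank indexing of step c) can be sketched as follows. The random assignment among equal-frequency words is simplified here to `Counter` ordering, and 780001 serves as the reserved index for words not found in the corpus (as step f) below uses it):

```python
from collections import Counter

def rank_index(corpus_tokens, oov_index=780001):
    """Index each word by its frequency rank (1 = most frequent); words of
    equal frequency get consecutive ranks. The reserved oov_index is
    returned for words absent from the corpus."""
    ranks = {w: r for r, (w, _) in
             enumerate(Counter(corpus_tokens).most_common(), start=1)}
    return lambda w: ranks.get(w, oov_index)

idx = rank_index(["a", "b", "a", "c", "b", "a"])
print(idx("a"), idx("b"), idx("c"), idx("zzz"))
# → 1 2 3 780001
```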
d) sentence indexing: every 150 consecutive words are treated as one sentence, the last sentence being cut short if fewer than 150 words remain; the text information is then converted into numerical information by indexing the sentences with the foregoing method;
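The 150-word chunking of step d) can be sketched as:

```python
def chunk_words(words, size=150):
    """Treat every run of `size` consecutive words as one 'sentence';
    a shorter final run is kept as its own chunk."""
    return [words[i:i + size] for i in range(0, len(words), size)]

chunks = chunk_words(list(range(330)), size=150)
print([len(c) for c in chunks])  # → [150, 150, 30]
```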
e) deep learning classification model training: the sentence indexes obtained are expanded with Word Embedding technology, and the expanded high-dimensional representations are input into an LSTM model for training, which yields the deep learning classification model;
f) indexing of the text to be classified: indexes may be assigned directly at random; alternatively, after segmenting the text to be classified, each of its words is semantically matched against the words of the Sogou corpus and the index of the matching corpus word is used as the index of the word in the text to be classified, the index of the word being set to the reserved 780001 when no corpus word matches;
g) the sentence indexes of the text to be classified are obtained with the foregoing method and input into the deep learning classification model to obtain the classification information index of each sentence; the voting mechanism then selects the most frequent sentence classification information index, which gives the classification information of the whole text.
The method provided by the present invention obtains an index vector through the word segmentation and word indexing operations; this numerical expression of the sentence index is what makes the use of deep learning technology possible. Meanwhile, the use of deep learning technology brings the feature representation of the text information to an optimum, so that the classification results for the text are also better than those of the conventional art. Furthermore, in order to exploit the existing SVM and Bayes classification methods, the present invention classifies simultaneously with the SVM and Bayes classification methods and finally assigns weights to the classification results of the three classes of methods to obtain the final classification information of the text.
The above-mentioned file classification method of correspondence, the present invention also provides a kind of device of text classification, in one embodiment, described Device refers to Fig. 3, each sentence for needing classifying text is carried out participle with word-dividing mode 301 first, in word-dividing mode 301 May include reverse maximum match participle unit and/or Viterbi participle units, respectively using reverse maximum match technology and/or Viterbi technologies carry out participle.
In one embodiment, described device also includes subordinate sentence module, according to the question mark in text, exclamation mark or fullstop by text Originally it is divided into sentence.
The indexation module 302 then indexes the segmented words: in the order in which the words occur, index numbers are assigned starting from 1 and increasing successively. The indexation module 302 further combines the word indexes of each sentence into an index vector, which serves as the sentence index of that sentence. All sentence indexes are then input into the classification module 303 to obtain the classification information of each sentence; the statistical module 304 then counts the tag values of the sentences and takes the most frequent tag value as the deep learning classification result of the text. Meanwhile, all sentence indexes are also input into the SVM classification module 305 and the Bayes classification module 306 for text classification.
The principle of Bayes classification is to compute, from the prior probability of an object, its posterior probability using the Bayes formula, i.e., the probability that the object belongs to a certain class, and to select the class with the maximum posterior probability as the class of the object. In other words, a Bayes classifier is optimal in the sense of minimum error rate. Four kinds of Bayes classifiers are currently well studied: Naive Bayes, TAN (tree-augmented naive Bayes), BAN (Bayesian-network-augmented naive Bayes), and GBN (general Bayesian network). Any one of them can be used in the present invention.
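As a hedged illustration of the first of the four variants, plain Naive Bayes, here is a minimal multinomial classifier over word-index sequences with add-one (Laplace) smoothing. The class, method names, and training data are invented for the example; the patent does not prescribe an implementation:

```python
import math
from collections import Counter

class NaiveBayes:
    """Minimal multinomial Naive Bayes over word-index sequences."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.priors = Counter(labels)                       # class frequencies
        self.counts = {c: Counter() for c in self.classes}  # per-class word counts
        self.totals = Counter()                             # per-class token totals
        self.vocab = set()
        for doc, c in zip(docs, labels):
            self.counts[c].update(doc)
            self.totals[c] += len(doc)
            self.vocab.update(doc)
        return self

    def predict(self, doc):
        """Return the class with maximum posterior (computed in log space)."""
        v, n = len(self.vocab), sum(self.priors.values())
        best, best_lp = None, float("-inf")
        for c in self.classes:
            lp = math.log(self.priors[c] / n)  # log prior
            for w in doc:                      # add-one smoothed log likelihoods
                lp += math.log((self.counts[c][w] + 1) / (self.totals[c] + v))
            if lp > best_lp:
                best, best_lp = c, lp
        return best

train_docs = [[1, 2, 1], [1, 1], [3, 4], [4, 4, 3]]
train_labels = [0, 0, 1, 1]
nb = NaiveBayes().fit(train_docs, train_labels)
print(nb.predict([1, 2]))  # → 0
```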
The main ideas of SVM can be summarized in two points: (1) it analyzes the linearly separable case; for the linearly inseparable case, samples in a low-dimensional input space that are linearly inseparable are transformed into a high-dimensional feature space through a nonlinear mapping so that they become linearly separable, which makes it possible to analyze the nonlinear characteristics of the samples linearly in the high-dimensional feature space using a linear algorithm; (2) based on the structural risk minimization theory, an optimal separating hyperplane is constructed in the feature space, so that the learner achieves global optimization and the expected risk over the whole sample space satisfies a certain upper bound with a certain probability. Packaged SVM classification modules exist in the prior art and can be called directly.
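The description calls for a packaged SVM module, but the hinge-loss and margin idea behind point (2) can be illustrated with a tiny linear SVM trained by subgradient descent on a toy dataset. This is an assumption-laden sketch for exposition, not the packaged module the patent would call:

```python
def train_linear_svm(points, labels, lam=0.01, lr=0.1, epochs=200):
    """Fit (w, b) by subgradient descent on hinge loss + L2 penalty."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:  # inside the margin: step toward y * x, with decay
                w = [wi - lr * (lam * wi - y * xi) for wi, xi in zip(w, x)]
                b += lr * y
            else:           # correctly classified with margin: only decay w
                w = [wi * (1 - lr * lam) for wi in w]
    return w, b

def svm_predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

pts = [(2.0, 2.0), (1.0, 2.0), (-2.0, -1.0), (-1.0, -2.0)]
ys = [1, 1, -1, -1]
w, b = train_linear_svm(pts, ys)
print([svm_predict(w, b, p) for p in pts])  # expected to recover [1, 1, -1, -1]
```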
The statistical module 304, the SVM classification module 305, and the Bayes classification module 306 each compute their own classification results. Weight 1 is then assigned to the classification result of the statistical module 304, weight 2 to that of the SVM classification module 305, and weight 3 to that of the Bayes classification module 306, to obtain the final text classification result. In one embodiment, the weight module assigns weight 1 greater than weight 2 and also greater than weight 3, and weight 2 greater than weight 3.
Weight 1, weight 2, and weight 3 here may be adjusted appropriately according to actual usage. In one embodiment, weight 1, weight 2, and weight 3 are 0.5, 0.3, and 0.2, respectively.
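With weights 0.5, 0.3, and 0.2, the final label can be chosen as the one accumulating the greatest total weight across the three classifiers. The patent does not specify tie-breaking (0.5 exactly equals 0.3 + 0.2), so the sketch below simply keeps the first label reaching the maximum, which favors the deep learning result on ties; that behavior is an assumption:

```python
def combine(deep_label, svm_label, bayes_label, weights=(0.5, 0.3, 0.2)):
    """Weighted vote: each classifier's label receives its weight;
    the label with the greatest accumulated weight wins."""
    scores = {}
    for label, w in zip((deep_label, svm_label, bayes_label), weights):
        scores[label] = scores.get(label, 0.0) + w
    return max(scores, key=scores.get)

# SVM and Bayes disagree; deep learning plus SVM give label 2 a score of 0.8.
print(combine(2, 2, 1))  # → 2
```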
The classification module 303 actually classifies the text using the deep learning classification model therein, and the deep learning classification model is obtained by the training module through training on history text. Corresponding to the foregoing method, the word segmentation module 301 is also used to segment each sentence in the history text; the indexation module 302 is also used to index each word in the history text, obtaining the corresponding history word indexes, and to form the history word indexes of each sentence into a vector serving as the history sentence index of that sentence. Inputting the history sentence indexes into the training module for training based on deep learning yields the deep learning classification model.
In one embodiment, the device also includes a stop-word removal module, which includes some existing stop-word dictionaries; before the text is segmented, stop words in the text are removed according to the stop-word dictionaries.
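Stop-word removal can be sketched as a simple filter. The stop-word set here is illustrative, whereas the patent loads existing stop-word dictionaries; also, the description removes stop words from the raw text before segmentation, so filtering tokens as below is a simplification:

```python
STOP_WORDS = {"the", "a", "of", "is"}  # illustrative; real systems load stop-word dictionaries

def remove_stop_words(words, stop_words=STOP_WORDS):
    """Drop stop words so they do not influence indexing or classification."""
    return [w for w in words if w not in stop_words]

print(remove_stop_words(["the", "classification", "of", "text"]))
# → ['classification', 'text']
```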
As described above, the present invention segments the text, indexes the words, and then obtains sentence indexes, thereby digitizing the sentences into index values, so that the sentence indexes can be input into the deep learning classification model to obtain deep learning classification information. The deep learning classification model itself is obtained, through the same process, by training on history text information.
To exploit the advantages of existing SVM and Bayes classification methods, classification is also performed with the SVM method and the Bayes method; the results obtained by the three classification methods are assigned weights to obtain the final classification information of the text, which increases the fault tolerance of the classification method.
The foregoing description of the disclosure is provided to enable any person skilled in the art to make or use the disclosure. Various modifications to the disclosure will be apparent to those skilled in the art, and the general principles defined herein may be applied to other variants without departing from the spirit or scope of the disclosure. Thus, the disclosure is not intended to be limited to the examples and designs described herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (16)

1. A text classification method, characterized in that the method comprises:
segmenting each sentence in a text to be classified into words;
indexing each word in the text to obtain corresponding word indexes;
combining the word indexes of each sentence into an index vector serving as the sentence index of that sentence;
inputting the sentence indexes into a deep learning classification model to obtain the tag value corresponding to the deep learning classification information of each sentence;
counting the tag values and taking the most frequent tag value as the tag value corresponding to the deep learning classification information of the text, the tag value corresponding to the deep learning classification information serving as the tag value of the classification information of the text.
2. The method of claim 1, characterized in that the training method of the deep learning classification model comprises:
segmenting each sentence in a history text into words;
indexing each word in the history text to obtain corresponding history word indexes;
forming the history word indexes in each sentence into a vector serving as the history sentence index of that sentence;
training on the history sentence indexes based on a deep learning algorithm according to the classification information of the history text, to obtain the deep learning classification model.
3. The method of claim 1, characterized in that the sentences are obtained by splitting the text at the question marks, exclamation marks, or full stops in the text.
4. The method of claim 1, characterized in that the method further comprises:
inputting the sentence indexes into an SVM classifier and a Bayes classifier, respectively, to classify the text, obtaining a tag value corresponding to the SVM classification information of the text and a tag value corresponding to the Bayes classification information of the text;
assigning weights respectively to the tag value corresponding to the classification information of the text, the tag value corresponding to the SVM classification information of the text, and the tag value corresponding to the Bayes classification information of the text, and deriving the tag value corresponding to the final classification information of the text according to the weights.
5. The method of claim 4, characterized in that said assigning weights respectively to the tag value corresponding to the classification information of the text, the tag value corresponding to the SVM classification information of the text, and the tag value corresponding to the Bayes classification information of the text comprises:
assigning the tag value corresponding to the classification information of the text a weight greater than that of the tag value corresponding to the SVM classification information of the text, and assigning the tag value corresponding to the SVM classification information of the text a weight greater than that of the tag value corresponding to the Bayes classification information of the text.
6. The method of claim 5, characterized in that said assigning weights respectively to the tag value corresponding to the classification information of the text, the tag value corresponding to the SVM classification information of the text, and the tag value corresponding to the Bayes classification information of the text comprises:
assigning weights of 0.5, 0.3, and 0.2 respectively to the tag value corresponding to the classification information of the text, the tag value corresponding to the SVM classification information of the text, and the tag value corresponding to the Bayes classification information of the text.
7. The method of claim 1, characterized in that said segmenting each sentence in the text to be classified into words comprises:
performing word segmentation using a reverse maximum matching algorithm or a Viterbi algorithm.
8. The method of claim 1, characterized in that, before the word segmentation, the method further comprises:
removing stop words from the text.
9. A text classification device, characterized in that the device comprises:
a word segmentation module, configured to segment each sentence in a text to be classified into words;
an indexation module, configured to index each word in the text to obtain corresponding word indexes, and to combine the word indexes of each sentence into an index vector serving as the sentence index of that sentence;
a classification module, configured to input the sentence indexes into a deep learning classification model, the deep learning classification model outputting the tag value corresponding to the deep learning classification information of each sentence;
a statistical module, configured to count the tag values and take the most frequent tag value as the tag value corresponding to the deep learning classification information of the text, the tag value corresponding to the deep learning classification information serving as the tag value of the classification information of the text.
10. The device of claim 9, characterized in that:
the word segmentation module is further configured to segment each sentence in a history text into words;
the indexation module is further configured to index each word in the history text to obtain corresponding history word indexes, and to form the history word indexes in each sentence into a vector serving as the history sentence index of that sentence;
the device further comprises:
a training module, configured to train on the history sentence indexes based on a deep learning algorithm according to the classification information of the history text, to obtain the deep learning classification model.
11. The device of claim 9, characterized in that the device further comprises:
a sentence-splitting module, configured to split the text into sentences at the question marks, exclamation marks, or full stops in the text.
12. The device of claim 9, characterized in that the classification module further comprises:
an SVM classification unit, configured to input the sentence indexes into an SVM classifier and obtain the tag value corresponding to the SVM classification information of the text;
a Bayes classification unit, configured to input the sentence indexes into a Bayes classifier and obtain the tag value corresponding to the Bayes classification information of the text;
the device further comprises:
a weight module, configured to assign weights respectively to the tag value corresponding to the deep learning classification information of the text, the tag value corresponding to the SVM classification information of the text, and the tag value corresponding to the Bayes classification information of the text, and to derive the tag value corresponding to the final classification information of the text according to the weights.
13. The device of claim 12, characterized in that the weight module assigns the tag value corresponding to the classification information of the text a weight greater than that of the tag value corresponding to the SVM classification information of the text, and assigns the tag value corresponding to the SVM classification information of the text a weight greater than that of the tag value corresponding to the Bayes classification information of the text.
14. The device of claim 13, characterized in that the weight module is further configured to:
assign weights of 0.5, 0.3, and 0.2 respectively to the tag value corresponding to the deep learning classification information of the text, the tag value corresponding to the SVM classification information of the text, and the tag value corresponding to the Bayes classification information of the text, and derive the tag value corresponding to the final classification information of the text according to the weights.
15. The device of claim 9, characterized in that the word segmentation module further comprises one or both of the following units:
a reverse maximum matching segmentation unit, which performs word segmentation using reverse maximum matching;
a Viterbi segmentation unit, which performs word segmentation using a Viterbi algorithm.
16. The device of claim 9, characterized in that the device further comprises:
a stop-word removal module, configured to remove stop words from the text.
CN201610976847.2A 2016-11-07 2016-11-07 Text classification method and device Pending CN106528776A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610976847.2A CN106528776A (en) 2016-11-07 2016-11-07 Text classification method and device


Publications (1)

Publication Number Publication Date
CN106528776A true CN106528776A (en) 2017-03-22

Family

ID=58350249

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610976847.2A Pending CN106528776A (en) 2016-11-07 2016-11-07 Text classification method and device

Country Status (1)

Country Link
CN (1) CN106528776A (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103902570A (en) * 2012-12-27 2014-07-02 腾讯科技(深圳)有限公司 Text classification feature extraction method, classification method and device
CN105243130A (en) * 2015-09-29 2016-01-13 中国电子科技集团公司第三十二研究所 Text processing system and method for data mining
CN105930503A (en) * 2016-05-09 2016-09-07 清华大学 Combination feature vector and deep learning based sentiment classification method and device


Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107038152A (en) * 2017-03-27 2017-08-11 成都优译信息技术股份有限公司 Text punctuate method and system for drawing typesetting
CN107403375A (en) * 2017-04-19 2017-11-28 北京文因互联科技有限公司 A kind of listed company's bulletin classification and abstraction generating method based on deep learning
CN107657313A (en) * 2017-09-26 2018-02-02 上海数眼科技发展有限公司 The transfer learning system and method for the natural language processing task adapted to based on field
CN107657313B (en) * 2017-09-26 2021-05-18 上海数眼科技发展有限公司 System and method for transfer learning of natural language processing task based on field adaptation
CN109858006B (en) * 2017-11-30 2021-04-09 亿度慧达教育科技(北京)有限公司 Subject identification training method and device
CN109858006A (en) * 2017-11-30 2019-06-07 亿度慧达教育科技(北京)有限公司 Subject recognition training method, apparatus
CN108376287A (en) * 2018-03-02 2018-08-07 复旦大学 Multi-valued attribute segmenting device based on CN-DBpedia and method
CN110309251B (en) * 2018-03-12 2024-01-12 北京京东尚科信息技术有限公司 Text data processing method, device and computer readable storage medium
CN110309251A (en) * 2018-03-12 2019-10-08 北京京东尚科信息技术有限公司 Processing method, device and the computer readable storage medium of text data
CN108829818A (en) * 2018-06-12 2018-11-16 中国科学院计算技术研究所 A kind of file classification method
CN108875024B (en) * 2018-06-20 2020-10-20 清华大学深圳研究生院 Text classification method and system, readable storage medium and electronic equipment
CN108875024A (en) * 2018-06-20 2018-11-23 清华大学深圳研究生院 File classification method, system, readable storage medium storing program for executing and electronic equipment
CN109522407A (en) * 2018-10-26 2019-03-26 平安科技(深圳)有限公司 Business connection prediction technique, device, computer equipment and storage medium


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170322