CN108763539A - Text classification method and system based on part-of-speech classification - Google Patents

Text classification method and system based on part-of-speech classification

Info

Publication number
CN108763539A
Authority
CN
China
Prior art keywords
text
classification
word
training
word set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810551315.3A
Other languages
Chinese (zh)
Other versions
CN108763539B (en)
Inventor
周可
李兴
曾江峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN201810551315.3A priority Critical patent/CN108763539B/en
Publication of CN108763539A publication Critical patent/CN108763539A/en
Application granted granted Critical
Publication of CN108763539B publication Critical patent/CN108763539B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Abstract

The invention discloses a text classification method based on part-of-speech classification, comprising: obtaining a training text set and a test text set from the network and preprocessing them, so as to obtain multiple word sets for each text in the training text set and the test text set; using the multiple word sets of each text as input to train the topic model LDA, so as to obtain the text-word set-topic mixture probability distribution model of each text under different topic numbers; using the SVM-train function to perform classifier training on the multiple text-word set-topic mixture probability distribution models, so as to obtain multiple trained classifiers; and using the multiple text-word set-topic mixture probability distribution models as input to the trained classifiers to perform SVM class prediction. The invention solves the technical problems of existing methods that the dimensionality of the feature words required for model training is high, the classification accuracy is low, and the generalization ability of the classifier is poor.

Description

Text classification method and system based on part-of-speech classification
Technical field
The invention belongs to the technical field of computer deep learning, and more particularly relates to a text classification method and system based on part-of-speech classification.
Background art
With the wide use of all kinds of social and self-media software, the data generated every day by Internet platforms are growing rapidly. These data mainly comprise pictures, voice and text, with text dominating. Classifying these massive data manually, by screening and extraction, takes a great deal of time and effort, and the classification results are often unsatisfactory. Text classification technology came into being precisely to improve the efficiency and accuracy of text classification.
Existing text classification methods mainly use the topic model Latent Dirichlet Allocation (LDA), which can identify the topic information hidden in a large-scale document collection or corpus. It adopts the bag-of-words method, which treats each document as a word-frequency vector, thereby converting the text information into digital information that is easy to model.
However, the above method has defects that cannot be ignored. First, whether in the classifier training stage or in the classification stage, it requires the participation of all word sets, so that the dimensionality of the feature words required for model training is high. In addition, it does not take into account that different parts of speech and part-of-speech combinations contribute differently to text classification, which leads to low classification accuracy and poor generalization ability of the classifier.
Summary of the invention
In view of the above defects or improvement needs of the prior art, the present invention provides a text classification method and system based on part-of-speech classification, aiming to solve the technical problems of existing methods that the dimensionality of the feature words required for model training is high, the classification accuracy is low, and the generalization ability of the classifier is poor.
To achieve the above object, according to one aspect of the present invention, a text classification method based on part-of-speech classification is provided, comprising:
One, a text classifier construction process, which specifically includes the following steps:
(1) obtaining a training text set and a test text set from the network, and preprocessing them so as to obtain multiple word sets for each text in the training text set and the test text set;
(2) using the multiple word sets of each text obtained in step (1) as input to train the topic model LDA, so as to obtain the text-word set-topic mixture probability distribution model of each text under different topic numbers;
(3) using the SVM-train function to perform classifier training on the multiple text-word set-topic mixture probability distribution models of each word set of the training text set under different topic numbers, so as to obtain multiple trained classifiers;
(4) using the multiple text-word set-topic mixture probability distribution models of each class of text in the test text set under different topic numbers as input to the classifiers trained in step (3) and performing SVM class prediction; obtaining, according to the SVM class prediction results and the true class of each class of text in the test text set, the macro F1 value of class prediction of each word set of the test text set under different topic numbers; selecting the largest several values from these macro F1 values, and establishing multiple text classifiers according to the text-word set-topic mixture probability distribution models, corresponding word sets and corresponding topic numbers associated with these values;
Two, a text classification process, which specifically includes the following steps:
(1') obtaining a target text to be classified and preprocessing it so as to obtain multiple word sets of the target text, and keeping, among the obtained word sets, the word sets corresponding to the text classifiers obtained in step (4);
(2') using the multiple word sets of the target text retained in step (1') as input to train the topic model LDA, so as to obtain the text-word set-topic mixture probability distribution model of each word set of the target text under its corresponding topic number;
(3') using the text-word set-topic mixture probability distribution model of each word set of the target text obtained in step (2') under its corresponding topic number as input to the corresponding text classifier obtained in step (4) and performing class prediction, so as to obtain the class prediction result of each text classifier;
(4') obtaining the final classification result of the target text according to the class prediction results of the text classifiers obtained in step (3'), combined with the weight values pre-assigned to the text classifiers.
Preferably, step (1) specifically includes the following sub-steps:
(1-1) performing word segmentation and part-of-speech tagging on each text in the training text set and the test text set, so as to obtain a mapping table between segmentation results and parts of speech;
(1-2) removing stop words from the mapping table between segmentation results and parts of speech obtained in step (1-1), so as to obtain an updated mapping table;
(1-3) dividing the updated mapping table obtained in step (1-2) according to part of speech and discarding the part-of-speech tags, so as to obtain multiple word sets.
Preferably, the multiple word sets obtained in step (1-3) include a noun word set, a verb word set, a noun-verb combination word set, an other-words set, and an all-words set.
Preferably, the method used to train the LDA model is the Gibbs sampling algorithm; the input of the algorithm is the multiple word sets of each text in the training text set together with the hyperparameters, and the output is the probability distribution of each word set of each text under different topic numbers.
Preferably, the function used for SVM class prediction is the svm-predict function based on the LIBSVM toolkit, and the SVM class prediction algorithm selected is the one-versus-one SVM multi-class classification method.
Preferably, the macro F1 value of class prediction of each word set of the test text set under different topic numbers is obtained with the following calculation formula:
MacroF1 = (1/n) * Σ_{i=1}^{n} F1_i
where n indicates the total number of text classes, and F1_i indicates the F1 value of the i-th class, i = 1, ..., n.
Preferably, the calculation formula of the F1 value of the i-th class is as follows:
F1_i = 2 * P_i * R_i / (P_i + R_i)
where P_i is the precision and R_i is the recall.
The calculation formula of the precision P_i is as follows:
P_i = a_i / (a_i + b_i)
The calculation formula of the recall R_i is as follows:
R_i = a_i / (a_i + c_i)
where a_i indicates the number of texts whose SVM-predicted class is C (C denoting the i-th class) and whose true class is also C; b_i indicates the number of texts whose SVM-predicted class is C but whose true class is not C; and c_i indicates the number of texts whose SVM-predicted class is not C but whose true class is C.
Preferably, step (4') is specifically: if one text classifier is obtained in step (4), the class prediction result obtained in step (3') is the final classification result of the target text; if three text classifiers are obtained in step (4), then, if the class prediction results of the three text classifiers obtained in step (3') are all the same, this class prediction result is taken as the final classification result; if the class predictions of two of the text classifiers are the same, the class prediction result of these two text classifiers is taken as the final classification result; and if the class predictions of the three text classifiers are all different, the class prediction result of the text classifier corresponding to the largest macro F1 value obtained in step (4) is taken as the final classification result.
According to another aspect of the present invention, a text classification system based on part-of-speech classification is provided, comprising:
a text classifier construction module, which specifically includes:
a first submodule for obtaining a training text set and a test text set from the network and preprocessing them, so as to obtain multiple word sets for each text in the training text set and the test text set;
a second submodule for using the multiple word sets of each text obtained by the first submodule as input to train the topic model LDA, so as to obtain the text-word set-topic mixture probability distribution model of each text under different topic numbers;
a third submodule for using the SVM-train function to perform classifier training on the multiple text-word set-topic mixture probability distribution models of each word set of the training text set under different topic numbers, so as to obtain multiple trained classifiers;
a fourth submodule for using the multiple text-word set-topic mixture probability distribution models of each class of text in the test text set under different topic numbers as input to the classifiers trained in the third submodule and performing SVM class prediction; obtaining, according to the SVM class prediction results and the true class of each class of text in the test text set, the macro F1 value of class prediction of each word set of the test text set under different topic numbers; selecting the largest several values from these macro F1 values; and establishing multiple text classifiers according to the text-word set-topic mixture probability distribution models, corresponding word sets and corresponding topic numbers associated with these values;
a text classification module, which specifically includes:
a fifth submodule for obtaining a target text to be classified and preprocessing it, so as to obtain multiple word sets of the target text, and keeping, among the obtained word sets, the word sets corresponding to the text classifiers obtained by the fourth submodule;
a sixth submodule for using the multiple word sets of the target text retained by the fifth submodule as input to train the topic model LDA, so as to obtain the text-word set-topic mixture probability distribution model of each word set of the target text under its corresponding topic number;
a seventh submodule for using the text-word set-topic mixture probability distribution model of each word set of the target text obtained by the sixth submodule under its corresponding topic number as input to the corresponding text classifier obtained by the fourth submodule and performing class prediction, so as to obtain the class prediction result of each text classifier;
an eighth submodule for obtaining the final classification result of the target text according to the class prediction results of the text classifiers obtained by the seventh submodule, combined with the weight values pre-assigned to the text classifiers.
In general, compared with the prior art, the above technical solutions conceived by the present invention can achieve the following beneficial effects:
1. The present invention fully takes into account that words of different parts of speech contribute differently to semantic expression: it divides the training text set into different word sets according to part of speech and part-of-speech combination, and further prunes the preprocessed word sets according to the word sets corresponding to the text classifiers constructed in step (4), thereby achieving the purpose of dimensionality reduction;
2. When making a decision on the classification of an unknown text, the present invention considers that different word sets contribute differently to semantic expression, and that the contribution of certain word sets to semantic expression even exceeds that of the all-words set; it therefore uses step (4) to select the word sets corresponding to the larger macro F1 values, and can thus improve the accuracy of text classification and the generalization ability of the text classifier.
Description of the drawings
Fig. 1 is a flow chart of the text classification method based on part-of-speech classification according to the present invention.
Detailed description of the embodiments
In order to make the objectives, technical solutions and advantages of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present invention and are not intended to limit it. In addition, the technical features involved in the various embodiments of the present invention described below can be combined with each other as long as they do not conflict with each other.
As shown in Fig. 1, the text classification method based on part-of-speech classification of the present invention includes:
One, a text classifier construction process, which specifically includes the following steps:
(1) Obtain a training text set and a test text set from the network, and preprocess them so as to obtain multiple word sets for each text in the training text set and the test text set.
The test text set is a set of texts of known classes containing at least 100 texts; it is used in the subsequent steps to screen text classifiers.
This step (1) specifically includes the following sub-steps:
(1-1) Perform word segmentation and part-of-speech tagging on each text in the training text set and the test text set, so as to obtain a mapping table between segmentation results and parts of speech.
Specifically, the parts of speech are noun, verb, adjective, adverb, etc.
For example, if a text in the text set contains the sentence "The engineer writes a confession letter.", then after this step the mapping table obtained is:
engineer: noun
write: verb
confession letter: noun
The detailed process of word segmentation is as follows: for alphabetic languages, segmentation mainly divides a sentence into single words according to the spaces; for Chinese, segmentation cuts a sequence of Chinese characters into individual words.
More specifically, for the Chinese text classification method based on part-of-speech classification in this step, the Chinese word segmentation and part-of-speech tagging tool used is the Chinese lexical analysis system NLPIR developed by the Institute of Computing Technology, Chinese Academy of Sciences.
(1-2) Remove stop words from the mapping table between segmentation results and parts of speech obtained in step (1-1), so as to obtain an updated mapping table.
Specifically, stop words include garbled characters, English characters, single digits, mathematical symbols, punctuation marks, and high-frequency single characters.
For the example in step (1-1), the result obtained after this step is:
engineer: noun
write: verb
confession letter: noun
(1-3) Divide the updated mapping table obtained in step (1-2) according to part of speech and discard the part-of-speech tags, so as to obtain multiple word sets.
Specifically, the multiple word sets obtained include a noun word set, a verb word set, a noun-verb combination word set, an other-words set, and an all-words set.
In the noun word set, every word of each text is a noun; correspondingly, the verb word set contains all the verbs; the noun-verb combination word set contains both the nouns and the verbs; the other-words set contains the adjectives and adverbs; and the all-words set contains the nouns, verbs, adjectives and adverbs.
For the example in step (1-1) above, the noun word set obtained in this step is:
engineer
confession letter
The verb word set obtained is:
write
The noun-verb combination word set obtained is:
engineer
write
confession letter
The other-words set is empty, and the all-words set is:
engineer
write
confession letter
The division process of this step is specifically: construct regular expressions capable of matching the words of different parts of speech in the text, and directly divide all the words of the original text according to the part-of-speech tagging results. A minimal sketch of sub-steps (1-1) to (1-3) follows.
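The sketch below illustrates these three sub-steps in Python. It uses the open-source jieba library's posseg module as a stand-in for NLPIR, an illustrative example sentence, and a placeholder stop-word list; the POS tag prefixes 'n', 'v', 'a' and 'd' and the single-character filter are assumptions, not the patent's exact rules.

```python
# Illustrative sketch of sub-steps (1-1)-(1-3); jieba stands in for NLPIR.
import jieba.posseg as pseg

STOP_WORDS = {"的", "了", "是"}  # placeholder high-frequency stop words

def build_word_sets(text):
    # (1-1) segment and POS-tag, keeping a word -> part-of-speech mapping
    pairs = [(p.word, p.flag) for p in pseg.cut(text)]
    # (1-2) reject stop words, digits and single characters (illustrative rules)
    pairs = [(w, t) for w, t in pairs
             if w not in STOP_WORDS and len(w) > 1 and not w.isdigit()]
    # (1-3) divide by part of speech and discard the tags themselves
    nouns = [w for w, t in pairs if t.startswith("n")]
    verbs = [w for w, t in pairs if t.startswith("v")]
    others = [w for w, t in pairs if t.startswith(("a", "d"))]
    return {"noun": nouns, "verb": verbs, "noun_verb": nouns + verbs,
            "other": others, "all": nouns + verbs + others}

print(build_word_sets("工程师编写文档。"))  # illustrative sentence, not the patent's example
```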
(2) Use the multiple word sets of each text obtained in step (1) (the noun word set, verb word set, noun-verb combination word set, other-words set, and all-words set) as input to train the topic model Latent Dirichlet Allocation (LDA), so as to obtain the text-word set-topic mixture probability distribution model of each text under different topic numbers.
Specifically, the method used to train the LDA model is the Gibbs sampling algorithm; the input of the algorithm is the multiple word sets of each text in the training text set together with the hyperparameters, and the output is the topic feature (i.e., the probability distribution) of each word set of each text under different topic numbers.
Specifically, the LDA model mainly has the hyperparameters α, β, K and iter_number, where α is the prior parameter of the text-topic distribution, β is the prior parameter of the topic-word distribution, K is the manually set topic number, whose value range is [10, 150] with a step size of 10, and iter_number is the number of Gibbs sampling iterations, which is greater than 1000.
A large number of experiments show that with α = 50/K and β = 0.01, when the number of Gibbs sampling iterations is greater than or equal to 1000, the topic distribution of the text set tends to be stable. In order to give the Markov chain a better convergence effect, the present invention takes the number of iterations as 1500. The topic number K is a manually set value that needs to be determined according to the experimental results.
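As an illustration of this training step, the sketch below sweeps K from 10 to 150 in steps of 10 with α = 50/K, β = 0.01 (gensim's eta) and 1500 iterations. It uses gensim's LdaModel, which fits LDA by variational inference rather than the Gibbs sampling described here, so it is a substitute implementation; all function and variable names are illustrative.

```python
# Sketch of LDA training over the topic-number grid K = 10, 20, ..., 150.
from gensim import corpora
from gensim.models import LdaModel

def train_lda_grid(tokenized_docs, topic_counts=range(10, 151, 10)):
    dictionary = corpora.Dictionary(tokenized_docs)           # word <-> id map
    corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
    models = {}
    for k in topic_counts:
        models[k] = LdaModel(corpus=corpus, id2word=dictionary,
                             num_topics=k,
                             alpha=50.0 / k, eta=0.01,        # priors from above
                             iterations=1500, random_state=0)
    return dictionary, corpus, models
```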
(3) Use the SVM-train function to perform classifier training on the multiple text-word set-topic mixture probability distribution models of each word set of the training text set under different topic numbers, so as to obtain multiple trained classifiers.
(4) Use the multiple text-word set-topic mixture probability distribution models of each class of text in the test text set under different topic numbers as input to the classifiers trained in step (3) and perform SVM class prediction; according to the SVM class prediction results and the true class of each class of text in the test text set, obtain the macro F1 value of class prediction of each word set of the test text set under different topic numbers; select the largest several values from these macro F1 values (in this embodiment, one or three values are selected), and establish multiple text classifiers according to the text-word set-topic mixture probability distribution models, corresponding word sets and corresponding topic numbers associated with these values; the number of text classifiers is the same as the number of values selected.
Specifically, the classes in the training data are, for example, history texts, art texts, military texts, etc.
The function used for SVM class prediction is the svm-predict function based on the LIBSVM toolkit, and the SVM class prediction algorithm selected is the one-versus-one SVM multi-class classification method; this method first selects the data of two classes, takes one of the classes as the positive class and the other as the negative class, and trains a classifier on these two classes of data. A sketch of this training step is given below.
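The sketch below illustrates training such a classifier on the document-topic mixtures. It uses scikit-learn's SVC, whose multi-class strategy is the same one-versus-one scheme, in place of LIBSVM's svm-train; the feature-extraction helper assumes a trained gensim model from the previous sketch, and all names are illustrative.

```python
# Sketch: turn document-topic mixtures into a feature matrix and train an SVM.
import numpy as np
from sklearn.svm import SVC

def doc_topic_matrix(lda, corpus, k):
    # One row per text, one column per topic: the text-topic mixture.
    X = np.zeros((len(corpus), k))
    for i, bow in enumerate(corpus):
        for topic, prob in lda.get_document_topics(bow, minimum_probability=0.0):
            X[i, topic] = prob
    return X

def train_svm(lda, corpus, labels, k):
    X = doc_topic_matrix(lda, corpus, k)
    clf = SVC(kernel="rbf", decision_function_shape="ovo")  # one-versus-one
    clf.fit(X, labels)
    return clf
```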
The macro F1 value of class prediction of each word set of the test text set under different topic numbers is obtained with the following calculation formula:
MacroF1 = (1/n) * Σ_{i=1}^{n} F1_i
where n indicates the total number of text classes, and F1_i indicates the F1 value of the i-th class, i = 1, ..., n.
The calculation formula of the F1 value of the i-th class is as follows:
F1_i = 2 * P_i * R_i / (P_i + R_i)
where P_i is the precision and R_i is the recall.
The calculation formula of the precision P_i is as follows:
P_i = a_i / (a_i + b_i)
The calculation formula of the recall R_i is as follows:
R_i = a_i / (a_i + c_i)
where a_i indicates the number of texts whose SVM-predicted class is C (C denoting the i-th class) and whose true class is also C; b_i indicates the number of texts whose SVM-predicted class is C but whose true class is not C; and c_i indicates the number of texts whose SVM-predicted class is not C but whose true class is C.
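These definitions translate directly into code; the sketch below is one such direct translation, assuming parallel lists of true and predicted labels.

```python
# Sketch of the macro F1 computation from the definitions of a_i, b_i, c_i.
def macro_f1(y_true, y_pred, classes):
    f1_values = []
    for c in classes:
        a = sum(1 for t, p in zip(y_true, y_pred) if p == c and t == c)
        b = sum(1 for t, p in zip(y_true, y_pred) if p == c and t != c)
        c_ = sum(1 for t, p in zip(y_true, y_pred) if p != c and t == c)
        precision = a / (a + b) if a + b else 0.0
        recall = a / (a + c_) if a + c_ else 0.0
        f1_values.append(2 * precision * recall / (precision + recall)
                         if precision + recall else 0.0)
    return sum(f1_values) / len(f1_values)
```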
Two, a text classification process, which specifically includes the following steps:
(1') Obtain a target text to be classified and preprocess it so as to obtain multiple word sets of the target text, and keep, among the obtained word sets, the word sets corresponding to the text classifiers obtained in step (4).
Specifically, the preprocessing in this step is essentially the same as in step (1) above; the only difference is that, on the basis of the processing result of step (1-3), only the word sets corresponding to the text classifiers obtained in step (4) above are retained.
(2') Use the multiple word sets of the target text retained in step (1') as input to train the topic model Latent Dirichlet Allocation (LDA), so as to obtain the text-word set-topic mixture probability distribution model of each word set of the target text under its corresponding topic number.
Specifically, the method used to train the LDA model is the Gibbs sampling algorithm; the input of the algorithm is the multiple word sets of the target text together with the hyperparameters, and the output is the topic feature (i.e., the probability distribution) of each word set of the target text under different topic numbers.
Specifically, the LDA model mainly has the hyperparameters α, β, K and iter_number, where α is the prior parameter of the text-topic distribution, β is the prior parameter of the topic-word distribution, K is the manually set topic number, whose value range is [10, 150] with a step size of 10, and iter_number is the number of Gibbs sampling iterations, which is greater than 1000.
A large number of experiments show that with α = 50/K and β = 0.01, when the number of Gibbs sampling iterations is greater than or equal to 1000, the topic distribution of the text set tends to be stable. In order to give the Markov chain a better convergence effect, the present invention takes the number of iterations as 1500. The topic number K is a manually set value that needs to be determined according to the experimental results.
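In the gensim setting of the earlier sketches, step (2') can be approximated by folding the target text into the already trained model with get_document_topics instead of re-running the training described above; this substitution, and all names below, are assumptions.

```python
# Sketch: topic mixture of a new (target) text under a trained gensim model.
import numpy as np

def target_topic_vector(lda, dictionary, tokens, k):
    bow = dictionary.doc2bow(tokens)            # tokens: one word set of the text
    vec = np.zeros(k)
    for topic, prob in lda.get_document_topics(bow, minimum_probability=0.0):
        vec[topic] = prob
    return vec                                  # feature vector for the classifier
```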
(3') Use the text-word set-topic mixture probability distribution model of each word set of the target text obtained in step (2') under its corresponding topic number as input to the corresponding text classifier obtained in step (4) and perform class prediction, so as to obtain the class prediction result of each text classifier.
(4') Obtain the final classification result of the target text according to the class prediction results of the text classifiers obtained in step (3'), combined with the weight values pre-assigned to the text classifiers.
Specifically, if one text classifier is obtained in step (4) above, the class prediction result obtained in step (3') is the final classification result of the target text.
If three text classifiers are obtained in step (4) above (for example, a noun classifier, a verb classifier, and an adjective classifier), then, if the class prediction results of the three text classifiers obtained in step (3') are all the same, this class prediction result is taken as the final classification result; if the class predictions of two of the text classifiers are the same, the class prediction result of these two text classifiers is taken as the final classification result; and if the class predictions of the three text classifiers are all different, the class prediction result of the text classifier corresponding to the largest macro F1 value obtained in step (4) is taken as the final classification result.
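A compact sketch of this decision rule, treating a unanimous or two-to-one vote as the majority and falling back to the classifier with the largest macro F1 value otherwise; the input shapes are assumptions.

```python
# Sketch of the step (4') decision rule for one or three classifiers.
from collections import Counter

def final_class(predictions, macro_f1_values):
    # predictions: class labels, one per classifier (length 1 or 3)
    # macro_f1_values: the classifiers' macro F1 values from step (4)
    label, votes = Counter(predictions).most_common(1)[0]
    if votes >= 2 or len(predictions) == 1:
        return label                              # unanimous or majority vote
    best = max(range(len(predictions)), key=lambda i: macro_f1_values[i])
    return predictions[best]                      # fall back to best macro F1
```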
Those skilled in the art will readily understand that the above description covers only preferred embodiments of the present invention and is not intended to limit the present invention; any modification, equivalent replacement and improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (9)

1. A text classification method based on part-of-speech classification, characterized by comprising:
One, a text classifier construction process, which specifically includes the following steps:
(1) obtaining a training text set and a test text set from the network, and preprocessing them so as to obtain multiple word sets for each text in the training text set and the test text set;
(2) using the multiple word sets of each text obtained in step (1) as input to train the topic model LDA, so as to obtain the text-word set-topic mixture probability distribution model of each text under different topic numbers;
(3) using the SVM-train function to perform classifier training on the multiple text-word set-topic mixture probability distribution models of each word set of the training text set under different topic numbers, so as to obtain multiple trained classifiers;
(4) using the multiple text-word set-topic mixture probability distribution models of each class of text in the test text set under different topic numbers as input to the classifiers trained in step (3) and performing SVM class prediction; obtaining, according to the SVM class prediction results and the true class of each class of text in the test text set, the macro F1 value of class prediction of each word set of the test text set under different topic numbers; selecting the largest several values from these macro F1 values, and establishing multiple text classifiers according to the text-word set-topic mixture probability distribution models, corresponding word sets and corresponding topic numbers associated with these values;
Two, a text classification process, which specifically includes the following steps:
(1') obtaining a target text to be classified and preprocessing it so as to obtain multiple word sets of the target text, and keeping, among the obtained word sets, the word sets corresponding to the text classifiers obtained in step (4);
(2') using the multiple word sets of the target text retained in step (1') as input to train the topic model LDA, so as to obtain the text-word set-topic mixture probability distribution model of each word set of the target text under its corresponding topic number;
(3') using the text-word set-topic mixture probability distribution model of each word set of the target text obtained in step (2') under its corresponding topic number as input to the corresponding text classifier obtained in step (4) and performing class prediction, so as to obtain the class prediction result of each text classifier;
(4') obtaining the final classification result of the target text according to the class prediction results of the text classifiers obtained in step (3'), combined with the weight values pre-assigned to the text classifiers.
2. The text classification method based on part-of-speech classification according to claim 1, characterized in that step (1) specifically includes the following sub-steps:
(1-1) performing word segmentation and part-of-speech tagging on each text in the training text set and the test text set, so as to obtain a mapping table between segmentation results and parts of speech;
(1-2) removing stop words from the mapping table between segmentation results and parts of speech obtained in step (1-1), so as to obtain an updated mapping table;
(1-3) dividing the updated mapping table obtained in step (1-2) according to part of speech and discarding the part-of-speech tags, so as to obtain multiple word sets.
3. The text classification method based on part-of-speech classification according to claim 2, characterized in that the multiple word sets obtained in step (1-3) include a noun word set, a verb word set, a noun-verb combination word set, an other-words set, and an all-words set.
4. The text classification method based on part-of-speech classification according to claim 1, characterized in that the method used to train the LDA model is the Gibbs sampling algorithm; the input of the algorithm is the multiple word sets of each text in the training text set together with the hyperparameters, and the output is the probability distribution of each word set of each text under different topic numbers.
5. The text classification method based on part-of-speech classification according to claim 1, characterized in that the function used for SVM class prediction is the svm-predict function based on the LIBSVM toolkit, and the SVM class prediction algorithm selected is the one-versus-one SVM multi-class classification method.
6. The text classification method based on part-of-speech classification according to claim 1, characterized in that the macro F1 value of class prediction of each word set of the test text set under different topic numbers is obtained with the following calculation formula:
MacroF1 = (1/n) * Σ_{i=1}^{n} F1_i
where n indicates the total number of text classes, and F1_i indicates the F1 value of the i-th class, i = 1, ..., n.
7. The text classification method based on part-of-speech classification according to claim 6, characterized in that the calculation formula of the F1 value of the i-th class is as follows:
F1_i = 2 * P_i * R_i / (P_i + R_i)
where P_i is the precision and R_i is the recall;
the calculation formula of the precision P_i is as follows:
P_i = a_i / (a_i + b_i)
the calculation formula of the recall R_i is as follows:
R_i = a_i / (a_i + c_i)
where a_i indicates the number of texts whose SVM-predicted class is C (C denoting the i-th class) and whose true class is also C; b_i indicates the number of texts whose SVM-predicted class is C but whose true class is not C; and c_i indicates the number of texts whose SVM-predicted class is not C but whose true class is C.
8. The text classification method based on part-of-speech classification according to claim 1, characterized in that step (4') is specifically: if one text classifier is obtained in step (4), the class prediction result obtained in step (3') is the final classification result of the target text; if three text classifiers are obtained in step (4), then, if the class prediction results of the three text classifiers obtained in step (3') are all the same, this class prediction result is taken as the final classification result; if the class predictions of two of the text classifiers are the same, the class prediction result of these two text classifiers is taken as the final classification result; and if the class predictions of the three text classifiers are all different, the class prediction result of the text classifier corresponding to the largest macro F1 value obtained in step (4) is taken as the final classification result.
9. A text classification system based on part-of-speech classification, characterized by comprising:
a text classifier construction module, which specifically includes:
a first submodule for obtaining a training text set and a test text set from the network and preprocessing them, so as to obtain multiple word sets for each text in the training text set and the test text set;
a second submodule for using the multiple word sets of each text obtained by the first submodule as input to train the topic model LDA, so as to obtain the text-word set-topic mixture probability distribution model of each text under different topic numbers;
a third submodule for using the SVM-train function to perform classifier training on the multiple text-word set-topic mixture probability distribution models of each word set of the training text set under different topic numbers, so as to obtain multiple trained classifiers;
a fourth submodule for using the multiple text-word set-topic mixture probability distribution models of each class of text in the test text set under different topic numbers as input to the classifiers trained in the third submodule and performing SVM class prediction; obtaining, according to the SVM class prediction results and the true class of each class of text in the test text set, the macro F1 value of class prediction of each word set of the test text set under different topic numbers; selecting the largest several values from these macro F1 values; and establishing multiple text classifiers according to the text-word set-topic mixture probability distribution models, corresponding word sets and corresponding topic numbers associated with these values;
a text classification module, which specifically includes:
a fifth submodule for obtaining a target text to be classified and preprocessing it, so as to obtain multiple word sets of the target text, and keeping, among the obtained word sets, the word sets corresponding to the text classifiers obtained by the fourth submodule;
a sixth submodule for using the multiple word sets of the target text retained by the fifth submodule as input to train the topic model LDA, so as to obtain the text-word set-topic mixture probability distribution model of each word set of the target text under its corresponding topic number;
a seventh submodule for using the text-word set-topic mixture probability distribution model of each word set of the target text obtained by the sixth submodule under its corresponding topic number as input to the corresponding text classifier obtained by the fourth submodule and performing class prediction, so as to obtain the class prediction result of each text classifier;
an eighth submodule for obtaining the final classification result of the target text according to the class prediction results of the text classifiers obtained by the seventh submodule, combined with the weight values pre-assigned to the text classifiers.
CN201810551315.3A 2018-05-31 2018-05-31 Text classification method and system based on part-of-speech classification Active CN108763539B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810551315.3A CN108763539B (en) 2018-05-31 2018-05-31 Text classification method and system based on part-of-speech classification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810551315.3A CN108763539B (en) 2018-05-31 2018-05-31 Text classification method and system based on part-of-speech classification

Publications (2)

Publication Number Publication Date
CN108763539A true CN108763539A (en) 2018-11-06
CN108763539B CN108763539B (en) 2020-11-10

Family

ID=64001297

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810551315.3A Active CN108763539B (en) 2018-05-31 2018-05-31 Text classification method and system based on part-of-speech classification

Country Status (1)

Country Link
CN (1) CN108763539B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110032639A * 2018-12-27 2019-07-19 中国银联股份有限公司 Method, apparatus and storage medium for matching semantic text data with tags
CN110413773A (en) * 2019-06-20 2019-11-05 平安科技(深圳)有限公司 Intelligent text classification method, device and computer readable storage medium
CN111090746A (en) * 2019-11-29 2020-05-01 北京明略软件系统有限公司 Method for determining optimal number of subjects, and method and device for training emotion classifier
CN111723206A (en) * 2020-06-19 2020-09-29 北京明略软件系统有限公司 Text classification method and device, computer equipment and storage medium
CN112184133A (en) * 2019-07-02 2021-01-05 黎嘉明 Artificial intelligence-based government office system preset approval and division method
CN113204489A (en) * 2021-05-28 2021-08-03 中国工商银行股份有限公司 Test problem processing method, device and equipment
CN113761911A (en) * 2021-03-17 2021-12-07 中科天玑数据科技股份有限公司 Domain text labeling method based on weak supervision
CN113204489B (en) * 2021-05-28 2024-04-30 中国工商银行股份有限公司 Test problem processing method, device and equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2016051220A (en) * 2014-08-28 2016-04-11 有限責任監査法人トーマツ Analytical method, analyzer, and analysis program
CN107291795A * 2017-05-03 2017-10-24 华南理工大学 Text classification method combining dynamic word embedding and part-of-speech tagging

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2016051220A (en) * 2014-08-28 2016-04-11 有限責任監査法人トーマツ Analytical method, analyzer, and analysis program
CN107291795A * 2017-05-03 2017-10-24 华南理工大学 Text classification method combining dynamic word embedding and part-of-speech tagging

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHANG Chao: "Research on a Text Classification Method with a Part-of-Speech Tagging LDA Model", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110032639A (en) * 2018-12-27 2019-07-19 中国银联股份有限公司 By the method, apparatus and storage medium of semantic text data and tag match
CN110032639B (en) * 2018-12-27 2023-10-31 中国银联股份有限公司 Method, device and storage medium for matching semantic text data with tag
CN110413773A (en) * 2019-06-20 2019-11-05 平安科技(深圳)有限公司 Intelligent text classification method, device and computer readable storage medium
CN110413773B (en) * 2019-06-20 2023-09-22 平安科技(深圳)有限公司 Intelligent text classification method, device and computer readable storage medium
CN112184133A (en) * 2019-07-02 2021-01-05 黎嘉明 Artificial intelligence-based government office system preset approval and division method
CN111090746A (en) * 2019-11-29 2020-05-01 北京明略软件系统有限公司 Method for determining optimal number of subjects, and method and device for training emotion classifier
CN111090746B (en) * 2019-11-29 2023-04-28 北京明略软件系统有限公司 Method for determining optimal topic quantity, training method and device for emotion classifier
CN111723206A (en) * 2020-06-19 2020-09-29 北京明略软件系统有限公司 Text classification method and device, computer equipment and storage medium
CN111723206B (en) * 2020-06-19 2024-01-19 北京明略软件系统有限公司 Text classification method, apparatus, computer device and storage medium
CN113761911A (en) * 2021-03-17 2021-12-07 中科天玑数据科技股份有限公司 Domain text labeling method based on weak supervision
CN113204489A (en) * 2021-05-28 2021-08-03 中国工商银行股份有限公司 Test problem processing method, device and equipment
CN113204489B (en) * 2021-05-28 2024-04-30 中国工商银行股份有限公司 Test problem processing method, device and equipment

Also Published As

Publication number Publication date
CN108763539B (en) 2020-11-10

Similar Documents

Publication Publication Date Title
CN108763539A Text classification method and system based on part-of-speech classification
CN110287494A Short text similarity matching method based on the deep learning BERT algorithm
KR20190063978A (en) Automatic classification method of unstructured data
Fahad et al. Inflectional review of deep learning on natural language processing
CN108509409A Method for automatically generating semantically similar sentence samples
CN107797987A Mixed-corpus named entity recognition method based on Bi-LSTM-CNN
Tachicart et al. Automatic identification of Moroccan colloquial Arabic
CN109543036A (en) Text Clustering Method based on semantic similarity
Tasharofi et al. Evaluation of statistical part of speech tagging of Persian text
Anjum et al. Exploring humor in natural language processing: a comprehensive review of JOKER tasks at CLEF symposium 2023
CN110728144A (en) Extraction type document automatic summarization method based on context semantic perception
CN110069632B (en) Deep learning text classification method integrating shallow semantic expression vectors
Teng et al. Emotion recognition from text based on the rough set theory and the support vector machines
CN116257616A (en) Entity relation extraction method and system for music field
CN107818078B (en) Semantic association and matching method for Chinese natural language dialogue
KR20200040032A (en) A method ofr classification of korean postings based on bidirectional lstm-attention
JP6586055B2 (en) Deep case analysis device, deep case learning device, deep case estimation device, method, and program
Ayadi et al. Intertextual distance for Arabic texts classification
Povoda et al. Emotion recognition from helpdesk messages
CN113343667A (en) Network character attribute extraction and relation analysis method based on multi-source information
Zheng A Novel Computer-Aided Emotion Recognition of Text Method Based on WordEmbedding and Bi-LSTM
Basumatary et al. Deep Learning Based Bodo Parts of Speech Tagger
CN108573025A Method and device for extracting sentence classification features based on hybrid templates
Smywinski-Pohl et al. Application of Character-Level Language Models in the Domain of Polish Statutory Law.
Tayal et al. DARNN: Discourse Analysis for Natural languages using RNN and LSTM.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant