CN108763539A - Text classification method and system based on part-of-speech classification - Google Patents

Text classification method and system based on part-of-speech classification

Info

Publication number
CN108763539A
Authority
CN
China
Prior art keywords
text
classification
word
training
word set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810551315.3A
Other languages
Chinese (zh)
Other versions
CN108763539B (en)
Inventor
周可
李兴
曾江峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN201810551315.3A priority Critical patent/CN108763539B/en
Publication of CN108763539A publication Critical patent/CN108763539A/en
Application granted granted Critical
Publication of CN108763539B publication Critical patent/CN108763539B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Abstract

The invention discloses a text classification method based on part-of-speech classification, comprising: obtaining a training text set and a test text set from the network and preprocessing them, so as to obtain multiple word sets for each text in the training text set and the test text set; using the multiple word sets of each text as input to train the topic model LDA, so as to obtain the text-word set-topic mixture probability distribution model of each text under different topic numbers; using the SVM-train function to perform classifier training on the multiple text-word set-topic mixture probability distribution models, so as to obtain multiple trained classifiers; and using the multiple text-word set-topic mixture probability distribution models as input to the trained classifiers to perform SVM class prediction. The invention solves the technical problems of existing methods that the dimensionality of the feature words required for model training is high, the classification accuracy is low, and the generalization ability of the classifier is poor.

Description

Text classification method and system based on part-of-speech classification
Technical field
The invention belongs to the technical field of computer deep learning, and more particularly relates to a text classification method and system based on part-of-speech classification.
Background art
With the wide use of all kinds of social and self-media software, the data generated every day by Internet platforms are growing rapidly. These data mainly comprise pictures, voice and text, with text dominating. Classifying these massive data manually, by screening and extraction, takes a great deal of time and effort, and the classification results are often unsatisfactory. Text classification technology came into being precisely to improve the efficiency and accuracy of text classification.
Existing text classification methods mainly use the topic model Latent Dirichlet Allocation (LDA), which can identify the topic information hidden in a large-scale document collection or corpus. It adopts the bag-of-words method, which treats each document as a word-frequency vector, thereby converting the text information into digital information that is easy to model.
However, the above method has defects that cannot be ignored. First, whether in the classifier training stage or in the classification stage, it requires the participation of all word sets, so that the dimensionality of the feature words required for model training is high. In addition, it does not take into account that different parts of speech and part-of-speech combinations contribute differently to text classification, which leads to low classification accuracy and poor generalization ability of the classifier.
Summary of the invention
In view of the above defects or improvement needs of the prior art, the present invention provides a text classification method and system based on part-of-speech classification, aiming to solve the technical problems of existing methods that the dimensionality of the feature words required for model training is high, the classification accuracy is low, and the generalization ability of the classifier is poor.
To achieve the above object, according to one aspect of the present invention, a text classification method based on part-of-speech classification is provided, comprising:
One, a text classifier construction process, which specifically includes the following steps:
(1) obtaining a training text set and a test text set from the network, and preprocessing them so as to obtain multiple word sets for each text in the training text set and the test text set;
(2) using the multiple word sets of each text obtained in step (1) as input to train the topic model LDA, so as to obtain the text-word set-topic mixture probability distribution model of each text under different topic numbers;
(3) using the SVM-train function to perform classifier training on the multiple text-word set-topic mixture probability distribution models of each word set of the training text set under different topic numbers, so as to obtain multiple trained classifiers;
(4) using the multiple text-word set-topic mixture probability distribution models of each class of text in the test text set under different topic numbers as input to the classifiers trained in step (3) and performing SVM class prediction; obtaining, according to the SVM class prediction results and the true class of each class of text in the test text set, the macro F1 value of class prediction of each word set of the test text set under different topic numbers; selecting the largest several values from these macro F1 values, and establishing multiple text classifiers according to the text-word set-topic mixture probability distribution models, corresponding word sets and corresponding topic numbers associated with these values;
Two, a text classification process, which specifically includes the following steps:
(1') obtaining a target text to be classified and preprocessing it so as to obtain multiple word sets of the target text, and keeping, among the obtained word sets, the word sets corresponding to the text classifiers obtained in step (4);
(2') using the multiple word sets of the target text retained in step (1') as input to train the topic model LDA, so as to obtain the text-word set-topic mixture probability distribution model of each word set of the target text under its corresponding topic number;
(3') using the text-word set-topic mixture probability distribution model of each word set of the target text obtained in step (2') under its corresponding topic number as input to the corresponding text classifier obtained in step (4) and performing class prediction, so as to obtain the class prediction result of each text classifier;
(4') obtaining the final classification result of the target text according to the class prediction results of the text classifiers obtained in step (3'), combined with the weight values pre-assigned to the text classifiers.
Preferably, step (1) specifically includes the following sub-steps:
(1-1) performing word segmentation and part-of-speech tagging on each text in the training text set and the test text set, so as to obtain a mapping table between segmentation results and parts of speech;
(1-2) removing stop words from the mapping table between segmentation results and parts of speech obtained in step (1-1), so as to obtain an updated mapping table;
(1-3) dividing the updated mapping table obtained in step (1-2) according to part of speech and discarding the part-of-speech tags, so as to obtain multiple word sets.
Preferably, the multiple word sets obtained in step (1-3) include a noun word set, a verb word set, a noun-verb combination word set, an other-words set, and an all-words set.
Preferably, the method used to train the LDA model is the Gibbs sampling algorithm; the input of the algorithm is the multiple word sets of each text in the training text set together with the hyperparameters, and the output is the probability distribution of each word set of each text under different topic numbers.
Preferably, the function used for SVM class prediction is the svm-predict function based on the LIBSVM toolkit, and the SVM class prediction algorithm selected is the one-versus-one SVM multi-class classification method.
Preferably, the macro F1 value of class prediction of each word set of the test text set under different topic numbers is obtained with the following calculation formula:
MacroF1 = (1/n) * Σ_{i=1}^{n} F1_i
where n indicates the total number of text classes, and F1_i indicates the F1 value of the i-th class, i = 1, ..., n.
Preferably, the calculation formula of the F1 value of the i-th class is as follows:
F1_i = 2 * P_i * R_i / (P_i + R_i)
where P_i is the precision and R_i is the recall.
The calculation formula of the precision P_i is as follows:
P_i = a_i / (a_i + b_i)
The calculation formula of the recall R_i is as follows:
R_i = a_i / (a_i + c_i)
where a_i indicates the number of texts whose SVM-predicted class is C (C denoting the i-th class) and whose true class is also C; b_i indicates the number of texts whose SVM-predicted class is C but whose true class is not C; and c_i indicates the number of texts whose SVM-predicted class is not C but whose true class is C.
Preferably, step (4') is specifically: if one text classifier is obtained in step (4), the class prediction result obtained in step (3') is the final classification result of the target text; if three text classifiers are obtained in step (4), then, if the class prediction results of the three text classifiers obtained in step (3') are all the same, this class prediction result is taken as the final classification result; if the class predictions of two of the text classifiers are the same, the class prediction result of these two text classifiers is taken as the final classification result; and if the class predictions of the three text classifiers are all different, the class prediction result of the text classifier corresponding to the largest macro F1 value obtained in step (4) is taken as the final classification result.
According to another aspect of the present invention, a text classification system based on part-of-speech classification is provided, comprising:
a text classifier construction module, which specifically includes:
a first submodule for obtaining a training text set and a test text set from the network and preprocessing them, so as to obtain multiple word sets for each text in the training text set and the test text set;
a second submodule for using the multiple word sets of each text obtained by the first submodule as input to train the topic model LDA, so as to obtain the text-word set-topic mixture probability distribution model of each text under different topic numbers;
a third submodule for using the SVM-train function to perform classifier training on the multiple text-word set-topic mixture probability distribution models of each word set of the training text set under different topic numbers, so as to obtain multiple trained classifiers;
a fourth submodule for using the multiple text-word set-topic mixture probability distribution models of each class of text in the test text set under different topic numbers as input to the classifiers trained in the third submodule and performing SVM class prediction; obtaining, according to the SVM class prediction results and the true class of each class of text in the test text set, the macro F1 value of class prediction of each word set of the test text set under different topic numbers; selecting the largest several values from these macro F1 values; and establishing multiple text classifiers according to the text-word set-topic mixture probability distribution models, corresponding word sets and corresponding topic numbers associated with these values;
a text classification module, which specifically includes:
a fifth submodule for obtaining a target text to be classified and preprocessing it, so as to obtain multiple word sets of the target text, and keeping, among the obtained word sets, the word sets corresponding to the text classifiers obtained by the fourth submodule;
a sixth submodule for using the multiple word sets of the target text retained by the fifth submodule as input to train the topic model LDA, so as to obtain the text-word set-topic mixture probability distribution model of each word set of the target text under its corresponding topic number;
a seventh submodule for using the text-word set-topic mixture probability distribution model of each word set of the target text obtained by the sixth submodule under its corresponding topic number as input to the corresponding text classifier obtained by the fourth submodule and performing class prediction, so as to obtain the class prediction result of each text classifier;
an eighth submodule for obtaining the final classification result of the target text according to the class prediction results of the text classifiers obtained by the seventh submodule, combined with the weight values pre-assigned to the text classifiers.
In general, compared with the prior art, the above technical solutions conceived by the present invention can achieve the following beneficial effects:
1. The present invention fully takes into account that words of different parts of speech contribute differently to semantic expression: it divides the training text set into different word sets according to part of speech and part-of-speech combination, and further prunes the preprocessed word sets according to the word sets corresponding to the text classifiers constructed in step (4), thereby achieving the purpose of dimensionality reduction;
2. When making a decision on the classification of an unknown text, the present invention considers that different word sets contribute differently to semantic expression, and that the contribution of certain word sets to semantic expression even exceeds that of the all-words set; it therefore uses step (4) to select the word sets corresponding to the larger macro F1 values, and can thus improve the accuracy of text classification and the generalization ability of the text classifier.
Description of the drawings
Fig. 1 is a flow chart of the text classification method based on part-of-speech classification according to the present invention.
Detailed description of the embodiments
In order to make the objectives, technical solutions and advantages of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present invention and are not intended to limit it. In addition, the technical features involved in the various embodiments of the present invention described below can be combined with each other as long as they do not conflict with each other.
As shown in Fig. 1, the text classification method based on part-of-speech classification of the present invention includes:
One, a text classifier construction process, which specifically includes the following steps:
(1) Obtain a training text set and a test text set from the network, and preprocess them so as to obtain multiple word sets for each text in the training text set and the test text set.
The test text set is a set of texts of known classes containing at least 100 texts; it is used in the subsequent steps to screen text classifiers.
This step (1) specifically includes the following sub-steps:
(1-1) Perform word segmentation and part-of-speech tagging on each text in the training text set and the test text set, so as to obtain a mapping table between segmentation results and parts of speech.
Specifically, the parts of speech are noun, verb, adjective, adverb, etc.
For example, if a text in the text set contains the sentence "The engineer writes a confession letter.", then after this step the mapping table obtained is:
engineer: noun
write: verb
confession letter: noun
The detailed process of word segmentation is as follows: for alphabetic languages, segmentation mainly divides a sentence into single words according to the spaces; for Chinese, segmentation cuts a sequence of Chinese characters into individual words.
More specifically, for the Chinese text classification method based on part-of-speech classification in this step, the Chinese word segmentation and part-of-speech tagging tool used is the Chinese lexical analysis system NLPIR developed by the Institute of Computing Technology, Chinese Academy of Sciences.
(1-2) Remove stop words from the mapping table between segmentation results and parts of speech obtained in step (1-1), so as to obtain an updated mapping table.
Specifically, stop words include garbled characters, English characters, single digits, mathematical symbols, punctuation marks, and high-frequency single characters.
For the example in step (1-1), the result obtained after this step is:
engineer: noun
write: verb
confession letter: noun
(1-3) Divide the updated mapping table obtained in step (1-2) according to part of speech and discard the part-of-speech tags, so as to obtain multiple word sets.
Specifically, the multiple word sets obtained include a noun word set, a verb word set, a noun-verb combination word set, an other-words set, and an all-words set.
In the noun word set, every word of each text is a noun; correspondingly, the verb word set contains all the verbs; the noun-verb combination word set contains both the nouns and the verbs; the other-words set contains the adjectives and adverbs; and the all-words set contains the nouns, verbs, adjectives and adverbs.
For the example in step (1-1) above, the noun word set obtained in this step is:
engineer
confession letter
The verb word set obtained is:
write
The noun-verb combination word set obtained is:
engineer
write
confession letter
The other-words set is empty, and the all-words set is:
engineer
write
confession letter
The division process of this step is specifically: construct regular expressions capable of matching the words of different parts of speech in the text, and directly divide all the words of the original text according to the part-of-speech tagging results. A minimal sketch of sub-steps (1-1) to (1-3) follows.
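The sketch below illustrates these three sub-steps in Python. It uses the open-source jieba library's posseg module as a stand-in for NLPIR, an illustrative example sentence, and a placeholder stop-word list; the POS tag prefixes 'n', 'v', 'a' and 'd' and the single-character filter are assumptions, not the patent's exact rules.

```python
# Illustrative sketch of sub-steps (1-1)-(1-3); jieba stands in for NLPIR.
import jieba.posseg as pseg

STOP_WORDS = {"的", "了", "是"}  # placeholder high-frequency stop words

def build_word_sets(text):
    # (1-1) segment and POS-tag, keeping a word -> part-of-speech mapping
    pairs = [(p.word, p.flag) for p in pseg.cut(text)]
    # (1-2) reject stop words, digits and single characters (illustrative rules)
    pairs = [(w, t) for w, t in pairs
             if w not in STOP_WORDS and len(w) > 1 and not w.isdigit()]
    # (1-3) divide by part of speech and discard the tags themselves
    nouns = [w for w, t in pairs if t.startswith("n")]
    verbs = [w for w, t in pairs if t.startswith("v")]
    others = [w for w, t in pairs if t.startswith(("a", "d"))]
    return {"noun": nouns, "verb": verbs, "noun_verb": nouns + verbs,
            "other": others, "all": nouns + verbs + others}

print(build_word_sets("工程师编写文档。"))  # illustrative sentence, not the patent's example
```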
(2) Use the multiple word sets of each text obtained in step (1) (the noun word set, verb word set, noun-verb combination word set, other-words set, and all-words set) as input to train the topic model Latent Dirichlet Allocation (LDA), so as to obtain the text-word set-topic mixture probability distribution model of each text under different topic numbers.
Specifically, the method used to train the LDA model is the Gibbs sampling algorithm; the input of the algorithm is the multiple word sets of each text in the training text set together with the hyperparameters, and the output is the topic feature (i.e., the probability distribution) of each word set of each text under different topic numbers.
Specifically, the LDA model mainly has the hyperparameters α, β, K and iter_number, where α is the prior parameter of the text-topic distribution, β is the prior parameter of the topic-word distribution, K is the manually set topic number, whose value range is [10, 150] with a step size of 10, and iter_number is the number of Gibbs sampling iterations, which is greater than 1000.
A large number of experiments show that with α = 50/K and β = 0.01, when the number of Gibbs sampling iterations is greater than or equal to 1000, the topic distribution of the text set tends to be stable. In order to give the Markov chain a better convergence effect, the present invention takes the number of iterations as 1500. The topic number K is a manually set value that needs to be determined according to the experimental results.
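As an illustration of this training step, the sketch below sweeps K from 10 to 150 in steps of 10 with α = 50/K, β = 0.01 (gensim's eta) and 1500 iterations. It uses gensim's LdaModel, which fits LDA by variational inference rather than the Gibbs sampling described here, so it is a substitute implementation; all function and variable names are illustrative.

```python
# Sketch of LDA training over the topic-number grid K = 10, 20, ..., 150.
from gensim import corpora
from gensim.models import LdaModel

def train_lda_grid(tokenized_docs, topic_counts=range(10, 151, 10)):
    dictionary = corpora.Dictionary(tokenized_docs)           # word <-> id map
    corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
    models = {}
    for k in topic_counts:
        models[k] = LdaModel(corpus=corpus, id2word=dictionary,
                             num_topics=k,
                             alpha=50.0 / k, eta=0.01,        # priors from above
                             iterations=1500, random_state=0)
    return dictionary, corpus, models
```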
(3) Use the SVM-train function to perform classifier training on the multiple text-word set-topic mixture probability distribution models of each word set of the training text set under different topic numbers, so as to obtain multiple trained classifiers.
(4) Use the multiple text-word set-topic mixture probability distribution models of each class of text in the test text set under different topic numbers as input to the classifiers trained in step (3) and perform SVM class prediction; according to the SVM class prediction results and the true class of each class of text in the test text set, obtain the macro F1 value of class prediction of each word set of the test text set under different topic numbers; select the largest several values from these macro F1 values (in this embodiment, one or three values are selected), and establish multiple text classifiers according to the text-word set-topic mixture probability distribution models, corresponding word sets and corresponding topic numbers associated with these values; the number of text classifiers is the same as the number of values selected.
Specifically, the classes in the training data are, for example, history texts, art texts, military texts, etc.
The function used for SVM class prediction is the svm-predict function based on the LIBSVM toolkit, and the SVM class prediction algorithm selected is the one-versus-one SVM multi-class classification method; this method first selects the data of two classes, takes one of the classes as the positive class and the other as the negative class, and trains a classifier on these two classes of data. A sketch of this training step is given below.
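The sketch below illustrates training such a classifier on the document-topic mixtures. It uses scikit-learn's SVC, whose multi-class strategy is the same one-versus-one scheme, in place of LIBSVM's svm-train; the feature-extraction helper assumes a trained gensim model from the previous sketch, and all names are illustrative.

```python
# Sketch: turn document-topic mixtures into a feature matrix and train an SVM.
import numpy as np
from sklearn.svm import SVC

def doc_topic_matrix(lda, corpus, k):
    # One row per text, one column per topic: the text-topic mixture.
    X = np.zeros((len(corpus), k))
    for i, bow in enumerate(corpus):
        for topic, prob in lda.get_document_topics(bow, minimum_probability=0.0):
            X[i, topic] = prob
    return X

def train_svm(lda, corpus, labels, k):
    X = doc_topic_matrix(lda, corpus, k)
    clf = SVC(kernel="rbf", decision_function_shape="ovo")  # one-versus-one
    clf.fit(X, labels)
    return clf
```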
The macro F1 value of class prediction of each word set of the test text set under different topic numbers is obtained with the following calculation formula:
MacroF1 = (1/n) * Σ_{i=1}^{n} F1_i
where n indicates the total number of text classes, and F1_i indicates the F1 value of the i-th class, i = 1, ..., n.
The calculation formula of the F1 value of the i-th class is as follows:
F1_i = 2 * P_i * R_i / (P_i + R_i)
where P_i is the precision and R_i is the recall.
The calculation formula of the precision P_i is as follows:
P_i = a_i / (a_i + b_i)
The calculation formula of the recall R_i is as follows:
R_i = a_i / (a_i + c_i)
where a_i indicates the number of texts whose SVM-predicted class is C (C denoting the i-th class) and whose true class is also C; b_i indicates the number of texts whose SVM-predicted class is C but whose true class is not C; and c_i indicates the number of texts whose SVM-predicted class is not C but whose true class is C.
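These definitions translate directly into code; the sketch below is one such direct translation, assuming parallel lists of true and predicted labels.

```python
# Sketch of the macro F1 computation from the definitions of a_i, b_i, c_i.
def macro_f1(y_true, y_pred, classes):
    f1_values = []
    for c in classes:
        a = sum(1 for t, p in zip(y_true, y_pred) if p == c and t == c)
        b = sum(1 for t, p in zip(y_true, y_pred) if p == c and t != c)
        c_ = sum(1 for t, p in zip(y_true, y_pred) if p != c and t == c)
        precision = a / (a + b) if a + b else 0.0
        recall = a / (a + c_) if a + c_ else 0.0
        f1_values.append(2 * precision * recall / (precision + recall)
                         if precision + recall else 0.0)
    return sum(f1_values) / len(f1_values)
```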
Two, a text classification process, which specifically includes the following steps:
(1') Obtain a target text to be classified and preprocess it so as to obtain multiple word sets of the target text, and keep, among the obtained word sets, the word sets corresponding to the text classifiers obtained in step (4).
Specifically, the preprocessing in this step is essentially the same as in step (1) above; the only difference is that, on the basis of the processing result of step (1-3), only the word sets corresponding to the text classifiers obtained in step (4) above are retained.
(2') Use the multiple word sets of the target text retained in step (1') as input to train the topic model Latent Dirichlet Allocation (LDA), so as to obtain the text-word set-topic mixture probability distribution model of each word set of the target text under its corresponding topic number.
Specifically, the method used to train the LDA model is the Gibbs sampling algorithm; the input of the algorithm is the multiple word sets of the target text together with the hyperparameters, and the output is the topic feature (i.e., the probability distribution) of each word set of the target text under different topic numbers.
Specifically, the LDA model mainly has the hyperparameters α, β, K and iter_number, where α is the prior parameter of the text-topic distribution, β is the prior parameter of the topic-word distribution, K is the manually set topic number, whose value range is [10, 150] with a step size of 10, and iter_number is the number of Gibbs sampling iterations, which is greater than 1000.
A large number of experiments show that with α = 50/K and β = 0.01, when the number of Gibbs sampling iterations is greater than or equal to 1000, the topic distribution of the text set tends to be stable. In order to give the Markov chain a better convergence effect, the present invention takes the number of iterations as 1500. The topic number K is a manually set value that needs to be determined according to the experimental results.
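In the gensim setting of the earlier sketches, step (2') can be approximated by folding the target text into the already trained model with get_document_topics instead of re-running the training described above; this substitution, and all names below, are assumptions.

```python
# Sketch: topic mixture of a new (target) text under a trained gensim model.
import numpy as np

def target_topic_vector(lda, dictionary, tokens, k):
    bow = dictionary.doc2bow(tokens)            # tokens: one word set of the text
    vec = np.zeros(k)
    for topic, prob in lda.get_document_topics(bow, minimum_probability=0.0):
        vec[topic] = prob
    return vec                                  # feature vector for the classifier
```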
(3') Use the text-word set-topic mixture probability distribution model of each word set of the target text obtained in step (2') under its corresponding topic number as input to the corresponding text classifier obtained in step (4) and perform class prediction, so as to obtain the class prediction result of each text classifier.
(4') Obtain the final classification result of the target text according to the class prediction results of the text classifiers obtained in step (3'), combined with the weight values pre-assigned to the text classifiers.
Specifically, if one text classifier is obtained in step (4) above, the class prediction result obtained in step (3') is the final classification result of the target text.
If three text classifiers are obtained in step (4) above (for example, a noun classifier, a verb classifier, and an adjective classifier), then, if the class prediction results of the three text classifiers obtained in step (3') are all the same, this class prediction result is taken as the final classification result; if the class predictions of two of the text classifiers are the same, the class prediction result of these two text classifiers is taken as the final classification result; and if the class predictions of the three text classifiers are all different, the class prediction result of the text classifier corresponding to the largest macro F1 value obtained in step (4) is taken as the final classification result.
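A compact sketch of this decision rule, treating a unanimous or two-to-one vote as the majority and falling back to the classifier with the largest macro F1 value otherwise; the input shapes are assumptions.

```python
# Sketch of the step (4') decision rule for one or three classifiers.
from collections import Counter

def final_class(predictions, macro_f1_values):
    # predictions: class labels, one per classifier (length 1 or 3)
    # macro_f1_values: the classifiers' macro F1 values from step (4)
    label, votes = Counter(predictions).most_common(1)[0]
    if votes >= 2 or len(predictions) == 1:
        return label                              # unanimous or majority vote
    best = max(range(len(predictions)), key=lambda i: macro_f1_values[i])
    return predictions[best]                      # fall back to best macro F1
```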
Those skilled in the art will readily understand that the above description covers only preferred embodiments of the present invention and is not intended to limit the present invention; any modification, equivalent replacement and improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (9)

1. A text classification method based on part-of-speech classification, characterized by comprising:
One, a text classifier construction process, which specifically includes the following steps:
(1) obtaining a training text set and a test text set from the network, and preprocessing them so as to obtain multiple word sets for each text in the training text set and the test text set;
(2) using the multiple word sets of each text obtained in step (1) as input to train the topic model LDA, so as to obtain the text-word set-topic mixture probability distribution model of each text under different topic numbers;
(3) using the SVM-train function to perform classifier training on the multiple text-word set-topic mixture probability distribution models of each word set of the training text set under different topic numbers, so as to obtain multiple trained classifiers;
(4) using the multiple text-word set-topic mixture probability distribution models of each class of text in the test text set under different topic numbers as input to the classifiers trained in step (3) and performing SVM class prediction; obtaining, according to the SVM class prediction results and the true class of each class of text in the test text set, the macro F1 value of class prediction of each word set of the test text set under different topic numbers; selecting the largest several values from these macro F1 values, and establishing multiple text classifiers according to the text-word set-topic mixture probability distribution models, corresponding word sets and corresponding topic numbers associated with these values;
Two, a text classification process, which specifically includes the following steps:
(1') obtaining a target text to be classified and preprocessing it so as to obtain multiple word sets of the target text, and keeping, among the obtained word sets, the word sets corresponding to the text classifiers obtained in step (4);
(2') using the multiple word sets of the target text retained in step (1') as input to train the topic model LDA, so as to obtain the text-word set-topic mixture probability distribution model of each word set of the target text under its corresponding topic number;
(3') using the text-word set-topic mixture probability distribution model of each word set of the target text obtained in step (2') under its corresponding topic number as input to the corresponding text classifier obtained in step (4) and performing class prediction, so as to obtain the class prediction result of each text classifier;
(4') obtaining the final classification result of the target text according to the class prediction results of the text classifiers obtained in step (3'), combined with the weight values pre-assigned to the text classifiers.
2. The text classification method based on part-of-speech classification according to claim 1, characterized in that step (1) specifically includes the following sub-steps:
(1-1) performing word segmentation and part-of-speech tagging on each text in the training text set and the test text set, so as to obtain a mapping table between segmentation results and parts of speech;
(1-2) removing stop words from the mapping table between segmentation results and parts of speech obtained in step (1-1), so as to obtain an updated mapping table;
(1-3) dividing the updated mapping table obtained in step (1-2) according to part of speech and discarding the part-of-speech tags, so as to obtain multiple word sets.
3. The text classification method based on part-of-speech classification according to claim 2, characterized in that the multiple word sets obtained in step (1-3) include a noun word set, a verb word set, a noun-verb combination word set, an other-words set, and an all-words set.
4. The text classification method based on part-of-speech classification according to claim 1, characterized in that the method used to train the LDA model is the Gibbs sampling algorithm; the input of the algorithm is the multiple word sets of each text in the training text set together with the hyperparameters, and the output is the probability distribution of each word set of each text under different topic numbers.
5. The text classification method based on part-of-speech classification according to claim 1, characterized in that the function used for SVM class prediction is the svm-predict function based on the LIBSVM toolkit, and the SVM class prediction algorithm selected is the one-versus-one SVM multi-class classification method.
6. The text classification method based on part-of-speech classification according to claim 1, characterized in that the macro F1 value of class prediction of each word set of the test text set under different topic numbers is obtained with the following calculation formula:
MacroF1 = (1/n) * Σ_{i=1}^{n} F1_i
where n indicates the total number of text classes, and F1_i indicates the F1 value of the i-th class, i = 1, ..., n.
7. The text classification method based on part-of-speech classification according to claim 6, characterized in that the calculation formula of the F1 value of the i-th class is as follows:
F1_i = 2 * P_i * R_i / (P_i + R_i)
where P_i is the precision and R_i is the recall;
the calculation formula of the precision P_i is as follows:
P_i = a_i / (a_i + b_i)
the calculation formula of the recall R_i is as follows:
R_i = a_i / (a_i + c_i)
where a_i indicates the number of texts whose SVM-predicted class is C (C denoting the i-th class) and whose true class is also C; b_i indicates the number of texts whose SVM-predicted class is C but whose true class is not C; and c_i indicates the number of texts whose SVM-predicted class is not C but whose true class is C.
8. The text classification method based on part-of-speech classification according to claim 1, characterized in that step (4') is specifically: if one text classifier is obtained in step (4), the class prediction result obtained in step (3') is the final classification result of the target text; if three text classifiers are obtained in step (4), then, if the class prediction results of the three text classifiers obtained in step (3') are all the same, this class prediction result is taken as the final classification result; if the class predictions of two of the text classifiers are the same, the class prediction result of these two text classifiers is taken as the final classification result; and if the class predictions of the three text classifiers are all different, the class prediction result of the text classifier corresponding to the largest macro F1 value obtained in step (4) is taken as the final classification result.
9. A text classification system based on part-of-speech classification, characterized by comprising:
a text classifier construction module, which specifically includes:
a first submodule for obtaining a training text set and a test text set from the network and preprocessing them, so as to obtain multiple word sets for each text in the training text set and the test text set;
a second submodule for using the multiple word sets of each text obtained by the first submodule as input to train the topic model LDA, so as to obtain the text-word set-topic mixture probability distribution model of each text under different topic numbers;
a third submodule for using the SVM-train function to perform classifier training on the multiple text-word set-topic mixture probability distribution models of each word set of the training text set under different topic numbers, so as to obtain multiple trained classifiers;
a fourth submodule for using the multiple text-word set-topic mixture probability distribution models of each class of text in the test text set under different topic numbers as input to the classifiers trained in the third submodule and performing SVM class prediction; obtaining, according to the SVM class prediction results and the true class of each class of text in the test text set, the macro F1 value of class prediction of each word set of the test text set under different topic numbers; selecting the largest several values from these macro F1 values; and establishing multiple text classifiers according to the text-word set-topic mixture probability distribution models, corresponding word sets and corresponding topic numbers associated with these values;
a text classification module, which specifically includes:
a fifth submodule for obtaining a target text to be classified and preprocessing it, so as to obtain multiple word sets of the target text, and keeping, among the obtained word sets, the word sets corresponding to the text classifiers obtained by the fourth submodule;
a sixth submodule for using the multiple word sets of the target text retained by the fifth submodule as input to train the topic model LDA, so as to obtain the text-word set-topic mixture probability distribution model of each word set of the target text under its corresponding topic number;
a seventh submodule for using the text-word set-topic mixture probability distribution model of each word set of the target text obtained by the sixth submodule under its corresponding topic number as input to the corresponding text classifier obtained by the fourth submodule and performing class prediction, so as to obtain the class prediction result of each text classifier;
an eighth submodule for obtaining the final classification result of the target text according to the class prediction results of the text classifiers obtained by the seventh submodule, combined with the weight values pre-assigned to the text classifiers.
CN201810551315.3A 2018-05-31 2018-05-31 Text classification method and system based on part-of-speech classification Active CN108763539B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810551315.3A CN108763539B (en) 2018-05-31 2018-05-31 Text classification method and system based on part-of-speech classification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810551315.3A CN108763539B (en) 2018-05-31 2018-05-31 Text classification method and system based on part-of-speech classification

Publications (2)

Publication Number Publication Date
CN108763539A true CN108763539A (en) 2018-11-06
CN108763539B CN108763539B (en) 2020-11-10

Family

ID=64001297

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810551315.3A Active CN108763539B (en) 2018-05-31 2018-05-31 Text classification method and system based on part-of-speech classification

Country Status (1)

Country Link
CN (1) CN108763539B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110032639A * 2018-12-27 2019-07-19 中国银联股份有限公司 Method, apparatus and storage medium for matching semantic text data with tags
CN110413773A (en) * 2019-06-20 2019-11-05 平安科技(深圳)有限公司 Intelligent text classification method, device and computer readable storage medium
CN111090746A (en) * 2019-11-29 2020-05-01 北京明略软件系统有限公司 Method for determining optimal number of subjects, and method and device for training emotion classifier
CN111723206A (en) * 2020-06-19 2020-09-29 北京明略软件系统有限公司 Text classification method and device, computer equipment and storage medium
CN112184133A (en) * 2019-07-02 2021-01-05 黎嘉明 Artificial intelligence-based government office system preset approval and division method
CN113204489A (en) * 2021-05-28 2021-08-03 中国工商银行股份有限公司 Test problem processing method, device and equipment
CN113761911A (en) * 2021-03-17 2021-12-07 中科天玑数据科技股份有限公司 Domain text labeling method based on weak supervision
CN113204489B (en) * 2021-05-28 2024-04-30 中国工商银行股份有限公司 Test problem processing method, device and equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2016051220A (en) * 2014-08-28 2016-04-11 有限責任監査法人トーマツ Analytical method, analyzer, and analysis program
CN107291795A * 2017-05-03 2017-10-24 华南理工大学 Text classification method combining dynamic word embedding and part-of-speech tagging

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2016051220A (en) * 2014-08-28 2016-04-11 有限責任監査法人トーマツ Analytical method, analyzer, and analysis program
CN107291795A * 2017-05-03 2017-10-24 华南理工大学 Text classification method combining dynamic word embedding and part-of-speech tagging

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHANG Chao: "Research on a Text Classification Method with a Part-of-Speech Tagging LDA Model", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110032639A (en) * 2018-12-27 2019-07-19 中国银联股份有限公司 By the method, apparatus and storage medium of semantic text data and tag match
CN110032639B (en) * 2018-12-27 2023-10-31 中国银联股份有限公司 Method, device and storage medium for matching semantic text data with tag
CN110413773A (en) * 2019-06-20 2019-11-05 平安科技(深圳)有限公司 Intelligent text classification method, device and computer readable storage medium
CN110413773B (en) * 2019-06-20 2023-09-22 平安科技(深圳)有限公司 Intelligent text classification method, device and computer readable storage medium
CN112184133A (en) * 2019-07-02 2021-01-05 黎嘉明 Artificial intelligence-based government office system preset approval and division method
CN111090746A (en) * 2019-11-29 2020-05-01 北京明略软件系统有限公司 Method for determining optimal number of subjects, and method and device for training emotion classifier
CN111090746B (en) * 2019-11-29 2023-04-28 北京明略软件系统有限公司 Method for determining optimal topic quantity, training method and device for emotion classifier
CN111723206A (en) * 2020-06-19 2020-09-29 北京明略软件系统有限公司 Text classification method and device, computer equipment and storage medium
CN111723206B (en) * 2020-06-19 2024-01-19 北京明略软件系统有限公司 Text classification method, apparatus, computer device and storage medium
CN113761911A (en) * 2021-03-17 2021-12-07 中科天玑数据科技股份有限公司 Domain text labeling method based on weak supervision
CN113204489A (en) * 2021-05-28 2021-08-03 中国工商银行股份有限公司 Test problem processing method, device and equipment
CN113204489B (en) * 2021-05-28 2024-04-30 中国工商银行股份有限公司 Test problem processing method, device and equipment

Also Published As

Publication number Publication date
CN108763539B (en) 2020-11-10

Similar Documents

Publication Publication Date Title
CN108763539A Text classification method and system based on part-of-speech classification
CN110287494A Short text similarity matching method based on the deep learning BERT algorithm
KR20190063978A (en) Automatic classification method of unstructured data
Fahad et al. Inflectional review of deep learning on natural language processing
CN108509409A Method for automatically generating semantically similar sentence samples
CN107797987A Mixed-corpus named entity recognition method based on Bi-LSTM-CNN
Tachicart et al. Automatic identification of Moroccan colloquial Arabic
CN109543036A (en) Text Clustering Method based on semantic similarity
Tasharofi et al. Evaluation of statistical part of speech tagging of Persian text
Anjum et al. Exploring humor in natural language processing: a comprehensive review of JOKER tasks at CLEF symposium 2023
CN110728144A (en) Extraction type document automatic summarization method based on context semantic perception
CN110069632B (en) Deep learning text classification method integrating shallow semantic expression vectors
Teng et al. Emotion recognition from text based on the rough set theory and the support vector machines
CN116257616A (en) Entity relation extraction method and system for music field
CN107818078B (en) Semantic association and matching method for Chinese natural language dialogue
KR20200040032A (en) A method ofr classification of korean postings based on bidirectional lstm-attention
JP6586055B2 (en) Deep case analysis device, deep case learning device, deep case estimation device, method, and program
Ayadi et al. Intertextual distance for Arabic texts classification
Povoda et al. Emotion recognition from helpdesk messages
CN113343667A (en) Network character attribute extraction and relation analysis method based on multi-source information
Zheng A Novel Computer-Aided Emotion Recognition of Text Method Based on WordEmbedding and Bi-LSTM
Basumatary et al. Deep Learning Based Bodo Parts of Speech Tagger
CN108573025A Method and device for extracting sentence classification features based on hybrid templates
Smywinski-Pohl et al. Application of Character-Level Language Models in the Domain of Polish Statutory Law.
Tayal et al. DARNN: Discourse Analysis for Natural languages using RNN and LSTM.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant