CN108763539A - Text classification method and system based on part-of-speech classification - Google Patents
Text classification method and system based on part-of-speech classification
- Publication number
- CN108763539A (application number CN201810551315.3A)
- Authority
- CN
- China
- Prior art keywords
- text
- classification
- word
- training
- word set
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Abstract
The invention discloses a text classification method based on part-of-speech classification, including: obtaining a training text set and a test text set from the network; preprocessing both sets to obtain multiple word sets for each text in the training and test sets; feeding the multiple word sets of each text as input to a latent Dirichlet allocation (LDA) topic model for training, to obtain a text–word-set–topic mixture probability distribution model for each text under different topic numbers; performing classifier training on the multiple text–word-set–topic mixture probability distribution models with the SVM-train function, to obtain multiple trained classifiers; and using the mixture probability distribution models as input to the trained classifiers to perform SVM category prediction. The present invention solves the technical problems of existing methods: the high dimensionality of the feature words required for model training, low classification accuracy, and poor generalization ability of the classifier.
Description
Technical field
The invention belongs to the technical field of computer deep learning, and more particularly relates to a text classification method and system based on part-of-speech classification.
Background technology
With the wide use of social software and self-media software, the data generated daily by Internet platforms grows rapidly. These data mainly comprise pictures, voice, and text, with text predominating. Classifying these massive data manually, by screening and extracting by hand, takes a great deal of time and energy, and the classification results are often unsatisfactory. Text classification technology emerged precisely to improve the efficiency and accuracy of text classification.
Existing text classification methods mainly use the latent Dirichlet allocation (LDA) topic model, which can identify the topic information hidden in a large-scale document collection or corpus. LDA adopts the bag-of-words approach, which treats each document as a word-frequency vector, thereby converting text information into numeric information convenient for modeling.
However, the above method has defects that cannot be ignored. First, whether in the classifier-training stage or the classification stage, it requires the participation of all word sets, so the dimensionality of the feature words needed for model training is high. In addition, the method does not take into account that different parts of speech and part-of-speech combinations contribute differently to text classification, which leads to low classification accuracy and poor generalization ability of the classifier.
Summary of the invention
In view of the above defects or improvement requirements of the prior art, the present invention provides a text classification method and system based on part-of-speech classification, aiming to solve the technical problems of existing methods: the high dimensionality of the feature words required for model training, low classification accuracy, and poor generalization ability of the classifier.
To achieve the above object, according to one aspect of the present invention, a text classification method based on part-of-speech classification is provided, including:
I. A text-classifier construction process, specifically including the following steps:
(1) Obtain a training text set and a test text set from the network, and preprocess both sets to obtain multiple word sets for each text in the training and test sets;
(2) Feed the multiple word sets of each text obtained in step (1) as input to a latent Dirichlet allocation (LDA) topic model for training, to obtain a text–word-set–topic mixture probability distribution model for each text under different topic numbers;
(3) Use the SVM-train function to perform classifier training on the multiple text–word-set–topic mixture probability distribution models of each word set of the training set under different topic numbers, to obtain multiple trained classifiers;
(4) Use the mixture probability distribution models of the multiple text–word-set–topic models of each class of text in the test set under different topic numbers as input to the classifiers trained in step (3), perform SVM category prediction, obtain the macro F1 value of the category prediction of each word set of the test set under different topic numbers from the SVM prediction results and the true category of each class of test text, choose the largest values among the obtained macro F1 values, and establish multiple text classifiers from the text–word-set–topic mixture probability distribution models, word sets, and topic numbers corresponding to these values;
II. A text classification process, specifically including the following steps:
(1′) Obtain a target text to be classified and preprocess it to obtain multiple word sets of the target text, retaining among them the word sets corresponding to the text classifiers obtained in step (4);
(2′) Feed the retained word sets of the target text from step (1′) as input to the LDA topic model for training, to obtain the text–word-set–topic mixture probability distribution model of each word set of the target text under its corresponding topic number;
(3′) Use each mixture probability distribution model obtained in step (2′) as input to the corresponding text classifier obtained in step (4) to perform category prediction, obtaining a category prediction result from each text classifier;
(4′) Obtain the final classification result of the target text from the category prediction results of step (3′), combined with the weight values pre-assigned to each text classifier.
Preferably, step (1) specifically includes the following sub-steps:
(1-1) Perform word segmentation and part-of-speech tagging on each text in the training and test sets, to obtain a mapping table between segmentation results and parts of speech;
(1-2) Remove stop words from the mapping table obtained in step (1-1), to obtain an updated mapping table;
(1-3) Divide the updated mapping table obtained in step (1-2) according to part of speech and discard the part-of-speech labels, to obtain multiple word sets.
Preferably, the multiple word sets obtained in step (1-3) include a noun word set, a verb word set, a noun–verb combination word set, an other-words word set, and an all-words word set.
Preferably, the LDA model is trained with the Gibbs sampling algorithm. The input of the algorithm is the multiple word sets and the hyperparameters of each text in the training set; the output is the probability distribution of each word set of each text under different topic numbers.
Preferably, the function used for SVM category prediction is the svm-predict function of the LIBSVM toolkit, and the SVM category prediction algorithm is the one-versus-one SVM multi-class classification method.
Preferably, the macro F1 value of the category prediction of each word set of the test set under different topic numbers is obtained with the following calculation formula:
macroF1 = (1/n) * Σ F1_i, summed over i = 1, …, n
where n is the total number of text categories and F1_i is the F1 value of the i-th category.
Preferably, the F1 value of the i-th category is calculated as follows:
F1_i = (2 * P_i * R_i) / (P_i + R_i)
where P_i is the precision and R_i is the recall.
The precision P_i is calculated as follows:
P_i = a_i / (a_i + b_i)
The recall R_i is calculated as follows:
R_i = a_i / (a_i + c_i)
where a_i is the number of texts whose SVM-predicted category is C and whose true category is also C, C denoting some category; b_i is the number of texts whose SVM-predicted category is C but whose true category is not C; and c_i is the number of texts whose SVM-predicted category is not C but whose true category is C.
Preferably, step (4′) is specifically: if one text classifier is obtained in step (4), the category prediction result obtained in step (3′) is the final classification result of the target text; if three text classifiers are obtained in step (4), then when the category predictions of the three classifiers in step (3′) are all the same, that prediction is taken as the final classification result; when the predictions of two of the classifiers are the same, the prediction of those two classifiers is taken as the final classification result; and when the predictions of the three classifiers all differ, the prediction of the classifier corresponding to the largest macro F1 value obtained in step (4) is taken as the final classification result.
According to another aspect of the present invention, a text classification system based on part-of-speech classification is provided, including:
A text-classifier construction module, specifically including:
A first submodule for obtaining a training text set and a test text set from the network and preprocessing both sets, to obtain multiple word sets for each text in the training and test sets;
A second submodule for feeding the multiple word sets of each text obtained by the first submodule as input to an LDA topic model for training, to obtain a text–word-set–topic mixture probability distribution model for each text under different topic numbers;
A third submodule for using the SVM-train function to perform classifier training on the multiple text–word-set–topic mixture probability distribution models of each word set of the training set under different topic numbers, to obtain multiple trained classifiers;
A fourth submodule for using the mixture probability distribution models of the multiple text–word-set–topic models of each class of test text under different topic numbers as input to the classifiers trained by the third submodule, performing SVM category prediction, obtaining the macro F1 value of the category prediction of each word set of the test set under different topic numbers from the SVM prediction results and the true category of each class of test text, choosing the largest values among the obtained macro F1 values, and establishing multiple text classifiers from the corresponding mixture probability distribution models, word sets, and topic numbers;
A text classification module, specifically including:
A fifth submodule for obtaining a target text to be classified and preprocessing it to obtain multiple word sets of the target text, retaining among them the word sets corresponding to the text classifiers obtained by the fourth submodule;
A sixth submodule for feeding the retained word sets of the target text as input to the LDA topic model for training, to obtain the text–word-set–topic mixture probability distribution model of each word set of the target text under its corresponding topic number;
A seventh submodule for using each mixture probability distribution model obtained by the sixth submodule as input to the corresponding text classifier obtained by the fourth submodule to perform category prediction, obtaining a category prediction result from each text classifier;
An eighth submodule for obtaining the final classification result of the target text from the category prediction results of the seventh submodule, combined with the weight values pre-assigned to each text classifier.
In general, compared with the prior art, the above technical scheme contemplated by the present invention can achieve the following beneficial effects:
1. The present invention fully takes into account that words of different parts of speech contribute differently to semantic expression: the training set is divided into different word sets according to parts of speech and part-of-speech combinations, and the preprocessed word sets are further filtered according to the word sets corresponding to the text classifiers constructed in step (4), thereby achieving the goal of dimensionality reduction.
2. When making a decision on the classification of an unknown text, the present invention considers that different word sets contribute differently to semantic expression, and that certain word sets contribute even more than the all-words set; it therefore selects in step (4) the multiple text training sets corresponding to the larger macro F1 values, and can thus improve the accuracy of text classification and the generalization ability of the text classifier.
Description of the drawings
Fig. 1 is a flow chart of the text classification method based on part-of-speech classification of the present invention.
Specific embodiments
In order to make the objects, technical solution, and advantages of the present invention clearer, the present invention is further elaborated below with reference to the accompanying drawings and embodiments. It should be appreciated that the specific embodiments described here are merely illustrative of the present invention and are not intended to limit it. In addition, the technical features involved in the various embodiments of the present invention described below can be combined with each other as long as they do not conflict.
As shown in Fig. 1, the text classification method based on part-of-speech classification of the present invention includes:
I. A text-classifier construction process, specifically including the following steps:
(1) Obtain a training text set and a test text set from the network, and preprocess both sets to obtain multiple word sets for each text in the training and test sets.
The test set is a set of texts of known category comprising at least 100 texts, used to screen the text classifiers in subsequent steps.
This step (1) specifically includes the following sub-steps:
(1-1) Perform word segmentation and part-of-speech tagging on each text in the training and test sets, to obtain a mapping table between segmentation results and parts of speech.
Specifically, the parts of speech are noun, verb, adjective, adverb, and so on.
For example, if a text in the text set contains the sentence "The engineer writes the disclosure document.", the mapping table obtained after this step is:
engineer: noun
writes: verb
disclosure document: noun
The detailed process of word segmentation is: for alphabetic languages, segmentation mainly divides the sentence into single words according to spaces; for Chinese, segmentation cuts a sequence of Chinese characters into single words.
More specifically, for the Chinese text classification method based on part-of-speech classification in this step, the Chinese word segmentation and part-of-speech tagging tool used is the Chinese lexical analysis system NLPIR developed by the Institute of Computing Technology of the Chinese Academy of Sciences.
(1-2) Remove stop words from the mapping table obtained in step (1-1), to obtain an updated mapping table.
Specifically, stop words include garbled characters, English characters, individual digits, mathematical symbols, punctuation marks, and high-frequency single characters.
For the example in step (1-1), the result obtained after this step is:
engineer: noun
writes: verb
disclosure document: noun
(1-3) Divide the updated mapping table obtained in step (1-2) according to part of speech and discard the part-of-speech labels, to obtain multiple word sets.
Specifically, the multiple word sets obtained include a noun word set, a verb word set, a noun–verb combination word set, an other-words word set, and an all-words word set.
In the noun word set, every word of each text is a noun; correspondingly, the verb word set contains all verbs, and the noun–verb combination word set contains both nouns and verbs; the other-words word set contains adjectives and adverbs; and the all-words word set contains nouns, verbs, adjectives, and adverbs.
For the example in step (1-1), the noun word set obtained in this step is:
engineer
disclosure document
The verb word set obtained is:
writes
The noun–verb combination word set obtained is:
engineer
writes
disclosure document
The other-words word set is empty, and the all-words word set is:
engineer
writes
disclosure document
The division process of this step specifically constructs regular expressions that can match words of different parts of speech in the text, and directly divides all the words of the original text according to the part-of-speech tagging results.
(2) Feed the multiple word sets of each text obtained in step (1) (the noun word set, verb word set, noun–verb combination word set, other-words word set, and all-words word set) as input to the latent Dirichlet allocation (LDA) topic model for training, to obtain the text–word-set–topic mixture probability distribution model of each text under different topic numbers.
Specifically, the LDA model is trained with the Gibbs sampling algorithm. The input of the algorithm is the multiple word sets and the hyperparameters of each text in the training set; the output is the topic feature (i.e. probability distribution) of each word set of each text under different topic numbers.
Specifically, the hyperparameters of the LDA model are mainly α, β, K, and iter_number, where α is the prior parameter of the text–topic distribution, β is the prior parameter of the topic–word distribution, K is a manually set topic number with value range [10, 150] and step size 10, and iter_number is the number of Gibbs sampling iterations, which is greater than 1000.
A large number of experiments show that with α = 50/K, β = 0.01, and at least 1000 Gibbs sampling iterations, the topic distribution of the text set tends to be stable. In order to give the Markov chain a better convergence effect, the present invention takes the number of iterations as 1500. The topic number K is a manually set value that needs to be determined from experimental results.
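The Gibbs training loop described above can be sketched in pure Python. This is a minimal collapsed Gibbs sampler under the patent's priors (α = 50/K, β = 0.01); the toy corpus, the tiny K, and the reduced iteration count are illustrative only, not the patent's settings (K in [10, 150] with step 10, 1500 iterations).

```python
import random

def lda_gibbs(docs, K, iters=200, alpha=None, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenized docs.

    Returns the per-document topic distributions theta: the
    "text-word-set-topic" mixture features fed to SVM-train in step (3).
    """
    rng = random.Random(seed)
    alpha = alpha if alpha is not None else 50.0 / K  # the patent's alpha = 50/K
    vocab = sorted({w for d in docs for w in d})
    wid = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    ndk = [[0] * K for _ in docs]          # doc-topic counts
    nkw = [[0] * V for _ in range(K)]      # topic-word counts
    nk = [0] * K                           # topic totals
    z = []                                 # topic assignment of every token
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            t = rng.randrange(K)
            zs.append(t)
            ndk[d][t] += 1; nkw[t][wid[w]] += 1; nk[t] += 1
        z.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                ndk[d][t] -= 1; nkw[t][wid[w]] -= 1; nk[t] -= 1
                # full conditional p(z = k | everything else)
                weights = [(ndk[d][k] + alpha) * (nkw[k][wid[w]] + beta)
                           / (nk[k] + V * beta) for k in range(K)]
                r = rng.random() * sum(weights)
                acc, t = 0.0, K - 1
                for k, wt in enumerate(weights):
                    acc += wt
                    if r < acc:
                        t = k
                        break
                z[d][i] = t
                ndk[d][t] += 1; nkw[t][wid[w]] += 1; nk[t] += 1
    # smoothed document-topic mixture theta (each row sums to 1)
    return [[(ndk[d][k] + alpha) / (len(docs[d]) + K * alpha)
             for k in range(K)] for d in range(len(docs))]

docs = [["engineer", "disclosure", "writes"], ["engineer", "patent", "claim"]]
theta = lda_gibbs(docs, K=2, iters=100)
print([round(sum(row), 6) for row in theta])  # [1.0, 1.0]
```

Running this once per word set and per candidate K yields the family of mixture models evaluated in step (4).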
(3) Use the SVM-train function to perform classifier training on the multiple text–word-set–topic mixture probability distribution models of each word set of the training set under different topic numbers, to obtain multiple trained classifiers.
(4) Use the mixture probability distribution models of the multiple text–word-set–topic models of each class of text in the test set under different topic numbers as input to the classifiers trained in step (3), perform SVM category prediction, obtain the macro F1 value of the category prediction of each word set of the test set under different topic numbers from the SVM prediction results and the true category of each class of test text, and choose the largest values among the obtained macro F1 values (in this embodiment, the number of chosen values is 1 or 3); then establish multiple text classifiers from the text–word-set–topic mixture probability distribution models, word sets, and topic numbers corresponding to these values, the number of text classifiers being the same as the number of values chosen.
Specifically, the categories in the training data are history texts, art texts, military texts, and so on.
The function used for SVM category prediction is the svm-predict function of the LIBSVM toolkit, and the SVM category prediction algorithm is the one-versus-one SVM multi-class classification method: this method first chooses the data of two categories, takes one category as the positive class and the other as the negative class, and then trains a classifier on those two categories' data.
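The one-versus-one scheme described above reduces an n-class problem to n(n-1)/2 binary classifiers whose votes are tallied at prediction time. A sketch of the voting step, with trivial threshold functions standing in for trained LIBSVM models (the thresholds and category names are illustrative assumptions):

```python
from itertools import combinations

def ovo_predict(x, classes, pairwise):
    """One-versus-one multi-class prediction by majority vote.

    pairwise[(a, b)] is a binary decision function returning a or b;
    in the patent these would be LIBSVM models trained on the two
    categories' data.
    """
    votes = {c: 0 for c in classes}
    for a, b in combinations(classes, 2):
        votes[pairwise[(a, b)](x)] += 1
    return max(classes, key=lambda c: votes[c])  # ties break by class order

# Toy stand-in decision functions on a 1-D feature (illustrative only):
pairwise = {
    ("art", "history"): lambda x: "art" if x < 0.5 else "history",
    ("art", "military"): lambda x: "art" if x < 0.4 else "military",
    ("history", "military"): lambda x: "history" if x < 0.8 else "military",
}
print(ovo_predict(0.6, ["art", "history", "military"], pairwise))  # history
```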
The macro F1 value of the category prediction of each word set of the test set under different topic numbers is obtained with the following calculation formula:
macroF1 = (1/n) * Σ F1_i, summed over i = 1, …, n
where n is the total number of text categories and F1_i is the F1 value of the i-th category, calculated as follows:
F1_i = (2 * P_i * R_i) / (P_i + R_i)
where P_i is the precision and R_i is the recall:
P_i = a_i / (a_i + b_i)
R_i = a_i / (a_i + c_i)
where a_i is the number of texts whose SVM-predicted category is C (C denoting some category) and whose true category is also C; b_i is the number of texts whose SVM-predicted category is C but whose true category is not C; and c_i is the number of texts whose SVM-predicted category is not C but whose true category is C.
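The macro F1 computation can be written directly from the definitions of a_i, b_i, and c_i above (the sample labels below are illustrative):

```python
def macro_f1(gold, pred, classes):
    """Macro F1 as defined above: average the per-class F1 over all classes.

    gold, pred: parallel lists of category labels.
    """
    f1s = []
    for c in classes:
        a = sum(1 for g, p in zip(gold, pred) if p == c and g == c)   # a_i
        b = sum(1 for g, p in zip(gold, pred) if p == c and g != c)   # b_i
        c_ = sum(1 for g, p in zip(gold, pred) if p != c and g == c)  # c_i
        prec = a / (a + b) if a + b else 0.0   # P_i, guarding empty classes
        rec = a / (a + c_) if a + c_ else 0.0  # R_i
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(classes)

gold = ["art", "art", "history", "military"]
pred = ["art", "history", "history", "military"]
print(round(macro_f1(gold, pred, ["art", "history", "military"]), 4))  # 0.7778
```

In step (4), this score is computed once per (word set, topic number) pair, and the pairs with the largest scores define the retained classifiers.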
II. The text classification process specifically includes the following steps:
(1′) Obtain a target text to be classified and preprocess it to obtain multiple word sets of the target text, retaining among them the word sets corresponding to the text classifiers obtained in step (4).
Specifically, the preprocessing in this step is essentially the same as in step (1) above; the only difference is that, on the basis of the processing results of step (1-3), only the word sets corresponding to the multiple text classifiers obtained in step (4) are retained.
(2′) Feed the retained word sets of the target text from step (1′) as input to the latent Dirichlet allocation (LDA) topic model for training, to obtain the text–word-set–topic mixture probability distribution model of each word set of the target text under its corresponding topic number.
Specifically, the LDA model is trained with the Gibbs sampling algorithm. The input of the algorithm is the multiple word sets and the hyperparameters of the target text; the output is the topic feature (i.e. probability distribution) of each word set of the target text under different topic numbers.
The hyperparameters and their settings are the same as in step (2): α = 50/K, β = 0.01, the number of Gibbs sampling iterations is taken as 1500, and the topic number K is a manually set value determined from experimental results.
(3′) Use the text–word-set–topic mixture probability distribution model of each word set of the target text obtained in step (2′), under its corresponding topic number, as input to the corresponding text classifier obtained in step (4) to perform category prediction, obtaining a category prediction result from each text classifier.
(4′) Obtain the final classification result of the target text from the category prediction results of step (3′), combined with the weight values pre-assigned to each text classifier.
Specifically, if one text classifier is obtained in step (4), the category prediction result obtained in step (3′) is the final classification result of the target text.
If three text classifiers are obtained in step (4) (for example a noun classifier, a verb classifier, and an adjective classifier), then when the category predictions of the three classifiers obtained in step (3′) are all the same, that prediction is taken as the final classification result; when the predictions of two of the classifiers are the same, the prediction of those two classifiers is taken as the final classification result; and when the predictions of the three classifiers all differ, the prediction of the classifier corresponding to the largest macro F1 value obtained in step (4) is taken as the final classification result.
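The decision rule of step (4′) amounts to a majority vote with a macro-F1 fallback, and can be sketched as follows. The category names and scores are illustrative; using the step-(4) macro F1 values as the "pre-assigned weight values" is an assumption consistent with the tie-break described above.

```python
from collections import Counter

def final_decision(preds, macro_f1s):
    """Decision rule of step (4'): majority vote among classifier
    predictions; on a complete disagreement, fall back to the prediction
    of the classifier with the largest macro F1 from step (4).

    preds: predictions of the classifiers (length 1 or 3 in the patent).
    macro_f1s: macro F1 value of each classifier, parallel to preds.
    """
    if len(preds) == 1:
        return preds[0]
    label, count = Counter(preds).most_common(1)[0]
    if count >= 2:  # all three agree, or two out of three agree
        return label
    # all predictions differ: trust the classifier with the largest macro F1
    best = max(range(len(preds)), key=lambda i: macro_f1s[i])
    return preds[best]

print(final_decision(["art", "art", "history"], [0.81, 0.78, 0.90]))  # art
print(final_decision(["art", "war", "history"], [0.81, 0.78, 0.90]))  # history
```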
It will be readily appreciated by those skilled in the art that the foregoing is merely a description of preferred embodiments of the present invention and is not intended to limit it; any modification, equivalent substitution, and improvement made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.
Claims (9)
1. A text classification method based on part-of-speech classification, characterized by including:
I. A text-classifier construction process, specifically including the following steps:
(1) Obtaining a training text set and a test text set from the network, and preprocessing both sets to obtain multiple word sets for each text in the training and test sets;
(2) Feeding the multiple word sets of each text obtained in step (1) as input to a latent Dirichlet allocation (LDA) topic model for training, to obtain a text–word-set–topic mixture probability distribution model for each text under different topic numbers;
(3) Using the SVM-train function to perform classifier training on the multiple text–word-set–topic mixture probability distribution models of each word set of the training set under different topic numbers, to obtain multiple trained classifiers;
(4) Using the mixture probability distribution models of the multiple text–word-set–topic models of each class of text in the test set under different topic numbers as input to the classifiers trained in step (3), performing SVM category prediction, obtaining the macro F1 value of the category prediction of each word set of the test set under different topic numbers from the SVM prediction results and the true category of each class of test text, choosing the largest values among the obtained macro F1 values, and establishing multiple text classifiers from the text–word-set–topic mixture probability distribution models, word sets, and topic numbers corresponding to these values;
II. A text classification process, specifically including the following steps:
(1′) Obtaining a target text to be classified and preprocessing it to obtain multiple word sets of the target text, retaining among them the word sets corresponding to the text classifiers obtained in step (4);
(2′) Feeding the retained word sets of the target text from step (1′) as input to the LDA topic model for training, to obtain the text–word-set–topic mixture probability distribution model of each word set of the target text under its corresponding topic number;
(3′) Using each mixture probability distribution model obtained in step (2′) as input to the corresponding text classifier obtained in step (4) to perform category prediction, to obtain the category prediction result of each text classifier;
(4′) Obtaining the final classification result of the target text from the category prediction results of step (3′), combined with the weight values pre-assigned to each text classifier.
2. the file classification method according to claim 1 based on parts of speech classification, which is characterized in that step (1) is specifically wrapped
Include following sub-step:
Each text that (1-1) concentrates training text collection and test text carries out participle and part-of-speech tagging respectively, to obtain
Mapping table between word segmentation result and part of speech;
(1-2) removing stop words from the mapping table between words and parts of speech obtained in step (1-1), so as to obtain an updated mapping table;
(1-3) partitioning the updated mapping table obtained in step (1-2) by part of speech and discarding the part-of-speech tags, so as to obtain multiple word sets.
3. The text classification method based on part-of-speech classification according to claim 2, wherein the multiple word sets obtained in step (1-3) include a noun word set, a verb word set, a combined noun-verb word set, an other-words word set, and an all-words word set.
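The partition of claims 2-3 can be sketched as follows. This is an illustrative Python sketch, not the patent's implementation: the POS tags (`'n'`, `'v'`), the stop-word list, and the `build_word_sets` helper are all hypothetical, and a real system would obtain the (word, POS) pairs from a Chinese segmenter and tagger.

```python
# Illustrative sketch of the word-set partition in claims 2-3 (hypothetical
# names/data; a real system would take (word, POS) pairs from a segmenter).

STOP_WORDS = {"the", "a", "of"}  # stand-in stop-word list

def build_word_sets(tagged_words):
    """tagged_words: list of (word, pos) pairs, e.g. pos 'n'=noun, 'v'=verb."""
    # (1-2): reject stop words from the word/POS mapping table
    kept = [(w, p) for (w, p) in tagged_words if w not in STOP_WORDS]
    # (1-3): split by part of speech, then drop the POS tags
    nouns = [w for (w, p) in kept if p == "n"]
    verbs = [w for (w, p) in kept if p == "v"]
    other = [w for (w, p) in kept if p not in ("n", "v")]
    # the five word sets named in claim 3
    return {
        "noun": nouns,
        "verb": verbs,
        "noun+verb": nouns + verbs,
        "other": other,
        "all": nouns + verbs + other,
    }

sets_ = build_word_sets([("market", "n"), ("rise", "v"),
                         ("the", "x"), ("sharply", "d")])
print(sets_["noun+verb"])  # ['market', 'rise']
```

Each of the five word sets is then fed independently through the LDA/SVM pipeline, which is what lets the method compare the discriminative power of different parts of speech.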
4. The text classification method based on part-of-speech classification according to claim 1, wherein the method used to train the LDA model is the Gibbs sampling algorithm; the input of the algorithm is the multiple word sets of each text in the training text set together with the hyperparameters, and the output is the probability distribution of each word set of each text under different topic numbers.
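A minimal collapsed Gibbs sampler for LDA, illustrating the training named in claim 4. This is a didactic sketch under simplifying assumptions (symmetric hyperparameters `alpha` and `beta`, integer word ids instead of strings), not the patent's implementation; a production system would use an optimized LDA library.

```python
import random

def gibbs_lda(docs, n_topics, alpha=0.1, beta=0.01, iters=100, seed=0):
    """Collapsed Gibbs sampler for LDA. docs: list of lists of word ids.
    Returns theta: per-document topic mixture (each row sums to 1)."""
    rng = random.Random(seed)
    V = 1 + max(w for doc in docs for w in doc)        # vocabulary size
    ndk = [[0] * n_topics for _ in docs]               # doc-topic counts
    nkw = [[0] * V for _ in range(n_topics)]           # topic-word counts
    nk = [0] * n_topics                                # topic totals
    z = []                                             # topic of each token
    for d, doc in enumerate(docs):                     # random initialisation
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
            zs.append(k)
        z.append(zs)
    for _ in range(iters):                             # Gibbs sweeps
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                            # remove current token
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # full conditional p(z_i = t | rest), up to a constant
                weights = [(ndk[d][t] + alpha) * (nkw[t][w] + beta)
                           / (nk[t] + V * beta) for t in range(n_topics)]
                r = rng.random() * sum(weights)        # sample a new topic
                k = 0
                while r > weights[k]:
                    r -= weights[k]; k += 1
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return [[(ndk[d][t] + alpha) / (len(doc) + n_topics * alpha)
             for t in range(n_topics)] for d, doc in enumerate(docs)]

theta = gibbs_lda([[0, 0, 1, 1], [2, 2, 3, 3], [0, 1, 0, 2]], n_topics=2)
print([round(p, 2) for p in theta[0]])
```

The returned per-document topic mixtures play the role of the "text-word set-topic" mixture probability distributions that the method feeds to the SVM stage.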
5. The text classification method based on part-of-speech classification according to claim 1, wherein the function used for SVM class prediction is the svm-predict function of the LIBSVM toolkit, and the SVM class prediction algorithm selected is the one-versus-one SVM multiclass classification method.
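The one-versus-one multiclass strategy named in claim 5 reduces an n-class problem to n(n-1)/2 binary classifiers and predicts by majority vote. The sketch below shows only the voting step; `binary_predict` stands in for a trained pairwise SVM and is a hypothetical callable supplied by the caller, not a LIBSVM API.

```python
from collections import Counter
from itertools import combinations

def one_vs_one_predict(x, classes, binary_predict):
    """Predict the class of sample x by majority vote over all pairwise
    classifiers; binary_predict(a, b, x) must return either a or b."""
    votes = Counter()
    for a, b in combinations(classes, 2):   # n*(n-1)/2 binary classifiers
        votes[binary_predict(a, b, x)] += 1
    return votes.most_common(1)[0][0]       # class with most pairwise wins

# toy pairwise rule standing in for trained binary SVMs: choose the class
# label numerically closer to the (one-dimensional) feature value
pairwise = lambda a, b, x: a if abs(x - a) <= abs(x - b) else b
print(one_vs_one_predict(1.2, [0, 1, 2], pairwise))  # 1
```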
6. The text classification method based on part-of-speech classification according to claim 1, wherein the macro F1 value of the class prediction of each word set in the test text set under different topic numbers is obtained with the following formula:

macroF1 = (1/n) * (F1_1 + F1_2 + ... + F1_n)

where n denotes the total number of text classes and F1_i denotes the F1 value of the i-th class, i = 1, ..., n.
7. The text classification method based on part-of-speech classification according to claim 6, wherein the F1 value of the i-th class is calculated as follows:

F1_i = 2 * P_i * R_i / (P_i + R_i)

where P_i is the precision and R_i is the recall.

The precision P_i is calculated as follows:

P_i = a_i / (a_i + b_i)

The recall R_i is calculated as follows:

R_i = a_i / (a_i + c_i)

where a_i denotes the number of texts whose SVM-predicted class is C and whose true class is also C, C denoting a given class; b_i denotes the number of texts whose SVM-predicted class is C but whose true class is not C; and c_i denotes the number of texts whose SVM-predicted class is not C but whose true class is C.
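The formulas of claims 6-7 combine into a short computation. A sketch, where `a`, `b`, and `c` hold the per-class counts defined above (true positives, false positives, and false negatives); the `macro_f1` helper name is illustrative:

```python
def macro_f1(a, b, c):
    """Macro F1 per claims 6-7. For each class i: a[i] = texts predicted C
    with true class C, b[i] = predicted C but true class not C, c[i] =
    predicted not C but true class C."""
    f1s = []
    for ai, bi, ci in zip(a, b, c):
        p = ai / (ai + bi) if ai + bi else 0.0       # precision P_i
        r = ai / (ai + ci) if ai + ci else 0.0       # recall R_i
        f1s.append(2 * p * r / (p + r) if p + r else 0.0)
    return sum(f1s) / len(f1s)                       # average over n classes

print(macro_f1([8, 5], [2, 0], [0, 5]))  # ≈ 0.7778
```

Because macro F1 averages per-class F1 values rather than pooling counts, it weights rare and frequent classes equally, which is why the method uses it to rank word set / topic number combinations.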
8. The text classification method based on part-of-speech classification according to claim 1, wherein step (4') is specifically: if one text classifier was established in step (4), the class prediction result obtained in step (3') is the final classification result of the target text; if three text classifiers were established in step (4), then if the class prediction results of the three text classifiers obtained in step (3') are all the same, that prediction result is taken as the final classification result; if the class predictions of two of the text classifiers are the same, the shared class prediction result of those two text classifiers is taken as the final classification result; and if the class predictions of the three text classifiers are all different, the class prediction result of the text classifier corresponding to the largest macro F1 value obtained in step (4) is taken as the final classification result.
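The decision rule of claim 8 for the three-classifier case can be sketched as follows; `combine_three` is a hypothetical helper name, and the macro F1 values stand in for the step-(4) scores used to break three-way disagreements:

```python
def combine_three(preds, macro_f1s):
    """Decision rule of claim 8 for three classifiers: unanimity or a 2-of-3
    majority wins; on a three-way disagreement, fall back to the classifier
    with the largest macro F1 from the training stage."""
    p1, p2, p3 = preds
    if p1 == p2 or p1 == p3:
        return p1                      # p1 agrees with at least one other
    if p2 == p3:
        return p2                      # p2 and p3 form the majority
    best = max(range(3), key=lambda i: macro_f1s[i])
    return preds[best]                 # all differ: best classifier decides

print(combine_three(["sports", "finance", "sports"], [0.81, 0.78, 0.85]))  # sports
print(combine_three(["sports", "finance", "tech"], [0.81, 0.78, 0.85]))    # tech
```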
9. A text classification system based on part-of-speech classification, characterized by comprising:
a text classifier construction module, which specifically includes:
a first submodule for obtaining a training text set and a test text set from the network and preprocessing them, so as to obtain multiple word sets of each text in the training text set and the test text set;
a second submodule for training the topic generation model LDA with the multiple word sets of each text obtained by the first submodule as input, so as to obtain the text-word set-topic mixture probability distribution model of each text under different topic numbers;
a third submodule for performing classifier training on the text-word set-topic mixture probability distribution models of each word set in the training text set under different topic numbers using the svm-train function, so as to obtain multiple trained classifiers;
a fourth submodule for feeding the text-word set-topic mixture probability distribution models of each class of text in the test text set under different topic numbers into the classifiers trained by the third submodule to perform SVM class prediction; obtaining, from the SVM class prediction results and the true class of each class of text in the test text set, the macro F1 value of the class prediction of each word set of the test text set under different topic numbers; choosing the largest several values from the obtained macro F1 values; and establishing multiple text classifiers from the text-word set-topic mixture probability distribution model, the word set, and the topic number corresponding to each of the chosen values;
a text classification module, which specifically includes:
a fifth submodule for obtaining a target text to be classified and preprocessing it, so as to obtain multiple word sets of the target text, and retaining, among the obtained word sets, the word set corresponding to each text classifier established by the fourth submodule;
a sixth submodule for training the topic generation model LDA with the word sets of the target text retained by the fifth submodule as input, so as to obtain the text-word set-topic mixture probability distribution model of each word set of the target text under its corresponding topic number;
a seventh submodule for feeding the text-word set-topic mixture probability distribution model of each word set of the target text obtained by the sixth submodule, under its corresponding topic number, into the corresponding text classifier established by the fourth submodule for class prediction, so as to obtain the class prediction result of each text classifier;
an eighth submodule for obtaining the final classification result of the target text according to the class prediction results of the text classifiers obtained by the seventh submodule, in combination with the weight value pre-assigned to each text classifier.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810551315.3A CN108763539B (en) | 2018-05-31 | 2018-05-31 | Text classification method and system based on part-of-speech classification |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108763539A true CN108763539A (en) | 2018-11-06 |
CN108763539B CN108763539B (en) | 2020-11-10 |
Family
ID=64001297
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810551315.3A Active CN108763539B (en) | 2018-05-31 | 2018-05-31 | Text classification method and system based on part-of-speech classification |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108763539B (en) |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2016051220A (en) * | 2014-08-28 | 2016-04-11 | 有限責任監査法人トーマツ | Analytical method, analyzer, and analysis program |
CN107291795A (en) * | 2017-05-03 | 2017-10-24 | 华南理工大学 | A text classification method combining dynamic word embeddings and part-of-speech tagging |
Non-Patent Citations (1)
Title |
---|
Zhang Chao, "Research on a Text Classification Method with a Part-of-Speech-Tagged LDA Model", China Masters' Theses Full-text Database, Information Science and Technology series * |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110032639A (en) * | 2018-12-27 | 2019-07-19 | 中国银联股份有限公司 | Method, apparatus and storage medium for matching semantic text data with tags |
CN110032639B (en) * | 2018-12-27 | 2023-10-31 | 中国银联股份有限公司 | Method, device and storage medium for matching semantic text data with tag |
CN110413773A (en) * | 2019-06-20 | 2019-11-05 | 平安科技(深圳)有限公司 | Intelligent text classification method, device and computer readable storage medium |
CN110413773B (en) * | 2019-06-20 | 2023-09-22 | 平安科技(深圳)有限公司 | Intelligent text classification method, device and computer readable storage medium |
CN112184133A (en) * | 2019-07-02 | 2021-01-05 | 黎嘉明 | Artificial intelligence-based government office system preset approval and division method |
CN111090746A (en) * | 2019-11-29 | 2020-05-01 | 北京明略软件系统有限公司 | Method for determining optimal number of topics, and method and device for training emotion classifier |
CN111090746B (en) * | 2019-11-29 | 2023-04-28 | 北京明略软件系统有限公司 | Method for determining optimal topic quantity, training method and device for emotion classifier |
CN111723206A (en) * | 2020-06-19 | 2020-09-29 | 北京明略软件系统有限公司 | Text classification method and device, computer equipment and storage medium |
CN111723206B (en) * | 2020-06-19 | 2024-01-19 | 北京明略软件系统有限公司 | Text classification method, apparatus, computer device and storage medium |
CN113761911A (en) * | 2021-03-17 | 2021-12-07 | 中科天玑数据科技股份有限公司 | Domain text labeling method based on weak supervision |
CN113204489A (en) * | 2021-05-28 | 2021-08-03 | 中国工商银行股份有限公司 | Test problem processing method, device and equipment |
CN113204489B (en) * | 2021-05-28 | 2024-04-30 | 中国工商银行股份有限公司 | Test problem processing method, device and equipment |
Also Published As
Publication number | Publication date |
---|---|
CN108763539B (en) | 2020-11-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108763539A (en) | A text classification method and system based on part-of-speech classification | |
CN110287494A (en) | A short text similarity matching method based on the deep learning BERT algorithm | |
KR20190063978A (en) | Automatic classification method of unstructured data | |
Fahad et al. | Inflectional review of deep learning on natural language processing | |
CN108509409A (en) | A method for automatically generating semantically similar sentence samples | |
CN107797987A (en) | A mixed-corpus named entity recognition method based on Bi-LSTM-CNN | |
Tachicart et al. | Automatic identification of Moroccan colloquial Arabic | |
CN109543036A (en) | Text Clustering Method based on semantic similarity | |
Tasharofi et al. | Evaluation of statistical part of speech tagging of Persian text | |
Anjum et al. | Exploring humor in natural language processing: a comprehensive review of JOKER tasks at CLEF symposium 2023 | |
CN110728144A (en) | Extraction type document automatic summarization method based on context semantic perception | |
CN110069632B (en) | Deep learning text classification method integrating shallow semantic expression vectors | |
Teng et al. | Emotion recognition from text based on the rough set theory and the support vector machines | |
CN116257616A (en) | Entity relation extraction method and system for music field | |
CN107818078B (en) | Semantic association and matching method for Chinese natural language dialogue | |
KR20200040032A (en) | A method for classification of Korean postings based on bidirectional LSTM-attention | |
JP6586055B2 (en) | Deep case analysis device, deep case learning device, deep case estimation device, method, and program | |
Ayadi et al. | Intertextual distance for Arabic texts classification | |
Povoda et al. | Emotion recognition from helpdesk messages | |
CN113343667A (en) | Network character attribute extraction and relation analysis method based on multi-source information | |
Zheng | A Novel Computer-Aided Emotion Recognition of Text Method Based on WordEmbedding and Bi-LSTM | |
Basumatary et al. | Deep Learning Based Bodo Parts of Speech Tagger | |
CN108573025A (en) | Method and device for extracting sentence classification features based on hybrid templates | |
Smywinski-Pohl et al. | Application of Character-Level Language Models in the Domain of Polish Statutory Law. | |
Tayal et al. | DARNN: Discourse Analysis for Natural languages using RNN and LSTM. |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||