CN108763539B - Text classification method and system based on part-of-speech classification - Google Patents

Text classification method and system based on part-of-speech classification

Info

Publication number
CN108763539B
Authority
CN
China
Prior art keywords
text
word
topic
training
speech
Prior art date
Legal status
Active
Application number
CN201810551315.3A
Other languages
Chinese (zh)
Other versions
CN108763539A (en)
Inventor
周可
李兴
曾江峰
Current Assignee
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Huazhong University of Science and Technology
Priority to CN201810551315.3A
Publication of CN108763539A
Application granted
Publication of CN108763539B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Abstract

The invention discloses a text classification method based on part-of-speech classification, which comprises the following steps: obtaining a training text set and a test text set from a network and preprocessing them to obtain a plurality of word sets for each text in both sets; training the text topic generation model LDA with the obtained word sets of each text as input, to obtain a text-word set-topic mixed probability distribution model for each text under different topic numbers; training classifiers on the plurality of text-word set-topic mixed probability distribution models using an SVM-train function, to obtain a plurality of trained classifiers; and performing SVM classification prediction with the plurality of text-word set-topic mixed probability distribution models as input to the trained classifiers. The method solves the technical problems of the prior art: high dimensionality of the required feature words during model training, low classification accuracy, and poor generalization capability of the classifier.

Description

Text classification method and system based on part-of-speech classification
Technical Field
The invention belongs to the technical field of computer deep learning, and particularly relates to a text classification method and system based on part of speech classification.
Background
With the widespread use of social and self-media software, the data generated every day by internet platforms, mainly pictures, voice, and text, with text predominating, is growing rapidly. Classifying this mass of data manually requires screening and extraction that consume a great deal of time and effort, and the resulting classification is not satisfactory. Text classification techniques have been developed to improve the efficiency and accuracy of text classification.
The existing text classification method mainly adopts the text topic generation model LDA (Latent Dirichlet Allocation), which can identify latent topic information in a large-scale document collection or corpus. It adopts the bag-of-words method, which treats each document as a word-frequency vector, thereby converting text information into numerical information that is easy to model.
However, the above method has some non-negligible drawbacks. First, it requires the participation of all word sets both when training the classifier and when classifying, so the dimensionality of the required feature words is high during model training. Second, it does not consider the differing contributions that different parts of speech and part-of-speech combinations make to text classification, so the classification accuracy is low and the generalization capability of the classifier is poor.
Disclosure of Invention
In view of the above defects or improvement needs of the prior art, the present invention provides a text classification method and system based on part-of-speech classification, aiming to solve the technical problems of the prior art: high dimensionality of the required feature words during model training, low classification accuracy, and poor generalization capability of the classifier.
To achieve the above object, according to an aspect of the present invention, there is provided a text classification method based on part-of-speech classification, including:
the text classifier construction process specifically comprises the following steps:
(1) acquiring a training text set and a test text set from the network, and preprocessing the training text set and the test text set to obtain a plurality of word sets of each text in the training text set and the test text set;
(2) Training a text topic generation model LDA by taking the plurality of word sets of each text obtained in the step (1) as input to obtain a text-word set-topic mixed probability distribution model of each text under different topic numbers;
(3) performing classifier training on the plurality of text-word set-topic mixed probability distribution models of each word set in the training text set under different topic numbers by using an SVM-train function to obtain a plurality of trained classifiers;
(4) using the plurality of text-word set-topic mixed probability distribution models of each type of text of the test text set under different topic numbers as the input of the classifiers trained in step (3), performing SVM class prediction, obtaining the macro F1 values of each word set under different topic numbers in the test text set according to the SVM class prediction results and the actual classes of each type of text in the test text set, selecting the largest several values from the obtained macro F1 values, and respectively establishing a plurality of text classifiers according to the text-word set-topic mixed probability distribution models, corresponding word sets, and corresponding topic numbers for those values;
secondly, a text classification process specifically comprises the following steps:
(1') obtaining a target text to be classified, preprocessing the target text to obtain a plurality of word sets of the target text, and reserving the word sets corresponding to each text classifier obtained in the step (4) from the obtained plurality of word sets;
(2 ') training a text topic generation model LDA by taking the plurality of word sets of the target text reserved in the step (1') as input to obtain a text-word set-topic mixed probability distribution model of the plurality of word sets of the target text under the corresponding topic number;
(3 ') performing category prediction by taking the text-word set-topic mixed probability distribution model of each word set of the target text obtained in the step (2') under the corresponding topic number as the input of the corresponding text classifier obtained in the step (4) to obtain a category prediction result of each text classifier;
and (4 ') obtaining a final classification result of the target text according to the class prediction result of each text classifier obtained in the step (3') and by combining the weight value pre-distributed to each text classifier.
Preferably, step (1) comprises in particular the following sub-steps:
(1-1) performing word segmentation and part-of-speech tagging on each text in the training text set and the test text set respectively to obtain a mapping table between word segmentation results and part-of-speech;
(1-2) removing stop words from the mapping table between the word segmentation result and the part of speech obtained in the step (1-1) to obtain an updated mapping table;
and (1-3) dividing the updated mapping table obtained in the step (1-2) according to the parts of speech, and removing the corresponding parts of speech to respectively obtain a plurality of word sets.
Preferably, the plurality of word sets obtained in step (1-3) include a noun word set, a verb word set, a noun-verb combined word set, an other-words set, and an all-words set.
Preferably, the method used for training the LDA model is the Gibbs algorithm, whose input is the plurality of word sets of each text in the whole training text set together with the hyper-parameters, and whose output is the probability distribution of each word set of each text under different topic numbers.
Preferably, the function used for SVM category prediction is the svm-predict function of the LIBSVM tool, and the SVM category prediction algorithm is the one-versus-one SVM multi-class classification algorithm.
Preferably, the macro F1 values of category predictions of each word set in the test text set under different topic numbers are obtained by using the following calculation formula:
$$\mathrm{MacroF1} = \frac{1}{n} \sum_{i=1}^{n} F1_i$$
where $n$ represents the total number of text categories, $F1_i$ represents the F1 value of the $i$-th category, and $i \in [1, n]$.
Preferably, the F1 value of the $i$-th category is calculated as follows:
$$F1_i = \frac{2 \times P_i \times R_i}{P_i + R_i}$$
where $P_i$ is the precision and $R_i$ is the recall.
The precision $P_i$ is calculated as follows:
$$P_i = \frac{a_i}{a_i + b_i}$$
The recall $R_i$ is calculated as follows:
$$R_i = \frac{a_i}{a_i + c_i}$$
where $a_i$ is the number of texts whose SVM category prediction is C and whose true category is C, C representing a certain category; $b_i$ is the number of texts whose SVM category prediction is C but whose true category is not C; and $c_i$ is the number of texts whose SVM category prediction is not C but whose true category is C.
Preferably, in step (4'), if the number of text classifiers obtained in step (4) is 1, the class prediction result obtained in step (3') is the final classification result of the target text. If three text classifiers were obtained in step (4), then: if the class prediction results of the three text classifiers obtained in step (3') are all the same, that class prediction is taken as the final classification result; if the class predictions of two of the text classifiers are the same, the class prediction of those two is taken as the final classification result; and if the class predictions of all three text classifiers differ, the class prediction of the text classifier corresponding to the largest macro F1 value obtained in step (4) is taken as the final classification result.
According to another aspect of the present invention, there is provided a text classification system based on part-of-speech classification, including:
the text classifier building module specifically comprises:
the first sub-module is used for acquiring a training text set and a test text set from a network, and preprocessing the training text set and the test text set to acquire a plurality of word sets of each text in the training text set and the test text set.
The second submodule is used for training the text topic generation model LDA by taking the plurality of word sets of each text obtained by the first submodule as input, so as to obtain a text-word set-topic mixed probability distribution model of each text under different topic numbers;
the third submodule is used for performing classifier training, using an SVM-train function, on the plurality of text-word set-topic mixed probability distribution models of each word set in the training text set under different topic numbers, so as to obtain a plurality of trained classifiers;
a fourth sub-module, configured to use the plurality of text-word set-topic mixed probability distribution models of each type of text in the test text set under different topic numbers as input of the classifiers trained in the third sub-module, perform SVM category prediction, obtain the macro F1 value of each word set under different topic numbers in the test text set according to the SVM category prediction results and the actual category of each type of text in the test text set, select the largest several values from the obtained macro F1 values, and respectively establish a plurality of text classifiers according to the text-word set-topic mixed probability distribution models, corresponding word sets, and corresponding topic numbers for those values;
the text classification module specifically comprises:
the fifth sub-module is used for acquiring a target text to be classified, preprocessing the target text to acquire a plurality of word sets of the target text, and reserving the word sets corresponding to each text classifier acquired in the fourth sub-module from the acquired plurality of word sets;
the sixth submodule is used for training the text topic generation model LDA by taking the plurality of word sets of the target text reserved in the fifth submodule as input so as to obtain a text-word set-topic mixed probability distribution model of the plurality of word sets of the target text under the corresponding topic number;
a seventh sub-module, configured to take the text-word set-topic mixed probability distribution model of each word set of the target text obtained by the sixth sub-module under the corresponding topic number as input to the corresponding text classifier obtained by the fourth sub-module for category prediction, so as to obtain the category prediction result of each text classifier;
and the eighth submodule is used for obtaining a final classification result of the target text according to the class prediction result of each text classifier obtained by the seventh submodule and by combining the weight value pre-distributed to each text classifier.
In general, compared with the prior art, the above technical solution contemplated by the present invention can achieve the following beneficial effects:
1. according to the method, the contribution difference of words with different parts of speech to semantic expression is fully considered, a training text set is divided into different word sets according to the combination of the parts of speech and the parts of speech, and the preprocessed word sets are further removed according to the word sets corresponding to the text classifier constructed in the step (4), so that the purpose of reducing dimensionality is achieved;
2. when the method is used for making a decision on the unknown text category, the differences of the contributions of different word sets to semantic expression are considered, meanwhile, the contributions of some word sets to semantic expression exceed the contributions of all word sets, and then a plurality of text training sets corresponding to a large macro F1 value are selected through the step (4), so that the accuracy of text classification and the generalization capability of a text classifier can be improved.
Drawings
FIG. 1 is a flow chart of a text classification method based on part of speech classification according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
As shown in fig. 1, the text classification method based on part-of-speech classification of the present invention includes:
the text classifier construction process specifically comprises the following steps:
(1) acquiring a training text set and a test text set from the network, and preprocessing the training text set and the test text set to obtain a plurality of word sets of each text in the training text set and the test text set.
The test text set is a set of texts of known category containing at least 100 texts; it is used to screen the text classifiers in the subsequent steps.
The step (1) specifically comprises the following substeps:
(1-1) performing word segmentation and part-of-speech tagging on each text in the training text set and the test text set respectively to obtain a mapping table between word segmentation results and part-of-speech;
specifically, the parts of speech are nouns, verbs, adjectives, adverbs, and the like.
For example, if a certain text in the text set contains the sentence "Engineers write books.", the mapping table obtained after the processing in this step is:
engineer    noun
write       verb
book        noun
The specific process of word segmentation is as follows: for alphabetic languages, word segmentation mainly splits sentences into individual words at the spaces; for Chinese, word segmentation splits a sequence of Chinese characters into individual words.
More specifically, for the Chinese text classification method based on part-of-speech classification in this step, the Chinese word segmentation and part-of-speech tagging tool adopted is the Chinese lexical analysis system NLPIR developed by the Institute of Computing Technology of the Chinese Academy of Sciences.
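As a minimal illustration of step (1-1), the sketch below uses jieba's part-of-speech tagger as a stand-in for the NLPIR system named above; this substitution is an assumption made purely for illustration, as the patent itself uses NLPIR:

```python
# Minimal sketch of step (1-1): build the word / part-of-speech mapping table.
# jieba is used here as a stand-in for NLPIR (an assumption); jieba's tags
# follow the usual Chinese conventions: 'n*' nouns, 'v*' verbs, 'a*' adjectives,
# 'd' adverbs.
import jieba.posseg as pseg

def build_mapping_table(text):
    """Segment `text` and return the (word, part-of-speech) mapping table."""
    return [(token.word, token.flag) for token in pseg.cut(text)]

# e.g. build_mapping_table("工程师编写书籍。")
# might yield [('工程师', 'n'), ('编写', 'v'), ('书籍', 'n'), ('。', 'x')]
```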
(1-2) removing stop words from the mapping table between the word segmentation result and the part of speech obtained in the step (1-1) to obtain an updated mapping table;
specifically, stop words include scrambles, English characters, single numbers, mathematical symbols, punctuation marks, and high frequency single words.
For the example in step (1-1), the result obtained after the processing in this step is (the example sentence contains no stop words, so the table is unchanged):
engineer    noun
write       verb
book        noun
(1-3) dividing the updated mapping table obtained in the step (1-2) according to the parts of speech, and removing the corresponding parts of speech to respectively obtain a plurality of word sets;
specifically, the obtained plurality of word sets include a noun word set, a verb word set, a noun verb combined word set, other word sets, and all word sets.
Wherein, the words of each text in the noun word set are nouns; correspondingly, the verb word set is all verbs, and the noun verb word set comprises nouns and verbs; other word sets include adjectives and adverbs. All word sets include nouns, verbs, adjectives, adverbs.
For the example in step (1-1), the noun word set obtained in this step is:
engineer
book
The verb word set is:
write
The noun-verb combined word set is:
engineer
write
book
The other-words set is empty, and the all-words set is:
engineer
write
book
The division in this step is specifically done by constructing regular expressions that match words of different parts of speech in the text and dividing all the words of the original text directly according to the part-of-speech tagging results.
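A short sketch of steps (1-2) and (1-3), assuming jieba-style part-of-speech tag prefixes ('n' nouns, 'v' verbs, 'a' adjectives, 'd' adverbs) and an inline placeholder stop-word list; in practice the stop-word list would be loaded from a file:

```python
# Sketch of steps (1-2) and (1-3): filter out stop words, then split the
# mapping table into the five word sets by part-of-speech tag.
STOP_WORDS = {"的", "了", "是", "。", "，"}  # placeholder; real list loaded from file

def split_word_sets(mapping_table):
    pairs = [(w, pos) for w, pos in mapping_table if w not in STOP_WORDS]
    nouns  = [w for w, pos in pairs if pos.startswith("n")]
    verbs  = [w for w, pos in pairs if pos.startswith("v")]
    others = [w for w, pos in pairs if pos.startswith(("a", "d"))]
    return {
        "noun":      nouns,
        "verb":      verbs,
        "noun_verb": nouns + verbs,           # noun-verb combined word set
        "other":     others,                  # adjectives and adverbs
        "all":       nouns + verbs + others,  # all-words set
    }
```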
(2) Training the text topic generation model LDA by taking the plurality of word sets of each text obtained in step (1) (the noun word set, the verb word set, the noun-verb combined word set, the other-words set, and the all-words set) as input, so as to obtain a text-word set-topic mixed probability distribution model of each text under different topic numbers;
specifically, the method adopted for training the LDA model is the Gibbs algorithm, and the input of the algorithm is a plurality of word sets and hyper-parameters of each text in all training text sets. The topic features (i.e., probability distributions) for each set of words of each text are output for different number of topics.
Specifically, the hyper-parameters of the LDA model mainly include α, β, K, and iter_number, where α is the prior parameter of the text-topic distribution, β is the prior parameter of the topic-word distribution, K is the manually set number of topics with value range [10, 150] and step size 10, and iter_number is the number of Gibbs sampling iterations, which is greater than 1000.
Numerous experiments have shown that the topic distribution of the corpus tends to stabilize when α = 50/K, β = 0.01, and the number of Gibbs sampling iterations is at least 1000. To give the Markov chain a better convergence effect, the number of iterations is set to 1500. The number of topics K is set manually and needs to be determined from experimental results.
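A sketch of this training step under the stated settings (α = 50/K, β = 0.01, 1500 Gibbs iterations, K swept over [10, 150] in steps of 10), assuming the `lda` PyPI package, which trains LDA by collapsed Gibbs sampling; the patent does not name a specific implementation:

```python
# Sketch of step (2): train one Gibbs-sampled LDA model per topic number K and
# keep the resulting document-topic distributions as classification features.
import lda
from sklearn.feature_extraction.text import CountVectorizer

def train_lda_models(word_set_docs):
    """word_set_docs: one pre-tokenized document (list of words) per text.
    Returns {K: document-topic distribution matrix of shape (n_docs, K)}."""
    vectorizer = CountVectorizer(analyzer=lambda doc: doc)  # docs already tokenized
    X = vectorizer.fit_transform(word_set_docs)
    doc_topic_by_k = {}
    for k in range(10, 151, 10):               # K in [10, 150], step 10
        model = lda.LDA(n_topics=k, n_iter=1500, alpha=50.0 / k, eta=0.01,
                        random_state=1)        # eta plays the role of beta
        doc_topic_by_k[k] = model.fit_transform(X)
    return doc_topic_by_k
```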
(3) Performing classifier training on the plurality of text-word set-topic mixed probability distribution models of each word set in the training text set under different topic numbers by using an SVM-train function, so as to obtain a plurality of trained classifiers;
(4) Using the text-word set-topic mixed probability distribution models of each type of text in the test text set under different topic numbers as input to the classifiers trained in step (3) and performing SVM class prediction; obtaining the macro F1 value of each word set under each topic number from the SVM class prediction results and the actual classes of each type of text in the test text set; selecting the largest values (1 or 3 of them in this embodiment) from the obtained macro F1 values; and establishing text classifiers from the text-word set-topic mixed probability distribution models, word sets, and topic numbers corresponding to those values, the number of text classifiers being equal to the number of selected values;
specifically, the category is a history-like text, an art-like text, a military-like text, or the like in the training data.
SVM category prediction first selects the data of two categories, taking one category as the positive class and the other as the negative class, and then trains a classifier on the data of those two categories; this is repeated for every pair of categories (one-versus-one).
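A sketch of step (3), assuming scikit-learn's SVC, which wraps LIBSVM and internally uses the same one-versus-one multi-class scheme as the SVM-train function named in the text; the linear kernel is an assumption, as the patent does not state which kernel is used:

```python
# Sketch of step (3): train one LIBSVM-backed classifier per (word set, K)
# pair on the document-topic distribution features.
from sklearn.svm import SVC

def train_classifiers(features_by_setting, labels):
    """features_by_setting: {(word_set_name, K): doc-topic matrix} for the
    training texts; returns one trained classifier per (word set, K) pair."""
    classifiers = {}
    for setting, features in features_by_setting.items():
        clf = SVC(kernel="linear")   # LIBSVM-backed, one-vs-one internally
        clf.fit(features, labels)
        classifiers[setting] = clf
    return classifiers
```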
The macro F1 value of the category prediction of each word set in the test text set under different topic numbers is obtained with the following calculation formula:
$$\mathrm{MacroF1} = \frac{1}{n} \sum_{i=1}^{n} F1_i$$
where $n$ denotes the total number of text categories, $F1_i$ denotes the F1 value of the $i$-th category, and $i \in [1, n]$,
where the F1 value of the $i$-th category is calculated as follows:
$$F1_i = \frac{2 \times P_i \times R_i}{P_i + R_i}$$
where $P_i$ is the precision and $R_i$ is the recall.
The precision $P_i$ is calculated as follows:
$$P_i = \frac{a_i}{a_i + b_i}$$
The recall $R_i$ is calculated as follows:
$$R_i = \frac{a_i}{a_i + c_i}$$
where $a_i$ is the number of texts whose SVM category prediction is C (C representing a certain category) and whose true category is C; $b_i$ is the number of texts whose SVM category prediction is C but whose true category is not C; and $c_i$ is the number of texts whose SVM category prediction is not C but whose true category is C.
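The macro F1 defined above can be computed directly from the per-category counts $a_i$, $b_i$, $c_i$; a minimal sketch:

```python
# Direct implementation of the macro F1 defined above. In step (4) the macro
# F1 of each (word set, K) pair is computed this way, and the top 1 or 3 pairs
# become the final text classifiers.
def macro_f1(y_true, y_pred, categories):
    f1_values = []
    for c in categories:
        a = sum(t == c and p == c for t, p in zip(y_true, y_pred))      # a_i
        b = sum(t != c and p == c for t, p in zip(y_true, y_pred))      # b_i
        c_cnt = sum(t == c and p != c for t, p in zip(y_true, y_pred))  # c_i
        precision = a / (a + b) if a + b else 0.0
        recall = a / (a + c_cnt) if a + c_cnt else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        f1_values.append(f1)
    return sum(f1_values) / len(f1_values)
```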
Secondly, a text classification process specifically comprises the following steps:
(1') obtaining a target text to be classified, preprocessing the target text to obtain a plurality of word sets of the target text, and reserving the word sets corresponding to each text classifier obtained in the step (4) from the obtained plurality of word sets;
specifically, the preprocessing process in this step is substantially the same as that in step (1), and the only difference is that, on the basis of the processing result in step (1-3), the word sets corresponding to the plurality of text classifiers obtained in step (4) are retained from the processing result.
(2 ') training a text topic generation model (LDA) by taking the plurality of word sets of the target text reserved in the step (1') as input to obtain a text-word set-topic mixed probability distribution model of the plurality of word sets of the target text under the corresponding topic number;
specifically, the method adopted for training the LDA model is the Gibbs algorithm, which inputs a plurality of word sets and hyper-parameters of the target text. The topic characteristics (namely probability distribution) of each word set of the target text under different topic numbers are output.
Specifically, the iteration times of the hyper-parameters are more than 1000, the hyper-parameters of the LDA model mainly include α, β, K, iter _ number, where α is a prior parameter of text-topic distribution, β is a prior parameter of topic-word distribution, K is the number of topics set manually, and the value range is [10,150], where the step length is 10. iter _ number is the number of Gibbs sampling iterations.
It has been shown from a number of experiments that the topic distribution of the corpus tends to stabilize when α is 50/K, β is 0.01, and the number of iterations of Gibbs sampling is greater than or equal to 1000. In order to enable the Markov chain to have better convergence effect, the iteration number is 1500. The number of subjects K is a manually set value and needs to be determined based on experimental results.
(3') Taking the text-word set-topic mixed probability distribution model of each word set of the target text obtained in step (2') under the corresponding topic number as input to the corresponding text classifier obtained in step (4) for category prediction, so as to obtain the category prediction result of each text classifier;
and (4 ') obtaining a final classification result of the target text according to the class prediction result of each text classifier obtained in the step (3') and by combining the weight value pre-distributed to each text classifier.
Specifically, if the number of text classifiers obtained in step (4) is 1, the class prediction result obtained in step (3') is the final classification result of the target text.
If three text classifiers were obtained in step (4) (for example a noun classifier, a verb classifier, and an adjective classifier), then: if the category prediction results of the three text classifiers obtained in step (3') are all the same, that category prediction is taken as the final classification result; if the category predictions of two of the text classifiers are the same, the category prediction of those two is taken as the final classification result; and if the category predictions of all three text classifiers differ, the category prediction of the text classifier corresponding to the largest macro F1 value obtained in step (4) is taken as the final classification result.
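A minimal sketch of this decision rule, realising the pre-distributed weights as a simple majority vote with a fallback to the classifier with the largest macro F1:

```python
# Sketch of step (4'): with one selected classifier its prediction stands;
# with three, a majority vote decides, and a three-way disagreement falls back
# to the classifier with the largest macro F1 from step (4).
from collections import Counter

def fuse_predictions(predictions, macro_f1_values):
    """predictions / macro_f1_values: parallel lists, one per text classifier."""
    if len(predictions) == 1:
        return predictions[0]
    label, votes = Counter(predictions).most_common(1)[0]
    if votes >= 2:                      # unanimous, or two out of three agree
        return label
    best = max(range(len(predictions)), key=lambda i: macro_f1_values[i])
    return predictions[best]            # all three differ: trust the best one
```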
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (8)

1. A text classification method based on part of speech classification is characterized by comprising the following steps:
the text classifier construction process specifically comprises the following steps:
(1) acquiring a training text set and a test text set from a network, and preprocessing the training text set and the test text set to obtain a plurality of word sets of each text in the training text set and the test text set; the step (1) specifically comprises the following substeps:
(1-1) performing word segmentation and part-of-speech tagging on each text in the training text set and the test text set respectively to obtain a mapping table between word segmentation results and part-of-speech;
(1-2) removing stop words from the mapping table between the word segmentation result and the part of speech obtained in the step (1-1) to obtain an updated mapping table;
(1-3) dividing the updated mapping table obtained in the step (1-2) according to the parts of speech, and removing the corresponding parts of speech to respectively obtain a plurality of word sets; the dividing process of the step is specifically that a regular expression which can be matched with words of different parts of speech in the text is constructed, and all words of the original text are divided directly according to the result of part of speech tagging;
(2) training a text topic generation model LDA by taking the plurality of word sets of each text obtained in the step (1) as input to obtain a text-word set-topic mixed probability distribution model of each text under different topic numbers;
(3) performing classifier training on the plurality of text-word set-topic mixed probability distribution models of each word set in the training text set under different topic numbers by using an SVM-train function to obtain a plurality of trained classifiers;
(4) using the plurality of text-word set-topic mixed probability distribution models of each type of text of the test text set under different topic numbers as the input of the classifiers trained in step (3), performing SVM class prediction, obtaining the macro F1 values of each word set under different topic numbers in the test text set according to the SVM class prediction results and the actual classes of each type of text in the test text set, selecting the largest several values from the obtained macro F1 values, and respectively establishing a plurality of text classifiers according to the text-word set-topic mixed probability distribution models, corresponding word sets, and corresponding topic numbers for those values;
secondly, a text classification process specifically comprises the following steps:
(1') obtaining a target text to be classified, preprocessing the target text to obtain a plurality of word sets of the target text, and reserving the word sets corresponding to each text classifier obtained in the step (4) from the obtained plurality of word sets;
(2 ') training a text topic generation model LDA by taking the plurality of word sets of the target text reserved in the step (1') as input to obtain a text-word set-topic mixed probability distribution model of the plurality of word sets of the target text under the corresponding topic number;
(3 ') performing category prediction by taking the text-word set-topic mixed probability distribution model of each word set of the target text obtained in the step (2') under the corresponding topic number as the input of the corresponding text classifier obtained in the step (4) to obtain a category prediction result of each text classifier;
and (4 ') obtaining a final classification result of the target text according to the class prediction result of each text classifier obtained in the step (3') and by combining the weight value pre-distributed to each text classifier.
2. The method for classifying text based on part-of-speech classification as claimed in claim 1, wherein the plurality of word sets obtained in step (1-3) include a noun word set, a verb word set, a noun-verb combined word set, an other-words set, and an all-words set.
3. The method of claim 1, wherein the LDA model is trained by the Gibbs algorithm, whose input is the plurality of word sets of each text in the whole training text set together with the hyper-parameters, and whose output is the probability distribution of each word set of each text under different topic numbers.
4. The method of claim 1, wherein the function used for SVM class prediction is the svm-predict function of the LIBSVM tool, and the SVM class prediction algorithm is the one-versus-one SVM multi-class classification algorithm.
5. The method for classifying texts based on part-of-speech classification as claimed in claim 1, wherein the macro F1 values of class predictions of each word set in the test text set under different topic numbers are obtained by using the following calculation formula:
$$\mathrm{MacroF1} = \frac{1}{n} \sum_{i=1}^{n} F1_i$$
where $n$ denotes the total number of categories of the test text, $F1_i$ denotes the F1 value of the $i$-th category, and $i \in [1, n]$.
6. The method for classifying text based on part-of-speech classification as claimed in claim 5, wherein the F1 value of the $i$-th category is calculated as follows:
$$F1_i = \frac{2 \times P_i \times R_i}{P_i + R_i}$$
where $P_i$ is the precision and $R_i$ is the recall.
The precision $P_i$ is calculated as follows:
$$P_i = \frac{a_i}{a_i + b_i}$$
The recall $R_i$ is calculated as follows:
$$R_i = \frac{a_i}{a_i + c_i}$$
where $a_i$ is the number of texts in the test text set whose SVM category prediction is C and whose true category is C, C representing a certain category; $b_i$ is the number of texts whose SVM category prediction is C but whose true category is not C; and $c_i$ is the number of texts whose SVM category prediction is not C but whose true category is C.
7. The method for classifying texts based on part-of-speech classification as claimed in claim 1, wherein in step (4'): if the number of text classifiers obtained in step (4) is 1, the class prediction result obtained in step (3') is the final classification result of the target text; if three text classifiers were obtained in step (4), then if the class prediction results of the three text classifiers obtained in step (3') are all the same, that class prediction is taken as the final classification result, if the class predictions of two of the text classifiers are the same, the class prediction of those two is taken as the final classification result, and if the class predictions of all three text classifiers differ, the class prediction of the text classifier corresponding to the largest macro F1 value obtained in step (4) is taken as the final classification result.
8. A system for classifying text based on part-of-speech classification, comprising:
the text classifier building module specifically comprises:
the first sub-module is used for acquiring a training text set and a test text set from a network, and preprocessing the training text set and the test text set to acquire a plurality of word sets of each text in the training text set and the test text set; the module specifically comprises the following subunits:
the first subunit is used for respectively carrying out word segmentation and part-of-speech tagging on each text in the training text set and the test text set so as to obtain a mapping table between word segmentation results and part-of-speech;
the second subunit is used for eliminating stop words from the mapping table between the word segmentation result and the part of speech obtained in the first subunit to obtain an updated mapping table;
the third subunit is used for dividing the updated mapping table obtained by the second subunit according to the part of speech and removing the corresponding part of speech to obtain a plurality of word sets respectively; the sub-unit dividing process specifically includes constructing a regular expression capable of matching different part-of-speech words in a text, and directly dividing all words of an original text according to part-of-speech tagging results;
the second submodule is used for training the text topic generation model LDA by taking the plurality of word sets of each text obtained by the first submodule as input, so as to obtain a text-word set-topic mixed probability distribution model of each text under different topic numbers;
the third submodule is used for performing classifier training, using an SVM-train function, on the plurality of text-word set-topic mixed probability distribution models of each word set in the training text set under different topic numbers, so as to obtain a plurality of trained classifiers;
a fourth sub-module, configured to use the plurality of text-word set-topic mixed probability distribution models of each type of text in the test text set under different topic numbers as input of the classifiers trained in the third sub-module, perform SVM category prediction, obtain the macro F1 value of each word set under different topic numbers in the test text set according to the SVM category prediction results and the actual category of each type of text in the test text set, select the largest several values from the obtained macro F1 values, and respectively establish a plurality of text classifiers according to the text-word set-topic mixed probability distribution models, corresponding word sets, and corresponding topic numbers for those values;
the text classification module specifically comprises:
the fifth sub-module is used for acquiring a target text to be classified, preprocessing the target text to acquire a plurality of word sets of the target text, and reserving the word sets corresponding to each text classifier acquired in the fourth sub-module from the acquired plurality of word sets;
the sixth submodule is used for training the text topic generation model LDA by taking the plurality of word sets of the target text reserved in the fifth submodule as input so as to obtain a text-word set-topic mixed probability distribution model of the plurality of word sets of the target text under the corresponding topic number;
a seventh sub-module, configured to take the text-word set-topic mixed probability distribution model of each word set of the target text obtained by the sixth sub-module under the corresponding topic number as input to the corresponding text classifier obtained by the fourth sub-module for category prediction, so as to obtain the category prediction result of each text classifier;
and the eighth submodule is used for obtaining a final classification result of the target text according to the class prediction result of each text classifier obtained by the seventh submodule and by combining the weight value pre-distributed to each text classifier.
CN201810551315.3A 2018-05-31 2018-05-31 Text classification method and system based on part-of-speech classification Active CN108763539B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810551315.3A CN108763539B (en) 2018-05-31 2018-05-31 Text classification method and system based on part-of-speech classification


Publications (2)

Publication Number Publication Date
CN108763539A (en) 2018-11-06
CN108763539B (en) 2020-11-10

Family

ID=64001297

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810551315.3A Active CN108763539B (en) 2018-05-31 2018-05-31 Text classification method and system based on part-of-speech classification

Country Status (1)

Country Link
CN (1) CN108763539B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110032639B (en) * 2018-12-27 2023-10-31 中国银联股份有限公司 Method, device and storage medium for matching semantic text data with tag
CN110413773B (en) * 2019-06-20 2023-09-22 平安科技(深圳)有限公司 Intelligent text classification method, device and computer readable storage medium
CN112184133A (en) * 2019-07-02 2021-01-05 黎嘉明 Artificial intelligence-based government office system preset approval and division method
CN111090746B (en) * 2019-11-29 2023-04-28 北京明略软件系统有限公司 Method for determining optimal topic quantity, training method and device for emotion classifier
CN111723206B (en) * 2020-06-19 2024-01-19 北京明略软件系统有限公司 Text classification method, apparatus, computer device and storage medium
CN113761911A (en) * 2021-03-17 2021-12-07 中科天玑数据科技股份有限公司 Domain text labeling method based on weak supervision

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6085888B2 (en) * 2014-08-28 2017-03-01 有限責任監査法人トーマツ Analysis method, analysis apparatus, and analysis program
CN107291795B (en) * 2017-05-03 2020-06-19 华南理工大学 Text classification method combining dynamic word embedding and part-of-speech tagging

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"Chinese Short Text Classification Method Combining Semantic Expansion and Convolutional Neural Networks"; Lu Ling et al.; Journal of Computer Applications; 2017-12-10 (No. 12); pp. 3498-3503 *

Also Published As

Publication number Publication date
CN108763539A (en) 2018-11-06

Similar Documents

Publication Publication Date Title
CN108763539B (en) Text classification method and system based on part-of-speech classification
US10607598B1 (en) Determining input data for speech processing
Vosoughi et al. Tweet acts: A speech act classifier for twitter
Alwehaibi et al. Comparison of pre-trained word vectors for arabic text classification using deep learning approach
WO2019000170A1 (en) Generating responses in automated chatting
Fahad et al. Inflectional review of deep learning on natural language processing
Fonseca et al. A two-step convolutional neural network approach for semantic role labeling
CN111858935A (en) Fine-grained emotion classification system for flight comment
Agrawal et al. Affective representations for sarcasm detection
CN112052331A (en) Method and terminal for processing text information
CN114528919A (en) Natural language processing method and device and computer equipment
CN110705247A (en) Based on x2-C text similarity calculation method
CN111339772B (en) Russian text emotion analysis method, electronic device and storage medium
Tachicart et al. Automatic identification of Moroccan colloquial Arabic
CN109543036A (en) Text Clustering Method based on semantic similarity
Nerabie et al. The impact of Arabic part of speech tagging on sentiment analysis: A new corpus and deep learning approach
Yousif Hidden Markov Model tagger for applications based Arabic text: A review
Errami et al. Sentiment Analysis onMoroccan Dialect based on ML and Social Media Content Detection
CN110728144A (en) Extraction type document automatic summarization method based on context semantic perception
CN113065350A (en) Biomedical text word sense disambiguation method based on attention neural network
Putra et al. Sentiment Analysis on Social Media with Glove Using Combination CNN and RoBERTa
Zhu et al. YUN111@ Dravidian-CodeMix-FIRE2020: Sentiment Analysis of Dravidian Code Mixed Text.
CN107729509B (en) Discourse similarity determination method based on recessive high-dimensional distributed feature representation
KR20200040032A (en) A method ofr classification of korean postings based on bidirectional lstm-attention
US20220366893A1 (en) Systems and methods for few-shot intent classifier models

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant