CN108763539B - Text classification method and system based on part-of-speech classification - Google Patents

Text classification method and system based on part-of-speech classification

Info

Publication number
CN108763539B
Authority
CN
China
Prior art keywords
text
word
topic
training
speech
Prior art date
Legal status
Active
Application number
CN201810551315.3A
Other languages
Chinese (zh)
Other versions
CN108763539A (en)
Inventor
周可
李兴
曾江峰
Current Assignee
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Huazhong University of Science and Technology
Priority to CN201810551315.3A
Publication of CN108763539A
Application granted
Publication of CN108763539B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Abstract

The invention discloses a text classification method based on part-of-speech classification, which comprises the following steps: obtaining a training text set and a test text set from a network and preprocessing them to obtain a plurality of word sets for each text in both sets; training the text topic generation model LDA with the obtained word sets of each text as input, to obtain a text-word set-topic mixed probability distribution model for each text under different topic numbers; training classifiers on the plurality of text-word set-topic mixed probability distribution models using an SVM-train function, to obtain a plurality of trained classifiers; and performing SVM classification prediction with the plurality of text-word set-topic mixed probability distribution models as input to the trained classifiers. The method solves the technical problems of the prior art: high dimensionality of the required feature words during model training, low classification accuracy, and poor generalization capability of the classifier.

Description

Text classification method and system based on part-of-speech classification
Technical Field
The invention belongs to the technical field of computer deep learning, and particularly relates to a text classification method and system based on part of speech classification.
Background
With the widespread use of social and self-media software, the data generated every day by internet platforms, mainly pictures, voice, and text, with text predominating, is growing rapidly. Classifying this mass of data manually requires screening and extraction that consume a great deal of time and effort, and the resulting classification is not satisfactory. Text classification techniques have been developed to improve the efficiency and accuracy of text classification.
The existing text classification method mainly adopts the text topic generation model LDA (Latent Dirichlet Allocation), which can identify latent topic information in a large-scale document collection or corpus. It adopts the bag-of-words method, which treats each document as a word-frequency vector, thereby converting text information into numerical information that is easy to model.
However, the above method has some non-negligible drawbacks. First, it requires the participation of all word sets both when training the classifier and when classifying, so the dimensionality of the required feature words is high during model training. Second, it does not consider the differing contributions that different parts of speech and part-of-speech combinations make to text classification, so the classification accuracy is low and the generalization capability of the classifier is poor.
Disclosure of Invention
In view of the above defects or improvement needs of the prior art, the present invention provides a text classification method and system based on part-of-speech classification, aiming to solve the technical problems of the prior art: high dimensionality of the required feature words during model training, low classification accuracy, and poor generalization capability of the classifier.
To achieve the above object, according to an aspect of the present invention, there is provided a text classification method based on part-of-speech classification, including:
the text classifier construction process specifically comprises the following steps:
(1) acquiring a training text set and a test text set from the network, and preprocessing the training text set and the test text set to obtain a plurality of word sets of each text in the training text set and the test text set;
(2) Training a text topic generation model LDA by taking the plurality of word sets of each text obtained in the step (1) as input to obtain a text-word set-topic mixed probability distribution model of each text under different topic numbers;
(3) performing classifier training on the plurality of text-word set-topic mixed probability distribution models of each word set in the training text set under different topic numbers by using an SVM-train function to obtain a plurality of trained classifiers;
(4) using the plurality of text-word set-topic mixed probability distribution models of each type of text of the test text set under different topic numbers as the input of the classifiers trained in step (3), performing SVM class prediction, obtaining the macro F1 values of each word set under different topic numbers in the test text set according to the SVM class prediction results and the actual classes of each type of text in the test text set, selecting the largest several values from the obtained macro F1 values, and respectively establishing a plurality of text classifiers according to the text-word set-topic mixed probability distribution models, corresponding word sets, and corresponding topic numbers for those values;
secondly, a text classification process specifically comprises the following steps:
(1') obtaining a target text to be classified, preprocessing the target text to obtain a plurality of word sets of the target text, and reserving the word sets corresponding to each text classifier obtained in the step (4) from the obtained plurality of word sets;
(2 ') training a text topic generation model LDA by taking the plurality of word sets of the target text reserved in the step (1') as input to obtain a text-word set-topic mixed probability distribution model of the plurality of word sets of the target text under the corresponding topic number;
(3 ') performing category prediction by taking the text-word set-topic mixed probability distribution model of each word set of the target text obtained in the step (2') under the corresponding topic number as the input of the corresponding text classifier obtained in the step (4) to obtain a category prediction result of each text classifier;
and (4 ') obtaining a final classification result of the target text according to the class prediction result of each text classifier obtained in the step (3') and by combining the weight value pre-distributed to each text classifier.
Preferably, step (1) comprises in particular the following sub-steps:
(1-1) performing word segmentation and part-of-speech tagging on each text in the training text set and the test text set respectively to obtain a mapping table between word segmentation results and part-of-speech;
(1-2) removing stop words from the mapping table between the word segmentation result and the part of speech obtained in the step (1-1) to obtain an updated mapping table;
and (1-3) dividing the updated mapping table obtained in the step (1-2) according to the parts of speech, and removing the corresponding parts of speech to respectively obtain a plurality of word sets.
Preferably, the plurality of word sets obtained in step (1-3) include a noun word set, a verb word set, a noun-verb combined word set, an other-words set, and an all-words set.
Preferably, the method used for training the LDA model is the Gibbs algorithm, whose input is the plurality of word sets of each text in the whole training text set together with the hyper-parameters, and whose output is the probability distribution of each word set of each text under different topic numbers.
Preferably, the function used for SVM category prediction is the svm-predict function of the LIBSVM tool, and the SVM category prediction algorithm is the one-versus-one SVM multi-class classification algorithm.
Preferably, the macro F1 values of category predictions of each word set in the test text set under different topic numbers are obtained by using the following calculation formula:
$$\mathrm{MacroF1} = \frac{1}{n} \sum_{i=1}^{n} F1_i$$
where $n$ represents the total number of text categories, $F1_i$ represents the F1 value of the $i$-th category, and $i \in [1, n]$.
Preferably, the F1 value of the $i$-th category is calculated as follows:
$$F1_i = \frac{2 \times P_i \times R_i}{P_i + R_i}$$
where $P_i$ is the precision and $R_i$ is the recall.
The precision $P_i$ is calculated as follows:
$$P_i = \frac{a_i}{a_i + b_i}$$
The recall $R_i$ is calculated as follows:
$$R_i = \frac{a_i}{a_i + c_i}$$
where $a_i$ is the number of texts whose SVM category prediction is C and whose true category is C, C representing a certain category; $b_i$ is the number of texts whose SVM category prediction is C but whose true category is not C; and $c_i$ is the number of texts whose SVM category prediction is not C but whose true category is C.
Preferably, in step (4'), if the number of text classifiers obtained in step (4) is 1, the class prediction result obtained in step (3') is the final classification result of the target text. If three text classifiers were obtained in step (4), then: if the class prediction results of the three text classifiers obtained in step (3') are all the same, that class prediction is taken as the final classification result; if the class predictions of two of the text classifiers are the same, the class prediction of those two is taken as the final classification result; and if the class predictions of all three text classifiers differ, the class prediction of the text classifier corresponding to the largest macro F1 value obtained in step (4) is taken as the final classification result.
According to another aspect of the present invention, there is provided a text classification system based on part-of-speech classification, including:
the text classifier building module specifically comprises:
the first sub-module is used for acquiring a training text set and a test text set from a network, and preprocessing the training text set and the test text set to acquire a plurality of word sets of each text in the training text set and the test text set.
The second submodule is used for training the text topic generation model LDA by taking the plurality of word sets of each text obtained by the first submodule as input, so as to obtain a text-word set-topic mixed probability distribution model of each text under different topic numbers;
the third submodule is used for performing classifier training, using an SVM-train function, on the plurality of text-word set-topic mixed probability distribution models of each word set in the training text set under different topic numbers, so as to obtain a plurality of trained classifiers;
a fourth sub-module, configured to use the plurality of text-word set-topic mixed probability distribution models of each type of text in the test text set under different topic numbers as input of the classifiers trained in the third sub-module, perform SVM category prediction, obtain the macro F1 value of each word set under different topic numbers in the test text set according to the SVM category prediction results and the actual category of each type of text in the test text set, select the largest several values from the obtained macro F1 values, and respectively establish a plurality of text classifiers according to the text-word set-topic mixed probability distribution models, corresponding word sets, and corresponding topic numbers for those values;
the text classification module specifically comprises:
the fifth sub-module is used for acquiring a target text to be classified, preprocessing the target text to acquire a plurality of word sets of the target text, and reserving the word sets corresponding to each text classifier acquired in the fourth sub-module from the acquired plurality of word sets;
the sixth submodule is used for training the text topic generation model LDA by taking the plurality of word sets of the target text reserved in the fifth submodule as input so as to obtain a text-word set-topic mixed probability distribution model of the plurality of word sets of the target text under the corresponding topic number;
a seventh sub-module, configured to take the text-word set-topic mixed probability distribution model of each word set of the target text obtained by the sixth sub-module under the corresponding topic number as input to the corresponding text classifier obtained by the fourth sub-module for category prediction, so as to obtain the category prediction result of each text classifier;
and the eighth submodule is used for obtaining a final classification result of the target text according to the class prediction result of each text classifier obtained by the seventh submodule and by combining the weight value pre-distributed to each text classifier.
In general, compared with the prior art, the above technical solution contemplated by the present invention can achieve the following beneficial effects:
1. according to the method, the contribution difference of words with different parts of speech to semantic expression is fully considered, a training text set is divided into different word sets according to the combination of the parts of speech and the parts of speech, and the preprocessed word sets are further removed according to the word sets corresponding to the text classifier constructed in the step (4), so that the purpose of reducing dimensionality is achieved;
2. when the method is used for making a decision on the unknown text category, the differences of the contributions of different word sets to semantic expression are considered, meanwhile, the contributions of some word sets to semantic expression exceed the contributions of all word sets, and then a plurality of text training sets corresponding to a large macro F1 value are selected through the step (4), so that the accuracy of text classification and the generalization capability of a text classifier can be improved.
Drawings
FIG. 1 is a flow chart of a text classification method based on part of speech classification according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
As shown in fig. 1, the text classification method based on part-of-speech classification of the present invention includes:
the text classifier construction process specifically comprises the following steps:
(1) acquiring a training text set and a test text set from the network, and preprocessing the training text set and the test text set to obtain a plurality of word sets of each text in the training text set and the test text set.
The test text set is a set of texts of known category containing at least 100 texts; it is used to screen the text classifiers in the subsequent steps.
The step (1) specifically comprises the following substeps:
(1-1) performing word segmentation and part-of-speech tagging on each text in the training text set and the test text set respectively to obtain a mapping table between word segmentation results and part-of-speech;
specifically, the parts of speech are nouns, verbs, adjectives, adverbs, and the like.
For example, if a certain text in the text set contains the sentence "Engineers write books.", the mapping table obtained after the processing in this step is:
engineer    noun
write       verb
book        noun
The specific process of word segmentation is as follows: for alphabetic languages, word segmentation mainly splits sentences into individual words at the spaces; for Chinese, word segmentation splits a sequence of Chinese characters into individual words.
More specifically, for the Chinese text classification method based on part-of-speech classification in this step, the Chinese word segmentation and part-of-speech tagging tool adopted is the Chinese lexical analysis system NLPIR developed by the Institute of Computing Technology of the Chinese Academy of Sciences.
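As a minimal illustration of step (1-1), the sketch below uses jieba's part-of-speech tagger as a stand-in for the NLPIR system named above; this substitution is an assumption made purely for illustration, as the patent itself uses NLPIR:

```python
# Minimal sketch of step (1-1): build the word / part-of-speech mapping table.
# jieba is used here as a stand-in for NLPIR (an assumption); jieba's tags
# follow the usual Chinese conventions: 'n*' nouns, 'v*' verbs, 'a*' adjectives,
# 'd' adverbs.
import jieba.posseg as pseg

def build_mapping_table(text):
    """Segment `text` and return the (word, part-of-speech) mapping table."""
    return [(token.word, token.flag) for token in pseg.cut(text)]

# e.g. build_mapping_table("工程师编写书籍。")
# might yield [('工程师', 'n'), ('编写', 'v'), ('书籍', 'n'), ('。', 'x')]
```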
(1-2) removing stop words from the mapping table between the word segmentation result and the part of speech obtained in the step (1-1) to obtain an updated mapping table;
specifically, stop words include scrambles, English characters, single numbers, mathematical symbols, punctuation marks, and high frequency single words.
For the example in step (1-1), the result obtained after the processing in this step is (the example sentence contains no stop words, so the table is unchanged):
engineer    noun
write       verb
book        noun
(1-3) dividing the updated mapping table obtained in the step (1-2) according to the parts of speech, and removing the corresponding parts of speech to respectively obtain a plurality of word sets;
specifically, the obtained plurality of word sets include a noun word set, a verb word set, a noun verb combined word set, other word sets, and all word sets.
Wherein, the words of each text in the noun word set are nouns; correspondingly, the verb word set is all verbs, and the noun verb word set comprises nouns and verbs; other word sets include adjectives and adverbs. All word sets include nouns, verbs, adjectives, adverbs.
For the example in step (1-1), the noun word set obtained in this step is:
engineer
book
The verb word set is:
write
The noun-verb combined word set is:
engineer
write
book
The other-words set is empty, and the all-words set is:
engineer
write
book
The division in this step is specifically done by constructing regular expressions that match words of different parts of speech in the text and dividing all the words of the original text directly according to the part-of-speech tagging results.
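A short sketch of steps (1-2) and (1-3), assuming jieba-style part-of-speech tag prefixes ('n' nouns, 'v' verbs, 'a' adjectives, 'd' adverbs) and an inline placeholder stop-word list; in practice the stop-word list would be loaded from a file:

```python
# Sketch of steps (1-2) and (1-3): filter out stop words, then split the
# mapping table into the five word sets by part-of-speech tag.
STOP_WORDS = {"的", "了", "是", "。", "，"}  # placeholder; real list loaded from file

def split_word_sets(mapping_table):
    pairs = [(w, pos) for w, pos in mapping_table if w not in STOP_WORDS]
    nouns  = [w for w, pos in pairs if pos.startswith("n")]
    verbs  = [w for w, pos in pairs if pos.startswith("v")]
    others = [w for w, pos in pairs if pos.startswith(("a", "d"))]
    return {
        "noun":      nouns,
        "verb":      verbs,
        "noun_verb": nouns + verbs,           # noun-verb combined word set
        "other":     others,                  # adjectives and adverbs
        "all":       nouns + verbs + others,  # all-words set
    }
```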
(2) Training the text topic generation model LDA by taking the plurality of word sets of each text obtained in step (1) (the noun word set, the verb word set, the noun-verb combined word set, the other-words set, and the all-words set) as input, so as to obtain a text-word set-topic mixed probability distribution model of each text under different topic numbers;
specifically, the method adopted for training the LDA model is the Gibbs algorithm, and the input of the algorithm is a plurality of word sets and hyper-parameters of each text in all training text sets. The topic features (i.e., probability distributions) for each set of words of each text are output for different number of topics.
Specifically, the hyper-parameters of the LDA model mainly include α, β, K, and iter_number, where α is the prior parameter of the text-topic distribution, β is the prior parameter of the topic-word distribution, K is the manually set number of topics with value range [10, 150] and step size 10, and iter_number is the number of Gibbs sampling iterations, which is greater than 1000.
Numerous experiments have shown that the topic distribution of the corpus tends to stabilize when α = 50/K, β = 0.01, and the number of Gibbs sampling iterations is at least 1000. To give the Markov chain a better convergence effect, the number of iterations is set to 1500. The number of topics K is set manually and needs to be determined from experimental results.
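A sketch of this training step under the stated settings (α = 50/K, β = 0.01, 1500 Gibbs iterations, K swept over [10, 150] in steps of 10), assuming the `lda` PyPI package, which trains LDA by collapsed Gibbs sampling; the patent does not name a specific implementation:

```python
# Sketch of step (2): train one Gibbs-sampled LDA model per topic number K and
# keep the resulting document-topic distributions as classification features.
import lda
from sklearn.feature_extraction.text import CountVectorizer

def train_lda_models(word_set_docs):
    """word_set_docs: one pre-tokenized document (list of words) per text.
    Returns {K: document-topic distribution matrix of shape (n_docs, K)}."""
    vectorizer = CountVectorizer(analyzer=lambda doc: doc)  # docs already tokenized
    X = vectorizer.fit_transform(word_set_docs)
    doc_topic_by_k = {}
    for k in range(10, 151, 10):               # K in [10, 150], step 10
        model = lda.LDA(n_topics=k, n_iter=1500, alpha=50.0 / k, eta=0.01,
                        random_state=1)        # eta plays the role of beta
        doc_topic_by_k[k] = model.fit_transform(X)
    return doc_topic_by_k
```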
(3) Performing classifier training on the plurality of text-word set-topic mixed probability distribution models of each word set in the training text set under different topic numbers by using an SVM-train function, so as to obtain a plurality of trained classifiers;
(4) Using the text-word set-topic mixed probability distribution models of each type of text in the test text set under different topic numbers as input to the classifiers trained in step (3) and performing SVM class prediction; obtaining the macro F1 value of each word set under each topic number from the SVM class prediction results and the actual classes of each type of text in the test text set; selecting the largest values (1 or 3 of them in this embodiment) from the obtained macro F1 values; and establishing text classifiers from the text-word set-topic mixed probability distribution models, word sets, and topic numbers corresponding to those values, the number of text classifiers being equal to the number of selected values;
specifically, the category is a history-like text, an art-like text, a military-like text, or the like in the training data.
SVM category prediction first selects the data of two categories, taking one category as the positive class and the other as the negative class, and then trains a classifier on the data of those two categories; this is repeated for every pair of categories (one-versus-one).
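A sketch of step (3), assuming scikit-learn's SVC, which wraps LIBSVM and internally uses the same one-versus-one multi-class scheme as the SVM-train function named in the text; the linear kernel is an assumption, as the patent does not state which kernel is used:

```python
# Sketch of step (3): train one LIBSVM-backed classifier per (word set, K)
# pair on the document-topic distribution features.
from sklearn.svm import SVC

def train_classifiers(features_by_setting, labels):
    """features_by_setting: {(word_set_name, K): doc-topic matrix} for the
    training texts; returns one trained classifier per (word set, K) pair."""
    classifiers = {}
    for setting, features in features_by_setting.items():
        clf = SVC(kernel="linear")   # LIBSVM-backed, one-vs-one internally
        clf.fit(features, labels)
        classifiers[setting] = clf
    return classifiers
```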
The macro F1 value of the category prediction of each word set in the test text set under different topic numbers is obtained with the following calculation formula:
$$\mathrm{MacroF1} = \frac{1}{n} \sum_{i=1}^{n} F1_i$$
where $n$ denotes the total number of text categories, $F1_i$ denotes the F1 value of the $i$-th category, and $i \in [1, n]$,
where the F1 value of the $i$-th category is calculated as follows:
$$F1_i = \frac{2 \times P_i \times R_i}{P_i + R_i}$$
where $P_i$ is the precision and $R_i$ is the recall.
The precision $P_i$ is calculated as follows:
$$P_i = \frac{a_i}{a_i + b_i}$$
The recall $R_i$ is calculated as follows:
$$R_i = \frac{a_i}{a_i + c_i}$$
where $a_i$ is the number of texts whose SVM category prediction is C (C representing a certain category) and whose true category is C; $b_i$ is the number of texts whose SVM category prediction is C but whose true category is not C; and $c_i$ is the number of texts whose SVM category prediction is not C but whose true category is C.
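The macro F1 defined above can be computed directly from the per-category counts $a_i$, $b_i$, $c_i$; a minimal sketch:

```python
# Direct implementation of the macro F1 defined above. In step (4) the macro
# F1 of each (word set, K) pair is computed this way, and the top 1 or 3 pairs
# become the final text classifiers.
def macro_f1(y_true, y_pred, categories):
    f1_values = []
    for c in categories:
        a = sum(t == c and p == c for t, p in zip(y_true, y_pred))      # a_i
        b = sum(t != c and p == c for t, p in zip(y_true, y_pred))      # b_i
        c_cnt = sum(t == c and p != c for t, p in zip(y_true, y_pred))  # c_i
        precision = a / (a + b) if a + b else 0.0
        recall = a / (a + c_cnt) if a + c_cnt else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        f1_values.append(f1)
    return sum(f1_values) / len(f1_values)
```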
Secondly, a text classification process specifically comprises the following steps:
(1') obtaining a target text to be classified, preprocessing the target text to obtain a plurality of word sets of the target text, and reserving the word sets corresponding to each text classifier obtained in the step (4) from the obtained plurality of word sets;
specifically, the preprocessing process in this step is substantially the same as that in step (1), and the only difference is that, on the basis of the processing result in step (1-3), the word sets corresponding to the plurality of text classifiers obtained in step (4) are retained from the processing result.
(2 ') training a text topic generation model (LDA) by taking the plurality of word sets of the target text reserved in the step (1') as input to obtain a text-word set-topic mixed probability distribution model of the plurality of word sets of the target text under the corresponding topic number;
specifically, the method adopted for training the LDA model is the Gibbs algorithm, which inputs a plurality of word sets and hyper-parameters of the target text. The topic characteristics (namely probability distribution) of each word set of the target text under different topic numbers are output.
Specifically, the iteration times of the hyper-parameters are more than 1000, the hyper-parameters of the LDA model mainly include α, β, K, iter _ number, where α is a prior parameter of text-topic distribution, β is a prior parameter of topic-word distribution, K is the number of topics set manually, and the value range is [10,150], where the step length is 10. iter _ number is the number of Gibbs sampling iterations.
It has been shown from a number of experiments that the topic distribution of the corpus tends to stabilize when α is 50/K, β is 0.01, and the number of iterations of Gibbs sampling is greater than or equal to 1000. In order to enable the Markov chain to have better convergence effect, the iteration number is 1500. The number of subjects K is a manually set value and needs to be determined based on experimental results.
(3') Taking the text-word set-topic mixed probability distribution model of each word set of the target text obtained in step (2') under the corresponding topic number as input to the corresponding text classifier obtained in step (4) for category prediction, so as to obtain the category prediction result of each text classifier;
and (4 ') obtaining a final classification result of the target text according to the class prediction result of each text classifier obtained in the step (3') and by combining the weight value pre-distributed to each text classifier.
Specifically, if the number of text classifiers obtained in step (4) is 1, the class prediction result obtained in step (3') is the final classification result of the target text.
If three text classifiers were obtained in step (4) (for example a noun classifier, a verb classifier, and an adjective classifier), then: if the category prediction results of the three text classifiers obtained in step (3') are all the same, that category prediction is taken as the final classification result; if the category predictions of two of the text classifiers are the same, the category prediction of those two is taken as the final classification result; and if the category predictions of all three text classifiers differ, the category prediction of the text classifier corresponding to the largest macro F1 value obtained in step (4) is taken as the final classification result.
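A minimal sketch of this decision rule, realising the pre-distributed weights as a simple majority vote with a fallback to the classifier with the largest macro F1:

```python
# Sketch of step (4'): with one selected classifier its prediction stands;
# with three, a majority vote decides, and a three-way disagreement falls back
# to the classifier with the largest macro F1 from step (4).
from collections import Counter

def fuse_predictions(predictions, macro_f1_values):
    """predictions / macro_f1_values: parallel lists, one per text classifier."""
    if len(predictions) == 1:
        return predictions[0]
    label, votes = Counter(predictions).most_common(1)[0]
    if votes >= 2:                      # unanimous, or two out of three agree
        return label
    best = max(range(len(predictions)), key=lambda i: macro_f1_values[i])
    return predictions[best]            # all three differ: trust the best one
```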
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (8)

1. A text classification method based on part of speech classification is characterized by comprising the following steps:
the text classifier construction process specifically comprises the following steps:
(1) acquiring a training text set and a test text set from a network, and preprocessing the training text set and the test text set to obtain a plurality of word sets of each text in the training text set and the test text set; the step (1) specifically comprises the following substeps:
(1-1) performing word segmentation and part-of-speech tagging on each text in the training text set and the test text set respectively to obtain a mapping table between word segmentation results and part-of-speech;
(1-2) removing stop words from the mapping table between the word segmentation result and the part of speech obtained in the step (1-1) to obtain an updated mapping table;
(1-3) dividing the updated mapping table obtained in the step (1-2) according to the parts of speech, and removing the corresponding parts of speech to respectively obtain a plurality of word sets; the dividing process of the step is specifically that a regular expression which can be matched with words of different parts of speech in the text is constructed, and all words of the original text are divided directly according to the result of part of speech tagging;
(2) training a text topic generation model LDA by taking the plurality of word sets of each text obtained in the step (1) as input to obtain a text-word set-topic mixed probability distribution model of each text under different topic numbers;
(3) performing classifier training on the plurality of text-word set-topic mixed probability distribution models of each word set in the training text set under different topic numbers by using an SVM-train function to obtain a plurality of trained classifiers;
(4) using the plurality of text-word set-topic mixed probability distribution models of each type of text of the test text set under different topic numbers as the input of the classifiers trained in step (3), performing SVM class prediction, obtaining the macro F1 values of each word set under different topic numbers in the test text set according to the SVM class prediction results and the actual classes of each type of text in the test text set, selecting the largest several values from the obtained macro F1 values, and respectively establishing a plurality of text classifiers according to the text-word set-topic mixed probability distribution models, corresponding word sets, and corresponding topic numbers for those values;
secondly, a text classification process specifically comprises the following steps:
(1') obtaining a target text to be classified, preprocessing the target text to obtain a plurality of word sets of the target text, and reserving the word sets corresponding to each text classifier obtained in the step (4) from the obtained plurality of word sets;
(2 ') training a text topic generation model LDA by taking the plurality of word sets of the target text reserved in the step (1') as input to obtain a text-word set-topic mixed probability distribution model of the plurality of word sets of the target text under the corresponding topic number;
(3 ') performing category prediction by taking the text-word set-topic mixed probability distribution model of each word set of the target text obtained in the step (2') under the corresponding topic number as the input of the corresponding text classifier obtained in the step (4) to obtain a category prediction result of each text classifier;
and (4 ') obtaining a final classification result of the target text according to the class prediction result of each text classifier obtained in the step (3') and by combining the weight value pre-distributed to each text classifier.
2. The method for classifying text based on part-of-speech classification as claimed in claim 1, wherein the plurality of word sets obtained in step (1-3) include a noun word set, a verb word set, a noun-verb combined word set, an other-words set, and an all-words set.
3. The method of claim 1, wherein the LDA model is trained by the Gibbs algorithm, whose input is the plurality of word sets of each text in the whole training text set together with the hyper-parameters, and whose output is the probability distribution of each word set of each text under different topic numbers.
4. The method of claim 1, wherein the function used for SVM class prediction is the svm-predict function of the LIBSVM tool, and the SVM class prediction algorithm is the one-versus-one SVM multi-class classification algorithm.
5. The method for classifying texts based on part-of-speech classification as claimed in claim 1, wherein the macro F1 values of class predictions of each word set in the test text set under different topic numbers are obtained by using the following calculation formula:
$$\mathrm{MacroF1} = \frac{1}{n} \sum_{i=1}^{n} F1_i$$
where $n$ denotes the total number of categories of the test text, $F1_i$ denotes the F1 value of the $i$-th category, and $i \in [1, n]$.
6. The method for classifying text based on part-of-speech classification as claimed in claim 5, wherein the F1 value of the $i$-th category is calculated as follows:
$$F1_i = \frac{2 \times P_i \times R_i}{P_i + R_i}$$
where $P_i$ is the precision and $R_i$ is the recall.
The precision $P_i$ is calculated as follows:
$$P_i = \frac{a_i}{a_i + b_i}$$
The recall $R_i$ is calculated as follows:
$$R_i = \frac{a_i}{a_i + c_i}$$
where $a_i$ is the number of texts in the test text set whose SVM category prediction is C and whose true category is C, C representing a certain category; $b_i$ is the number of texts whose SVM category prediction is C but whose true category is not C; and $c_i$ is the number of texts whose SVM category prediction is not C but whose true category is C.
7. The method for classifying texts based on part-of-speech classification as claimed in claim 1, wherein in step (4'): if the number of text classifiers obtained in step (4) is 1, the class prediction result obtained in step (3') is the final classification result of the target text; if three text classifiers were obtained in step (4), then if the class prediction results of the three text classifiers obtained in step (3') are all the same, that class prediction is taken as the final classification result, if the class predictions of two of the text classifiers are the same, the class prediction of those two is taken as the final classification result, and if the class predictions of all three text classifiers differ, the class prediction of the text classifier corresponding to the largest macro F1 value obtained in step (4) is taken as the final classification result.
8. A system for classifying text based on part-of-speech classification, comprising:
the text classifier building module specifically comprises:
the first sub-module is used for acquiring a training text set and a test text set from a network, and preprocessing the training text set and the test text set to acquire a plurality of word sets of each text in the training text set and the test text set; the module specifically comprises the following subunits:
the first subunit is used for respectively carrying out word segmentation and part-of-speech tagging on each text in the training text set and the test text set so as to obtain a mapping table between word segmentation results and part-of-speech;
the second subunit is used for eliminating stop words from the mapping table between the word segmentation result and the part of speech obtained in the first subunit to obtain an updated mapping table;
the third subunit is used for dividing the updated mapping table obtained by the second subunit according to the part of speech and removing the corresponding part of speech to obtain a plurality of word sets respectively; the sub-unit dividing process specifically includes constructing a regular expression capable of matching different part-of-speech words in a text, and directly dividing all words of an original text according to part-of-speech tagging results;
the second submodule is used for training the text topic generation model LDA by taking the plurality of word sets of each text obtained by the first submodule as input, so as to obtain a text-word set-topic mixed probability distribution model of each text under different topic numbers;
the third submodule is used for performing classifier training, using an SVM-train function, on the plurality of text-word set-topic mixed probability distribution models of each word set in the training text set under different topic numbers, so as to obtain a plurality of trained classifiers;
a fourth sub-module, configured to use the plurality of text-word set-topic mixed probability distribution models of each type of text in the test text set under different topic numbers as input of the classifiers trained in the third sub-module, perform SVM category prediction, obtain the macro F1 value of each word set under different topic numbers in the test text set according to the SVM category prediction results and the actual category of each type of text in the test text set, select the largest several values from the obtained macro F1 values, and respectively establish a plurality of text classifiers according to the text-word set-topic mixed probability distribution models, corresponding word sets, and corresponding topic numbers for those values;
the text classification module specifically comprises:
the fifth sub-module is used for acquiring a target text to be classified, preprocessing the target text to acquire a plurality of word sets of the target text, and reserving the word sets corresponding to each text classifier acquired in the fourth sub-module from the acquired plurality of word sets;
the sixth submodule is used for training the text topic generation model LDA by taking the plurality of word sets of the target text reserved in the fifth submodule as input so as to obtain a text-word set-topic mixed probability distribution model of the plurality of word sets of the target text under the corresponding topic number;
a seventh sub-module, configured to take the text-word set-topic mixed probability distribution model of each word set of the target text obtained by the sixth sub-module under the corresponding topic number as input to the corresponding text classifier obtained by the fourth sub-module for category prediction, so as to obtain the category prediction result of each text classifier;
and the eighth submodule is used for obtaining a final classification result of the target text according to the class prediction result of each text classifier obtained by the seventh submodule and by combining the weight value pre-distributed to each text classifier.
CN201810551315.3A 2018-05-31 2018-05-31 Text classification method and system based on part-of-speech classification Active CN108763539B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810551315.3A CN108763539B (en) 2018-05-31 2018-05-31 Text classification method and system based on part-of-speech classification


Publications (2)

Publication Number Publication Date
CN108763539A (en) 2018-11-06
CN108763539B (en) 2020-11-10

Family

ID=64001297

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810551315.3A Active CN108763539B (en) 2018-05-31 2018-05-31 Text classification method and system based on part-of-speech classification

Country Status (1)

Country Link
CN (1) CN108763539B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110032639B (en) * 2018-12-27 2023-10-31 中国银联股份有限公司 Method, device and storage medium for matching semantic text data with tag
CN110413773B (en) * 2019-06-20 2023-09-22 平安科技(深圳)有限公司 Intelligent text classification method, device and computer readable storage medium
CN112184133A (en) * 2019-07-02 2021-01-05 黎嘉明 Artificial intelligence-based government office system preset approval and division method
CN111090746B (en) * 2019-11-29 2023-04-28 北京明略软件系统有限公司 Method for determining optimal topic quantity, training method and device for emotion classifier
CN111723206B (en) * 2020-06-19 2024-01-19 北京明略软件系统有限公司 Text classification method, apparatus, computer device and storage medium
CN113761911A (en) * 2021-03-17 2021-12-07 中科天玑数据科技股份有限公司 Domain text labeling method based on weak supervision

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6085888B2 (en) * 2014-08-28 2017-03-01 有限責任監査法人トーマツ Analysis method, analysis apparatus, and analysis program
CN107291795B (en) * 2017-05-03 2020-06-19 华南理工大学 Text classification method combining dynamic word embedding and part-of-speech tagging

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"Chinese Short Text Classification Method Combining Semantic Expansion and Convolutional Neural Networks"; Lu Ling et al.; Journal of Computer Applications; 2017-12-10 (No. 12); pp. 3498-3503 *

Also Published As

Publication number Publication date
CN108763539A (en) 2018-11-06

Similar Documents

Publication Publication Date Title
CN108763539B (en) Text classification method and system based on part-of-speech classification
US10607598B1 (en) Determining input data for speech processing
Vosoughi et al. Tweet acts: A speech act classifier for twitter
Alwehaibi et al. Comparison of pre-trained word vectors for arabic text classification using deep learning approach
WO2019000170A1 (en) Generating responses in automated chatting
Fahad et al. Inflectional review of deep learning on natural language processing
Fonseca et al. A two-step convolutional neural network approach for semantic role labeling
CN111858935A (en) Fine-grained emotion classification system for flight comment
Agrawal et al. Affective representations for sarcasm detection
CN112052331A (en) Method and terminal for processing text information
CN114528919A (en) Natural language processing method and device and computer equipment
CN110705247A (en) Based on x2-C text similarity calculation method
CN111339772B (en) Russian text emotion analysis method, electronic device and storage medium
Tachicart et al. Automatic identification of Moroccan colloquial Arabic
CN109543036A (en) Text Clustering Method based on semantic similarity
Nerabie et al. The impact of Arabic part of speech tagging on sentiment analysis: A new corpus and deep learning approach
Yousif Hidden Markov Model tagger for applications based Arabic text: A review
Errami et al. Sentiment Analysis onMoroccan Dialect based on ML and Social Media Content Detection
CN110728144A (en) Extraction type document automatic summarization method based on context semantic perception
CN113065350A (en) Biomedical text word sense disambiguation method based on attention neural network
Putra et al. Sentiment Analysis on Social Media with Glove Using Combination CNN and RoBERTa
Zhu et al. YUN111@ Dravidian-CodeMix-FIRE2020: Sentiment Analysis of Dravidian Code Mixed Text.
CN107729509B (en) Discourse similarity determination method based on recessive high-dimensional distributed feature representation
KR20200040032A (en) A method ofr classification of korean postings based on bidirectional lstm-attention
US20220366893A1 (en) Systems and methods for few-shot intent classifier models

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant