CN108763539B - Text classification method and system based on part-of-speech classification - Google Patents
- Publication number
- CN108763539B (application CN201810551315.3A)
- Authority
- CN
- China
- Prior art keywords
- text
- word
- topic
- training
- speech
- Prior art date
- Legal status: Active (an assumption, not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS; G06—COMPUTING; CALCULATING OR COUNTING; G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking (under G06F40/00 Handling natural language data; G06F40/20 Natural language analysis; G06F40/279 Recognition of textual entities)
- G06F18/2411—Classification techniques based on the proximity to a decision surface, e.g. support vector machines (under G06F18/00 Pattern recognition; G06F18/20 Analysing; G06F18/24 Classification techniques; G06F18/241 Classification techniques relating to the classification model)
- G06F40/30—Semantic analysis (under G06F40/00 Handling natural language data)
Abstract
The invention discloses a text classification method based on part-of-speech classification, which comprises the following steps: a training text set and a test text set are obtained from a network and preprocessed to obtain a plurality of word sets for each text in both sets; the text topic generation model LDA is trained with the obtained word sets of each text as input, yielding a text-word set-topic mixed probability distribution model for each text under different topic numbers; classifier training is performed on the plurality of text-word set-topic mixed probability distribution models using the svm-train function to obtain a plurality of trained classifiers; and SVM classification prediction is performed with the plurality of text-word set-topic mixed probability distribution models as input to the trained classifiers. The method addresses the technical problems of the prior art during model training: high dimensionality of the required feature words, low classification accuracy, and poor generalization capability of the classifier.
Description
Technical Field
The invention belongs to the technical field of computer deep learning, and particularly relates to a text classification method and system based on part of speech classification.
Background
As various social software and self-media software are widely used, the data generated by internet platforms every day, mainly pictures, voice, and text (with text dominating), is growing rapidly. Classifying this mass of data manually requires screening and extraction that consume a great deal of time and effort, and the classification results are often unsatisfactory. Text classification techniques have been developed to improve the efficiency and accuracy of text classification.
The existing text classification method mainly adopts the text topic generation model LDA (Latent Dirichlet Allocation), which can identify latent topic information in a large-scale document collection or corpus. It adopts the bag-of-words method, which treats each document as a word-frequency vector, thereby converting text information into numerical information that is easy to model.
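As a minimal illustration of the bag-of-words representation just described, each document can be reduced to a word-frequency vector in which word order is discarded; this is a generic sketch, not code from the patent:

```python
# Bag-of-words: a document becomes a mapping from word to frequency,
# discarding word order, so that text can be modeled numerically.
from collections import Counter

doc = "the engineer writes the book"
freq = Counter(doc.split())
print(freq["the"])       # 2
print(freq["engineer"])  # 1
```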
However, the above method has some non-negligible drawbacks. First, it requires the participation of the full word set both when training the classifier and when classifying, so the dimensionality of the required feature words is high during model training. Second, it does not consider the different contributions that different parts of speech and part-of-speech combinations make to text classification, so classification accuracy is low and the generalization capability of the classifier is poor.
Disclosure of Invention
In view of the above defects or improvement needs of the prior art, the present invention provides a text classification method and system based on part-of-speech classification, and aims to solve the technical problems of high dimensionality of the required feature words, low classification accuracy and poor generalization capability of the classifier in the prior art when training a model.
To achieve the above object, according to an aspect of the present invention, there is provided a text classification method based on part-of-speech classification, including:
the text classifier construction process specifically comprises the following steps:
(1) acquiring a training text set and a test text set from a network, and preprocessing them to obtain a plurality of word sets for each text in the training text set and the test text set;
(2) Training a text topic generation model LDA by taking the plurality of word sets of each text obtained in the step (1) as input to obtain a text-word set-topic mixed probability distribution model of each text under different topic numbers;
(3) performing classifier training on a plurality of text-word set-theme mixed probability distribution models of each word set in a training text set under different theme numbers by using an SVM-train function to obtain a plurality of trained classifiers;
(4) using the plurality of text-word set-topic mixed probability distribution models of each type of text in the test text set under different topic numbers as input to the classifiers trained in step (3), performing SVM class prediction; obtaining the macro F1 value of each word set under each topic number according to the SVM class prediction results and the actual classes of the texts in the test text set; selecting the several largest values from the obtained macro F1 values; and respectively establishing a plurality of text classifiers from the text-word set-topic mixed probability distribution models, word sets, and topic numbers corresponding to those values;
secondly, a text classification process specifically comprises the following steps:
(1') obtaining a target text to be classified, preprocessing the target text to obtain a plurality of word sets of the target text, and reserving the word sets corresponding to each text classifier obtained in the step (4) from the obtained plurality of word sets;
(2 ') training a text topic generation model LDA by taking the plurality of word sets of the target text reserved in the step (1') as input to obtain a text-word set-topic mixed probability distribution model of the plurality of word sets of the target text under the corresponding topic number;
(3 ') performing category prediction by taking the text-word set-topic mixed probability distribution model of each word set of the target text obtained in the step (2') under the corresponding topic number as the input of the corresponding text classifier obtained in the step (4) to obtain a category prediction result of each text classifier;
and (4') obtaining the final classification result of the target text according to the class prediction result of each text classifier obtained in step (3'), combined with the weight value pre-assigned to each text classifier.
Preferably, step (1) comprises in particular the following sub-steps:
(1-1) performing word segmentation and part-of-speech tagging on each text in the training text set and the test text set respectively to obtain a mapping table between word segmentation results and part-of-speech;
(1-2) removing stop words from the mapping table between the word segmentation result and the part of speech obtained in the step (1-1) to obtain an updated mapping table;
and (1-3) dividing the updated mapping table obtained in the step (1-2) according to the parts of speech, and removing the corresponding parts of speech to respectively obtain a plurality of word sets.
Preferably, the plurality of word sets obtained in step (1-3) include a noun word set, a verb word set, a noun-verb combined word set, an other-word set, and an all-word set.
Preferably, the method used for training the LDA model is the Gibbs sampling algorithm, whose input is the plurality of word sets of each text in the training text set together with the hyper-parameters, and whose output is the probability distribution of each word set of each text under different topic numbers.
Preferably, the function used for SVM category prediction is the svm-predict function based on the LIBSVM tool, and the SVM category prediction algorithm is the one-versus-one SVM multi-class classification algorithm.
Preferably, the macro F1 values of the category predictions of each word set in the test text set under different topic numbers are obtained with the following calculation formula:

macroF1 = (1/n) × Σ_{i=1}^{n} F1_i

where n represents the total number of text categories and F1_i represents the F1 value of the i-th category, with i ∈ [1, n].
Preferably, the F1 value of the i-th category is calculated as follows:

F1_i = 2 × P_i × R_i / (P_i + R_i)

where P_i is the precision and R_i is the recall.

The precision P_i is calculated as follows:

P_i = a_i / (a_i + b_i)

The recall R_i is calculated as follows:

R_i = a_i / (a_i + c_i)

where, for the i-th category C (C denotes a certain category): a_i is the number of test texts whose SVM category prediction is C and whose true category is C; b_i is the number of test texts whose SVM category prediction is C but whose true category is not C; and c_i is the number of test texts whose SVM category prediction is not C but whose true category is C.
Preferably, in step (4'), if the number of text classifiers obtained in step (4) is 1, the class prediction result obtained in step (3') is the final classification result of the target text. If three text classifiers are obtained in step (4): when the class prediction results of all three classifiers obtained in step (3') are the same, that class prediction is the final classification result; when two of the classifiers agree, their shared class prediction is the final classification result; and when all three predictions differ, the class prediction of the text classifier corresponding to the largest macro F1 value obtained in step (4) is the final classification result.
According to another aspect of the present invention, there is provided a text classification system based on part-of-speech classification, including:
the text classifier building module specifically comprises:
the first sub-module is used for acquiring a training text set and a test text set from a network, and preprocessing the training text set and the test text set to acquire a plurality of word sets of each text in the training text set and the test text set.
The second submodule is used for training a text theme generation model LDA by taking the plurality of word sets of each text obtained by the first submodule as input so as to obtain a text-word set-theme mixed probability distribution model of each text under different theme numbers;
the third submodule is used for carrying out classifier training on a mixed probability distribution model of a plurality of texts, word sets and topics of each word set in a training text set under different topic numbers by using an SVM-train function so as to obtain a plurality of trained classifiers;
a fourth sub-module, configured to use the plurality of text-word set-topic mixed probability distribution models of each type of text in the test text set under different topic numbers as input to the classifiers trained by the third sub-module, perform SVM category prediction, obtain the macro F1 value of each word set under each topic number according to the SVM category prediction results and the actual categories of the texts in the test text set, select the several largest values from the obtained macro F1 values, and respectively establish a plurality of text classifiers from the text-word set-topic mixed probability distribution models, word sets, and topic numbers corresponding to those values;
the text classification module specifically comprises:
the fifth sub-module is used for acquiring a target text to be classified, preprocessing the target text to acquire a plurality of word sets of the target text, and reserving the word sets corresponding to each text classifier acquired in the fourth sub-module from the acquired plurality of word sets;
the sixth submodule is used for training the text topic generation model LDA by taking the plurality of word sets of the target text reserved in the fifth submodule as input so as to obtain a text-word set-topic mixed probability distribution model of the plurality of word sets of the target text under the corresponding topic number;
a seventh sub-module, configured to perform category prediction on a text-word set-topic mixed probability distribution model of each word set of the target text obtained in the sixth sub-module in the number of corresponding topics as an input of a corresponding text classifier obtained in the fourth sub-module, so as to obtain a category prediction result of each text classifier;
and the eighth sub-module is used for obtaining the final classification result of the target text according to the class prediction result of each text classifier obtained by the seventh sub-module, combined with the weight value pre-assigned to each text classifier.
In general, compared with the prior art, the above technical solution contemplated by the present invention can achieve the following beneficial effects:
1. The method fully considers the different contributions that words of different parts of speech make to semantic expression: the training text set is divided into different word sets according to part of speech and part-of-speech combination, and the preprocessed word sets are further filtered so that only those corresponding to the text classifiers constructed in step (4) are kept, thereby reducing dimensionality;
2. When deciding the category of an unknown text, the method considers the different contributions of different word sets to semantic expression; since some word sets contribute more to semantic expression than the all-word set, selecting through step (4) the several text classifiers corresponding to large macro F1 values improves both the accuracy of text classification and the generalization capability of the text classifier.
Drawings
FIG. 1 is a flow chart of a text classification method based on part of speech classification according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
As shown in fig. 1, the text classification method based on part-of-speech classification of the present invention includes:
the text classifier construction process specifically comprises the following steps:
(1) acquiring a training text set and a test text set from a network, and preprocessing them to obtain a plurality of word sets for each text in the training text set and the test text set.
The test text set is a text set of known categories containing at least 100 texts; it is used to screen the text classifiers in the subsequent steps.
The step (1) specifically comprises the following substeps:
(1-1) performing word segmentation and part-of-speech tagging on each text in the training text set and the test text set respectively to obtain a mapping table between word segmentation results and part-of-speech;
specifically, the parts of speech are nouns, verbs, adjectives, adverbs, and the like.
For example, if a certain text in the text set contains the sentence "Engineers write books.", the mapping table obtained after the processing of this step is:

engineer noun
write verb
book noun
The specific process of word segmentation is as follows: for alphabetic languages, word segmentation mainly divides sentences into individual words according to spaces; for Chinese, word segmentation divides a sequence of Chinese characters into individual words.
More specifically, for Chinese text the word segmentation and part-of-speech tagging tool adopted in this step is the Chinese lexical analysis system NLPIR, developed at the Institute of Computing Technology of the Chinese Academy of Sciences.
(1-2) removing stop words from the mapping table between the word segmentation result and the part of speech obtained in the step (1-1) to obtain an updated mapping table;
specifically, stop words include scrambles, English characters, single numbers, mathematical symbols, punctuation marks, and high frequency single words.
For the example in step (1-1), the result obtained after the processing of this step is:

engineer noun
write verb
book noun
(1-3) dividing the updated mapping table obtained in the step (1-2) according to the parts of speech, and removing the corresponding parts of speech to respectively obtain a plurality of word sets;
specifically, the obtained plurality of word sets include a noun word set, a verb word set, a noun verb combined word set, other word sets, and all word sets.
Wherein, the words of each text in the noun word set are nouns; correspondingly, the verb word set is all verbs, and the noun verb word set comprises nouns and verbs; other word sets include adjectives and adverbs. All word sets include nouns, verbs, adjectives, adverbs.
For the example in step (1-1), the noun word set obtained in this step is:

engineer
book

The verb word set is:

write

The noun-verb combined word set is:

engineer
write
book

The other-word set is empty, and the all-word set is:

engineer
write
book
The dividing process of this step specifically constructs regular expressions that match words of different parts of speech in the text, and divides all words of the original text directly according to the part-of-speech tagging results.
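The division of steps (1-1) through (1-3) can be sketched as follows. This is a minimal illustration rather than the patent's NLPIR-based implementation: the input is assumed to be an already-tagged list of (word, part-of-speech) pairs, and the tag prefixes "n" (noun), "v" (verb), "a" (adjective), and "d" (adverb) are assumed conventions, not NLPIR's exact tag set.

```python
# Divide a tagged mapping table into the five word sets of step (1-3).
# Tag prefixes ("n", "v", "a", "d") are assumed conventions.

def divide_word_sets(tagged_words):
    """tagged_words: list of (word, pos) pairs from segmentation and POS tagging."""
    def by_prefix(prefixes):
        return [w for w, p in tagged_words if p.startswith(prefixes)]
    return {
        "noun": by_prefix("n"),
        "verb": by_prefix("v"),
        "noun_verb": by_prefix(("n", "v")),  # nouns and verbs, original order kept
        "other": by_prefix(("a", "d")),      # adjectives and adverbs
        "all": by_prefix(("n", "v", "a", "d")),
    }

sets_ = divide_word_sets([("engineer", "n"), ("write", "v"), ("book", "n")])
print(sets_["noun"])       # ['engineer', 'book']
print(sets_["noun_verb"])  # ['engineer', 'write', 'book']
print(sets_["other"])      # []
```

The same effect can be achieved with regular expressions over the tagged text, as the description notes; filtering the (word, tag) pairs directly is the simpler equivalent.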
(2) Training the text topic generation model LDA by taking the plurality of word sets of each text obtained in step (1) (the noun word set, verb word set, noun-verb combined word set, other-word set, and all-word set) as input, so as to obtain a text-word set-topic mixed probability distribution model of each text under different topic numbers;
specifically, the method adopted for training the LDA model is the Gibbs algorithm, and the input of the algorithm is a plurality of word sets and hyper-parameters of each text in all training text sets. The topic features (i.e., probability distributions) for each set of words of each text are output for different number of topics.
Specifically, the hyper-parameters of the LDA model mainly include α, β, K, and iter_number, where α is the prior parameter of the text-topic distribution, β is the prior parameter of the topic-word distribution, K is the manually set number of topics with value range [10, 150] and step size 10, and iter_number is the number of Gibbs sampling iterations, which is greater than 1000.
Numerous experiments have shown that the topic distribution of the corpus tends to stabilize when α = 50/K, β = 0.01, and the number of Gibbs sampling iterations is greater than or equal to 1000. To give the Markov chain a better convergence effect, the number of iterations is set to 1500. The number of topics K is a manually set value and needs to be determined from experimental results.
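The hyper-parameter choices above (α = 50/K, β = 0.01, K from 10 to 150 in steps of 10, 1500 iterations) can be enumerated as a configuration grid, with one LDA training run per word set per value of K. This sketch only builds the grid; the dictionary key names are illustrative:

```python
# Enumerate the LDA hyper-parameter settings described above:
# alpha = 50/K, beta = 0.01, K from 10 to 150 in steps of 10, 1500 Gibbs iterations.

def lda_configs(k_min=10, k_max=150, step=10, iter_number=1500, beta=0.01):
    return [
        {"K": k, "alpha": 50.0 / k, "beta": beta, "iter_number": iter_number}
        for k in range(k_min, k_max + 1, step)
    ]

grid = lda_configs()
print(len(grid))         # 15 topic-number settings per word set
print(grid[0]["alpha"])  # 5.0 (alpha = 50/10 for K = 10)
```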
(3) Performing classifier training on a plurality of text-word set-theme mixed probability distribution models of each word set in a training text set under different theme numbers by using an SVM-train function to obtain a plurality of trained classifiers;
(4) using a plurality of text-word set-topic mixed probability distribution models of each type of text in the test text set under different topic numbers as the input of the classifier trained in the step (3), performing SVM class prediction, obtaining macro F1 values of each word set under different topic numbers in the test text set according to SVM class prediction results and actual classes of each type of text in the test text set, selecting a plurality of maximum values (in the embodiment, the value range of the values is 1 or 3) from the obtained macro F1 values, and respectively establishing a plurality of text classifiers according to the text-word set-topic mixed probability distribution models, the corresponding word sets and the corresponding topic numbers corresponding to the values, wherein the number of the text classifiers is the same as the number of the selected values;
specifically, the category is a history-like text, an art-like text, a military-like text, or the like in the training data.
In SVM category prediction, data of two categories are first selected, one category serving as the positive class and the other as the negative class, and a classifier is then trained on the data of those two categories; one such binary classifier is trained for each pair of categories (the one-versus-one scheme).
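The one-versus-one scheme just described, with one binary classifier per pair of categories and the final label chosen by majority vote, can be sketched as follows. The pairwise rule here is a toy stand-in for a trained binary SVM; this shows only the voting logic, not LIBSVM's API:

```python
# One-versus-one multi-class prediction: query every pairwise classifier,
# then return the category with the most votes.
from itertools import combinations
from collections import Counter

def one_vs_one_predict(classes, predict_pair, x):
    """predict_pair(a, b, x) returns whichever of categories a, b it prefers for sample x."""
    votes = Counter(predict_pair(a, b, x) for a, b in combinations(classes, 2))
    return votes.most_common(1)[0][0]

# Toy pairwise rule standing in for a trained binary SVM:
# prefer the category whose numeric label is closer to the 1-D feature x.
def toy_pair(a, b, x):
    return a if abs(x - a) <= abs(x - b) else b

print(one_vs_one_predict([0, 1, 2], toy_pair, 1.9))  # 2
```

For k categories this trains and queries k(k-1)/2 binary classifiers, which is the multi-class strategy LIBSVM itself uses internally.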
The macro F1 value of the category predictions of each word set in the test text set under different topic numbers is obtained with the following calculation formula:

macroF1 = (1/n) × Σ_{i=1}^{n} F1_i

where n denotes the total number of text categories and F1_i denotes the F1 value of the i-th category, with i ∈ [1, n].

The F1 value of the i-th category is calculated as follows:

F1_i = 2 × P_i × R_i / (P_i + R_i)

where P_i is the precision and R_i is the recall.

The precision P_i is calculated as follows:

P_i = a_i / (a_i + b_i)

The recall R_i is calculated as follows:

R_i = a_i / (a_i + c_i)

where, for the i-th category C (C denotes a certain category): a_i is the number of test texts whose SVM category prediction is C and whose true category is C; b_i is the number of test texts whose SVM category prediction is C but whose true category is not C; and c_i is the number of test texts whose SVM category prediction is not C but whose true category is C.
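The per-category counts a_i, b_i, c_i and the macro F1 value defined above can be computed directly from predicted and true labels. A minimal sketch (function and variable names are illustrative, not from the patent):

```python
# Macro F1 from the formulas above: per-category precision P = a/(a+b),
# recall R = a/(a+c), F1 = 2PR/(P+R), averaged over all n categories.

def macro_f1(y_true, y_pred, classes):
    f1_values = []
    for cat in classes:
        a = sum(1 for t, p in zip(y_true, y_pred) if p == cat and t == cat)
        b = sum(1 for t, p in zip(y_true, y_pred) if p == cat and t != cat)
        c = sum(1 for t, p in zip(y_true, y_pred) if p != cat and t == cat)
        precision = a / (a + b) if a + b else 0.0
        recall = a / (a + c) if a + c else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1_values.append(f1)
    return sum(f1_values) / len(classes)

y_true = ["history", "history", "art", "military"]
y_pred = ["history", "art", "art", "military"]
print(round(macro_f1(y_true, y_pred, ["history", "art", "military"]), 4))  # 0.7778
```

Macro averaging weights every category equally regardless of its size, which is why it is a reasonable criterion for ranking word set and topic-number combinations.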
Secondly, a text classification process specifically comprises the following steps:
(1') obtaining a target text to be classified, preprocessing the target text to obtain a plurality of word sets of the target text, and reserving the word sets corresponding to each text classifier obtained in the step (4) from the obtained plurality of word sets;
specifically, the preprocessing process in this step is substantially the same as that in step (1), and the only difference is that, on the basis of the processing result in step (1-3), the word sets corresponding to the plurality of text classifiers obtained in step (4) are retained from the processing result.
(2 ') training a text topic generation model (LDA) by taking the plurality of word sets of the target text reserved in the step (1') as input to obtain a text-word set-topic mixed probability distribution model of the plurality of word sets of the target text under the corresponding topic number;
specifically, the method adopted for training the LDA model is the Gibbs algorithm, which inputs a plurality of word sets and hyper-parameters of the target text. The topic characteristics (namely probability distribution) of each word set of the target text under different topic numbers are output.
Specifically, the hyper-parameters of the LDA model mainly include α, β, K, and iter_number, where α is the prior parameter of the text-topic distribution, β is the prior parameter of the topic-word distribution, K is the manually set number of topics with value range [10, 150] and step size 10, and iter_number is the number of Gibbs sampling iterations, which is greater than 1000.
Numerous experiments have shown that the topic distribution of the corpus tends to stabilize when α = 50/K, β = 0.01, and the number of Gibbs sampling iterations is greater than or equal to 1000. To give the Markov chain a better convergence effect, the number of iterations is set to 1500. The number of topics K is a manually set value and needs to be determined from experimental results.
(3') performing category prediction by taking the text-word set-topic mixed probability distribution model of each word set of the target text obtained in step (2') under the corresponding topic number as input to the corresponding text classifier obtained in step (4), so as to obtain the category prediction result of each text classifier;
and (4') obtaining the final classification result of the target text according to the class prediction result of each text classifier obtained in step (3'), combined with the weight value pre-assigned to each text classifier.
Specifically, if the number of text classifiers obtained in step (4) is 1, the class prediction result obtained in step (3') is the final classification result of the target text.
If three text classifiers are obtained in step (4) (for example, a noun classifier, a verb classifier, and an adjective classifier): when the class prediction results of all three classifiers obtained in step (3') are the same, that class prediction is the final classification result; when two of the classifiers agree, their shared class prediction is the final classification result; and when all three predictions differ, the class prediction of the text classifier corresponding to the largest macro F1 value obtained in step (4) is the final classification result.
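The decision rule of step (4') for three classifiers (unanimity or a two-way majority wins; a three-way split falls back to the classifier with the largest macro F1) can be sketched as:

```python
# Combine three classifier predictions per step (4'): majority vote,
# falling back to the classifier with the highest test-set macro F1
# when all three predictions differ.
from collections import Counter

def final_class(predictions, macro_f1_scores):
    """predictions[i] and macro_f1_scores[i] belong to the same classifier."""
    label, count = Counter(predictions).most_common(1)[0]
    if count >= 2:  # unanimous, or two of three agree
        return label
    # three-way split: trust the classifier with the largest macro F1
    best = max(range(len(predictions)), key=lambda i: macro_f1_scores[i])
    return predictions[best]

print(final_class(["history", "art", "history"], [0.81, 0.78, 0.75]))   # history
print(final_class(["history", "art", "military"], [0.78, 0.81, 0.75]))  # art
```

Using the test-set macro F1 as the tiebreaker weight matches the description's notion of a weight pre-assigned to each text classifier.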
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.
Claims (8)
1. A text classification method based on part of speech classification is characterized by comprising the following steps:
the text classifier construction process specifically comprises the following steps:
(1) acquiring a training text set and a test text set from a network, and preprocessing the training text set and the test text set to obtain a plurality of word sets of each text in the training text set and the test text set; the step (1) specifically comprises the following substeps:
(1-1) performing word segmentation and part-of-speech tagging on each text in the training text set and the test text set respectively to obtain a mapping table between word segmentation results and part-of-speech;
(1-2) removing stop words from the mapping table between the word segmentation result and the part of speech obtained in the step (1-1) to obtain an updated mapping table;
(1-3) dividing the updated mapping table obtained in the step (1-2) according to the parts of speech, and removing the corresponding parts of speech to respectively obtain a plurality of word sets; the dividing process of the step is specifically that a regular expression which can be matched with words of different parts of speech in the text is constructed, and all words of the original text are divided directly according to the result of part of speech tagging;
(2) training a text topic generation model LDA by taking the plurality of word sets of each text obtained in the step (1) as input to obtain a text-word set-topic mixed probability distribution model of each text under different topic numbers;
(3) performing classifier training on a plurality of text-word set-theme mixed probability distribution models of each word set in a training text set under different theme numbers by using an SVM-train function to obtain a plurality of trained classifiers;
(4) using the plurality of text-word set-topic mixed probability distribution models of each type of text in the test text set under different topic numbers as input to the classifiers trained in step (3), performing SVM class prediction, obtaining the macro F1 value of each word set under each topic number according to the SVM class prediction results and the actual classes of the texts in the test text set, selecting the several largest values from the obtained macro F1 values, and respectively establishing a plurality of text classifiers from the text-word set-topic mixed probability distribution models, word sets, and topic numbers corresponding to those values;
secondly, a text classification process specifically comprises the following steps:
(1') obtaining a target text to be classified, preprocessing the target text to obtain a plurality of word sets of the target text, and reserving the word sets corresponding to each text classifier obtained in the step (4) from the obtained plurality of word sets;
(2 ') training a text topic generation model LDA by taking the plurality of word sets of the target text reserved in the step (1') as input to obtain a text-word set-topic mixed probability distribution model of the plurality of word sets of the target text under the corresponding topic number;
(3 ') performing category prediction by taking the text-word set-topic mixed probability distribution model of each word set of the target text obtained in the step (2') under the corresponding topic number as the input of the corresponding text classifier obtained in the step (4) to obtain a category prediction result of each text classifier;
and (4') obtaining the final classification result of the target text according to the class prediction result of each text classifier obtained in step (3'), combined with the weight value pre-assigned to each text classifier.
2. The method for classifying text based on part-of-speech classification as claimed in claim 1, wherein the plurality of word sets obtained in step (1-3) include a noun word set, a verb word set, a noun verb word set, other word sets, and all word sets.
3. The method of claim 1, wherein the LDA model is trained by a Gibbs sampling algorithm, whose input is the plurality of word sets of each text in the training text set together with the hyper-parameters, and whose output is the probability distribution of each word set of each text under different topic numbers.
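As an illustration of the Gibbs training procedure named in this claim, the following is a minimal collapsed Gibbs sampler for LDA. It is an illustrative sketch rather than the patented implementation; the hyper-parameter defaults and iteration count are assumptions.

```python
import numpy as np

def gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, beta=0.01,
              n_iter=200, seed=0):
    """Minimal collapsed Gibbs sampler for LDA.

    docs: list of documents, each a list of word ids in [0, vocab_size).
    Returns (theta, phi): per-document topic distributions and
    per-topic word distributions.
    """
    rng = np.random.default_rng(seed)
    n_docs = len(docs)
    # Count tables: document-topic, topic-word, and per-topic totals.
    ndk = np.zeros((n_docs, n_topics))
    nkw = np.zeros((n_topics, vocab_size))
    nk = np.zeros(n_topics)
    # Random initial topic assignment for every token.
    z = [rng.integers(n_topics, size=len(d)) for d in docs]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove the current assignment from the counts.
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # Full conditional P(z = k | all other assignments).
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + vocab_size * beta)
                p /= p.sum()
                k = rng.choice(n_topics, p=p)
                z[d][i] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    # Smoothed point estimates of the two distributions.
    theta = (ndk + alpha) / (ndk.sum(axis=1, keepdims=True) + n_topics * alpha)
    phi = (nkw + beta) / (nkw.sum(axis=1, keepdims=True) + vocab_size * beta)
    return theta, phi
```

Each row of `theta` is the topic mixture of one document (the text-word set-topic distribution the method feeds to the SVM), and each row of `phi` is a topic's distribution over the vocabulary.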
4. The method of claim 1, wherein the function used for SVM category prediction is the svm-predict function of the LIBSVM toolkit, and the SVM category prediction algorithm is a one-versus-one SVM multi-class classification algorithm.
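LIBSVM's one-versus-one scheme trains a binary SVM for every pair of categories and predicts by majority vote over the pairwise decisions. The voting stage can be sketched as follows (illustrative Python; the tie-breaking choice mimics LIBSVM's preference for the earlier class, which is an assumption here):

```python
from collections import Counter
from itertools import combinations

def one_vs_one_predict(pairwise_decisions, labels):
    """Majority vote over pairwise SVM decisions (one-versus-one scheme).

    pairwise_decisions: dict mapping each sorted label pair (a, b)
    to the label that pair's binary SVM chose for this sample.
    labels: the full list of category labels.
    """
    votes = Counter()
    for pair in combinations(sorted(labels), 2):
        votes[pairwise_decisions[pair]] += 1
    # Break vote ties in favour of the smaller label (assumed).
    return min(votes.items(), key=lambda kv: (-kv[1], kv[0]))[0]
```

With k categories this scheme evaluates k(k-1)/2 binary classifiers per sample; e.g. for labels {0, 1, 2} the three pairs (0,1), (0,2), (1,2) each cast one vote.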
5. The method for classifying texts based on part-of-speech classification as claimed in claim 1, wherein the macro F1 value of the category predictions of each word set in the test text set under different topic numbers is obtained using the following calculation formula:

macro F1 = (F1_1 + F1_2 + ... + F1_n) / n

where n denotes the total number of categories of the test texts, F1_i denotes the F1 value of the ith category, and i ∈ [1, n].
6. The method for classifying text according to part of speech according to claim 5, wherein the F1 value of the ith category is calculated as follows:

F1_i = (2 × P_i × R_i) / (P_i + R_i)

where P_i is the precision and R_i is the recall.

The precision P_i is calculated as:

P_i = a_i / (a_i + b_i)

The recall R_i is calculated as:

R_i = a_i / (a_i + c_i)

where, taking C to be the ith category: a_i denotes the number of texts in the test text set whose SVM category prediction is C and whose true category is also C; b_i denotes the number of texts whose SVM category prediction is C but whose true category is not C; and c_i denotes the number of texts whose SVM category prediction is not C but whose true category is C.
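The per-category counts above are the true positives (a_i), false positives (b_i), and false negatives (c_i) of category i, and the macro F1 of claim 5 is the unweighted mean of the per-category F1 values. A minimal sketch (illustrative Python; the zero-denominator guards are an assumption not stated in the claims):

```python
def macro_f1(counts):
    """Macro F1 from per-category counts.

    counts: list of (a_i, b_i, c_i) triples, where a_i counts true
    positives, b_i false positives, and c_i false negatives for
    category i.
    """
    f1s = []
    for a, b, c in counts:
        p = a / (a + b) if a + b else 0.0   # precision P_i
        r = a / (a + c) if a + c else 0.0   # recall R_i
        f1s.append(2 * p * r / (p + r) if p + r else 0.0)
    return sum(f1s) / len(f1s)
```

For instance, two categories with counts (8, 2, 2) and (5, 5, 5) give per-category F1 values of 0.8 and 0.5, hence a macro F1 of 0.65.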
7. The method for classifying texts based on part-of-speech classification as claimed in claim 1, wherein step (4') is specifically as follows: if the number of text classifiers obtained in step (4) is one, the category prediction result obtained in step (3') is the final classification result of the target text; if the number of text classifiers obtained in step (4) is three, then: if the category prediction results of the three text classifiers obtained in step (3') are all the same, that result is taken as the final classification result; if the category predictions of two of the text classifiers are the same, their shared prediction is taken as the final classification result; and if the category predictions of the three text classifiers all differ, the category prediction of the text classifier corresponding to the largest macro F1 value obtained in step (4) is taken as the final classification result.
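The decision rule of this claim can be sketched directly (illustrative Python; category labels are hypothetical):

```python
def final_category(predictions, macro_f1s):
    """Decision rule of step (4') for one or three text classifiers.

    predictions: category predictions, one per text classifier.
    macro_f1s: the macro F1 value of each classifier (same order),
    consulted only when all three classifiers disagree.
    """
    if len(predictions) == 1:
        return predictions[0]
    assert len(predictions) == 3, "the claim covers one or three classifiers"
    a, b, c = predictions
    if a == b or a == c:   # at least two classifiers agree on a's label
        return a
    if b == c:             # the other two agree on b's label
        return b
    # All three differ: fall back to the classifier with the
    # largest macro F1 value.
    return predictions[macro_f1s.index(max(macro_f1s))]
```

So with predictions "x", "y", "z" and macro F1 values 0.5, 0.7, 0.6, the second classifier's prediction "y" wins the three-way disagreement.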
8. A system for classifying text based on part-of-speech classification, comprising:
the text classifier building module specifically comprises:
the first sub-module is used for acquiring a training text set and a test text set from a network, and preprocessing the training text set and the test text set to acquire a plurality of word sets of each text in the training text set and the test text set; the module specifically comprises the following subunits:
the first subunit is used for respectively carrying out word segmentation and part-of-speech tagging on each text in the training text set and the test text set so as to obtain a mapping table between word segmentation results and part-of-speech;
the second subunit is used for eliminating stop words from the mapping table between the word segmentation result and the part of speech obtained in the first subunit to obtain an updated mapping table;
the third subunit is used for dividing the updated mapping table obtained by the second subunit according to part of speech and removing the part-of-speech tags, so as to obtain the plurality of word sets; the dividing process of this subunit specifically comprises constructing regular expressions capable of matching words of different parts of speech in a text and directly dividing all words of the original text according to the part-of-speech tagging results;
the second submodule is used for training the text topic generation model LDA by taking the plurality of word sets of each text obtained by the first submodule as input, so as to obtain the text-word set-topic mixed probability distribution model of each text under different topic numbers;
the third submodule is used for performing classifier training, using an SVM-train function, on the text-word set-topic mixed probability distribution models of each word set in the training text set under different topic numbers, so as to obtain a plurality of trained classifiers;
the fourth submodule is used for taking the text-word set-topic mixed probability distribution models of each text in the test text set under different topic numbers as the input of the classifiers trained in the third submodule and performing SVM category prediction; obtaining the macro F1 value of each word set under each topic number according to the SVM category prediction results and the actual category of each text in the test text set; selecting the several largest values from the obtained macro F1 values; and establishing a plurality of text classifiers from the text-word set-topic mixed probability distribution models, the corresponding word sets, and the corresponding topic numbers associated with those values;
the text classification module specifically comprises:
the fifth sub-module is used for acquiring a target text to be classified, preprocessing the target text to acquire a plurality of word sets of the target text, and reserving the word sets corresponding to each text classifier acquired in the fourth sub-module from the acquired plurality of word sets;
the sixth submodule is used for training the text topic generation model LDA by taking the plurality of word sets of the target text reserved in the fifth submodule as input so as to obtain a text-word set-topic mixed probability distribution model of the plurality of word sets of the target text under the corresponding topic number;
the seventh submodule is used for performing category prediction by taking the text-word set-topic mixed probability distribution model of each word set of the target text obtained by the sixth submodule under its corresponding topic number as the input of the corresponding text classifier obtained by the fourth submodule, so as to obtain the category prediction result of each text classifier;
and the eighth submodule is used for obtaining a final classification result of the target text according to the class prediction result of each text classifier obtained by the seventh submodule and by combining the weight value pre-distributed to each text classifier.
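The regular-expression division described for the third subunit can be sketched as follows. This is a minimal illustration assuming tagger output in the common "word/pos" form (as produced, for example, by jieba.posseg-style Chinese taggers); the sample sentence and tag set are hypothetical.

```python
import re

# Hypothetical tagger output: "word/pos" tokens separated by spaces
# ("stock market/n sharply/d rises/v investor/n confidence/n recovers/v").
TAGGED = "股市/n 大幅/d 上涨/v 投资者/n 信心/n 恢复/v"

def word_set(tagged_text, pos_pattern):
    """Extract words whose POS tag matches pos_pattern, dropping the tag."""
    regex = re.compile(r"(\S+)/(%s)\b" % pos_pattern)
    return [m.group(1) for m in regex.finditer(tagged_text)]

nouns = word_set(TAGGED, "n")         # noun word set
verbs = word_set(TAGGED, "v")         # verb word set
noun_verbs = word_set(TAGGED, "n|v")  # combined noun-and-verb word set
all_words = word_set(TAGGED, r"\w+")  # all-words set
```

Each alternative pattern yields one of the word sets of claim 2; the "other parts of speech" set would use the complement of the noun and verb tags.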
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810551315.3A CN108763539B (en) | 2018-05-31 | 2018-05-31 | Text classification method and system based on part-of-speech classification |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108763539A CN108763539A (en) | 2018-11-06 |
CN108763539B true CN108763539B (en) | 2020-11-10 |
Family
ID=64001297
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810551315.3A Active CN108763539B (en) | 2018-05-31 | 2018-05-31 | Text classification method and system based on part-of-speech classification |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108763539B (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110032639B (en) * | 2018-12-27 | 2023-10-31 | 中国银联股份有限公司 | Method, device and storage medium for matching semantic text data with tag |
CN110413773B (en) * | 2019-06-20 | 2023-09-22 | 平安科技(深圳)有限公司 | Intelligent text classification method, device and computer readable storage medium |
CN112184133A (en) * | 2019-07-02 | 2021-01-05 | 黎嘉明 | Artificial intelligence-based government office system preset approval and division method |
CN111090746B (en) * | 2019-11-29 | 2023-04-28 | 北京明略软件系统有限公司 | Method for determining optimal topic quantity, training method and device for emotion classifier |
CN111723206B (en) * | 2020-06-19 | 2024-01-19 | 北京明略软件系统有限公司 | Text classification method, apparatus, computer device and storage medium |
CN113761911A (en) * | 2021-03-17 | 2021-12-07 | 中科天玑数据科技股份有限公司 | Domain text labeling method based on weak supervision |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP6085888B2 (en) * | 2014-08-28 | 2017-03-01 | 有限責任監査法人トーマツ | Analysis method, analysis apparatus, and analysis program |
CN107291795B (en) * | 2017-05-03 | 2020-06-19 | 华南理工大学 | Text classification method combining dynamic word embedding and part-of-speech tagging |
Non-Patent Citations (1)
Title |
---|
"Chinese Short Text Classification Method Combining Semantic Expansion and Convolutional Neural Networks"; Lu Ling et al.; Journal of Computer Applications (《计算机应用》); 2017-12-10 (No. 12); pp. 3498-3503 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108763539B (en) | Text classification method and system based on part-of-speech classification | |
US10607598B1 (en) | Determining input data for speech processing | |
Vosoughi et al. | Tweet acts: A speech act classifier for twitter | |
Alwehaibi et al. | Comparison of pre-trained word vectors for arabic text classification using deep learning approach | |
WO2019000170A1 (en) | Generating responses in automated chatting | |
Fahad et al. | Inflectional review of deep learning on natural language processing | |
Fonseca et al. | A two-step convolutional neural network approach for semantic role labeling | |
CN111858935A (en) | Fine-grained emotion classification system for flight comment | |
Agrawal et al. | Affective representations for sarcasm detection | |
CN112052331A (en) | Method and terminal for processing text information | |
CN114528919A (en) | Natural language processing method and device and computer equipment | |
CN110705247A | Text similarity calculation method based on χ2-C | |
CN111339772B (en) | Russian text emotion analysis method, electronic device and storage medium | |
Tachicart et al. | Automatic identification of Moroccan colloquial Arabic | |
CN109543036A (en) | Text Clustering Method based on semantic similarity | |
Nerabie et al. | The impact of Arabic part of speech tagging on sentiment analysis: A new corpus and deep learning approach | |
Yousif | Hidden Markov Model tagger for applications based Arabic text: A review | |
Errami et al. | Sentiment Analysis on Moroccan Dialect based on ML and Social Media Content Detection | |
CN110728144A (en) | Extraction type document automatic summarization method based on context semantic perception | |
CN113065350A (en) | Biomedical text word sense disambiguation method based on attention neural network | |
Putra et al. | Sentiment Analysis on Social Media with Glove Using Combination CNN and RoBERTa | |
Zhu et al. | YUN111@ Dravidian-CodeMix-FIRE2020: Sentiment Analysis of Dravidian Code Mixed Text. | |
CN107729509B (en) | Discourse similarity determination method based on recessive high-dimensional distributed feature representation | |
KR20200040032A (en) | A method ofr classification of korean postings based on bidirectional lstm-attention | |
US20220366893A1 (en) | Systems and methods for few-shot intent classifier models |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||