CN110287311B - Text classification method and device, storage medium and computer equipment - Google Patents
- Publication number
- CN110287311B (application CN201910390290.8A)
- Authority
- CN
- China
- Prior art keywords
- text
- sample set
- classification
- samples
- sample
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06F16/3347—Query execution using vector based model
- G06F16/35—Clustering; Classification
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/24—Classification techniques
Abstract
The application discloses a text classification method and apparatus, a storage medium, and computer equipment. The method comprises: acquiring text samples of different text types; dividing the text samples into a first sample set and a second sample set according to the number of text samples of each text type, wherein the number of text samples of any text type contained in the first sample set is smaller than a preset threshold value and the number of text samples of any text type contained in the second sample set is greater than or equal to the preset threshold value; extracting feature keywords from the text samples contained in the first sample set; calculating, according to the text samples contained in the first sample set, the classification contribution degree of the feature keywords to the text samples of each text type contained in the first sample set; constructing a first text classifier according to the classification contribution degrees; training a second text classification model using the second sample set; and classifying a text to be recognized according to the first text classifier and the second text classification model.
Description
Technical Field
The present disclosure relates to the field of text classification technologies, and in particular, to a text classification method and apparatus, a storage medium, and a computer device.
Background
For text classification in the field of natural language processing, training a machine learning or deep learning model often runs into skewed training data: some text types have ample training samples while others have few. This uneven distribution biases model training and makes the text types with few training samples hard to predict, lowering the overall prediction quality of the model.
Prior-art text classification training methods often ignore this problem, either treating all samples alike or supplementing the small classes with an oversampling strategy. Ignoring the problem yields poor classification, while the right degree of oversampling is difficult to determine: too much easily causes overfitting, so oversampling also fails to effectively improve classification.
Disclosure of Invention
In view of the above, the present application provides a text classification method and apparatus, a storage medium, and computer equipment, which establish a separate classifier or classification model for text types whose sample counts are unevenly distributed, thereby achieving higher recognition accuracy.
According to one aspect of the present application, there is provided a text classification method, including:
acquiring text samples of different text types;
dividing the text samples into a first sample set and a second sample set according to the number of the text samples of each text type, wherein the number of the text samples of any text type contained in the first sample set is smaller than a preset threshold value, and the number of the text samples of any text type contained in the second sample set is larger than or equal to the preset threshold value;
extracting feature keywords from the text samples contained in the first sample set;
calculating, according to the text samples contained in the first sample set, the classification contribution degree of the feature keywords to the text samples of each text type contained in the first sample set;
constructing a first text classifier according to the classification contribution degree;
training a second text classification model using the second sample set;
and classifying the text to be recognized according to the first text classifier and the second text classification model.
According to another aspect of the present application, there is provided a text classification apparatus, including:
the sample acquisition module is used for acquiring text samples of different text types;
a sample set construction module, configured to divide the text samples into a first sample set and a second sample set according to the number of the text samples of each text type, where the number of the text samples of any text type included in the first sample set is smaller than a preset threshold value, and the number of the text samples of any text type included in the second sample set is greater than or equal to the preset threshold value;
the keyword extraction module is used for extracting feature keywords from the text samples contained in the first sample set;
the classification contribution degree calculation module is used for calculating, according to the text samples contained in the first sample set, the classification contribution degree of the feature keywords to the text samples of each text type contained in the first sample set;
the first classifier construction module is used for constructing a first text classifier according to the classification contribution degree;
the second classification model training module is used for training a second text classification model by using the second sample set;
and the classification module is used for classifying the text to be identified according to the first text classifier and the second text classification model.
According to still another aspect of the present application, there is provided a storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described text classification method.
According to still another aspect of the present application, there is provided a computer apparatus including a storage medium, a processor, and a computer program stored on the storage medium and executable on the processor, the processor implementing the above text classification method when executing the program.
By means of the above technical solution, the text classification method and apparatus, storage medium and computer equipment provided by the application first sort the text samples of each text type by sample count: a first sample set is built from the samples of text types with few samples, and a second sample set from the samples of text types with sufficient samples. Then a first text classifier and a second text classification model are established from the first and second sample sets respectively, for the small-sample and large-sample text types, wherein the first text classifier is determined by the classification contribution degrees of the feature keywords extracted from the first sample set to each text type. Finally, the first text classifier and the second text classification model together identify the text type of the text to be recognized. Because a first text classifier suited to recognizing small-sample text types and a second text classification model suited to recognizing large-sample text types are built separately for the sample texts of different text types, the present application achieves higher recognition accuracy than the prior art, in which a single model recognizes all text types and the types with insufficient samples are drowned out by the types with sufficient samples.
The foregoing description is only an overview of the technical solutions of the present application. To make the technical means of the present application clearer so that they can be implemented according to the specification, and to make the above and other objects, features and advantages of the present application easier to understand, a detailed description of the present application follows.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:
fig. 1 shows a schematic flow chart of a text classification method according to an embodiment of the present application;
FIG. 2 is a flow chart illustrating another text classification method according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a text classification device according to an embodiment of the present application;
fig. 4 shows a schematic structural diagram of another text classification device according to an embodiment of the present application.
Detailed Description
The present application will be described in detail hereinafter with reference to the accompanying drawings in conjunction with embodiments. It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other.
In this embodiment, a text classification method is provided, as shown in fig. 1, and the method includes:
and step 101, acquiring text samples of different text types.
In this embodiment, taking the identification of text types in the legal field as an example, text samples of different legal text types are obtained, where each text sample may be a passage of text, and the text types may specifically include: theft, dangerous driving, sheltering others to take drugs, drug trafficking, fraud, defamation, money laundering, reselling cultural relics, smuggling weapons and ammunition, and so on.
And step 102, dividing the text samples into a first sample set and a second sample set according to the number of text samples of each text type.
Among the obtained text samples, common crime types such as theft, dangerous driving and sheltering others to take drugs have quite sufficient samples, while less common crimes such as defamation, money laundering, reselling cultural relics and smuggling weapons and ammunition have few. This skews the sample data. Text types with sufficient samples are marked as large samples and those with insufficient samples as small samples, according to a preset threshold: if the number of text samples of a type is smaller than the preset threshold it is marked as a small sample, and if it is greater than or equal to the preset threshold it is marked as a large sample. The second sample set is established from all large samples and the first sample set from all small samples, so that separate recognition models can be established for the text types contained in the first sample set and those contained in the second sample set.
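As an illustration of this splitting step, the following Python sketch (not code from the patent) divides labeled samples by per-type count; the (text, label) sample format, the function name, and the default threshold of 1000 (borrowed from the example in the next embodiment) are assumptions:

```python
# Minimal sketch of step 102: split labeled samples into a small-sample set
# (first set) and a large-sample set (second set) by a preset threshold.
from collections import defaultdict

def split_sample_sets(samples, threshold=1000):
    """samples: iterable of (text, label) pairs; threshold is the preset value."""
    by_label = defaultdict(list)
    for text, label in samples:
        by_label[label].append((text, label))
    first_set, second_set = [], []
    for group in by_label.values():
        # Types with fewer samples than the threshold go to the first set.
        (first_set if len(group) < threshold else second_set).extend(group)
    return first_set, second_set
```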
And step 103, extracting feature keywords from the text samples contained in the first sample set.
For the text types with few samples, to build a classifier that identifies them, feature keywords must first be extracted from all the text samples in the first sample set; feature keywords are typically phrases that occur frequently in those text samples.
And step 104, calculating the classification contribution degree of the feature keywords to the text samples of each text type contained in the first sample set according to the text samples contained in the first sample set.
From the extracted feature keywords, the classification contribution degree of each feature keyword to each text type contained in the first sample set is calculated. The classification contribution degree can be understood as how much the appearance of a feature keyword in a text increases the probability that the text belongs to a given text type. For example, suppose the classification contribution degree of "heroin" to the "drug trafficking" text type is 5% and its contribution degree to the "sheltering others to take drugs" text type is 4%; then when the phrase "heroin" appears in a text, the probability that the text belongs to the drug trafficking type increases by 5%, and the probability that it belongs to the sheltering type increases by 4%.
And step 105, constructing a first text classifier according to the classification contribution degree.
After the classification contribution degree of each feature keyword to the different text types has been calculated, a first text classifier can be established from these contribution degrees. The first text classifier is used to calculate the probability that the text to be recognized belongs to each text type contained in the first sample set. For example, if the first sample set contains the text types defamation, money laundering, reselling cultural relics and smuggling weapons and ammunition, the first text classifier can calculate the probability that the text to be recognized belongs to any one of these types.
And step 106, training a second text classification model by using the second sample set.
For the text types with sufficient samples, a second text classification model can be trained using the second sample set. The second text classification model is used to calculate the probability that the text to be recognized belongs to each text type contained in the second sample set. For example, if the second sample set contains the text types theft, dangerous driving and sheltering others to take drugs, the trained second text classification model can calculate the probability that the text to be recognized belongs to any one of these types.
And step 107, classifying the text to be recognized according to the first text classifier and the second text classification model.
After the first text classifier and the second text classification model have been obtained, the text to be recognized can be classified and its text type identified; the text to be recognized may belong either to a text type contained in the first sample set or to one contained in the second sample set.
By applying the technical solution of this embodiment, the text samples of each text type are first sorted by sample count: a first sample set is established from the samples of text types with few samples, and a second sample set from the samples of text types with sufficient samples. Then a first text classifier and a second text classification model are established from the first and second sample sets respectively, for the small-sample and large-sample text types, wherein the first text classifier is determined by the classification contribution degrees of the feature keywords extracted from the first sample set to each text type. Finally, the two together identify the text type of the text to be recognized. Building a first text classifier suited to recognizing small-sample text types and a second text classification model suited to recognizing large-sample text types separately avoids the prior-art defect in which a single model for all text types lets the types with insufficient samples be drowned out by the types with sufficient samples, and thus yields higher recognition accuracy.
Further, as a refinement and extension of the foregoing embodiment, in order to fully describe the implementation procedure of this embodiment, another text classification method is provided, as shown in fig. 2, where the method includes:
Text samples of different text types are first acquired (step 201). And step 202, dividing the text samples into a first sample set and a second sample set according to the number of text samples corresponding to each text type: for example, the text samples of text types with fewer than 1000 samples are placed in the first sample set, and the text samples of text types with 1000 or more samples in the second sample set.
And step 203, performing word segmentation processing on the text samples contained in the first sample set according to a preset phrase comparison table to extract feature words.
The text samples in the first sample set are segmented according to a preset phrase comparison table. The table can be regarded as a dictionary containing a number of preset phrases that are useful for text classification; word segmentation extracts from each text sample the phrases that also appear in the dictionary, and the extracted phrases serve as the feature words of that text sample.
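A minimal sketch of this dictionary-lookup segmentation follows; the phrase table contents are hypothetical, and plain substring matching stands in for a real dictionary-based segmenter:

```python
# Sketch of step 203: keep only phrases that appear in the preset phrase
# comparison table as the feature words of a text sample.
PHRASE_TABLE = {"heroin", "money laundering", "defamation", "smuggling"}  # hypothetical

def extract_feature_words(text, phrase_table=PHRASE_TABLE):
    # Substring lookup is enough to illustrate the idea; a production system
    # for Chinese text would use a dictionary-based word segmenter.
    return [phrase for phrase in phrase_table if phrase in text]

print(extract_feature_words("convicted of money laundering and smuggling"))
```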
And step 204, counting the number of each feature word, and determining the feature keywords according to the number of each feature word.
After the feature words have been extracted, they need to be further screened. Specifically, the number of occurrences of each feature word in the text samples, i.e. the count of each feature word, is tallied, and the feature words with the highest counts are selected as feature keywords; for example, the top 60% of feature words by count may be taken as the feature keywords, where the 60% ratio can be adjusted to another value and is not limited here.
In addition, a feature word that occurs repeatedly in only one or a few text samples should not outrank other feature words merely because of those few samples. For example, suppose phrase A appears 50 times in one text sample but only 10 times in all the other samples, while phrase B appears 30 times spread across all text samples; if only one feature keyword may be selected from phrases A and B, selecting phrase A without considering how its occurrences are distributed over the samples would clearly be unreasonable. Therefore, in this embodiment, when counting the number of each feature word, the same feature word is counted at most once per text sample; that is, the number of a feature word is the number of text samples that contain it.
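The counting-and-screening rule of the two paragraphs above can be sketched as follows, assuming the segmented samples are given as lists of feature words; the 60% retention ratio mirrors the example above and is adjustable:

```python
# Sketch of step 204: count each feature word at most once per sample
# (document frequency), then keep the most frequent share as feature keywords.
from collections import Counter

def select_feature_keywords(segmented_samples, keep_ratio=0.6):
    doc_freq = Counter()
    for words in segmented_samples:
        doc_freq.update(set(words))  # set(): one count per sample at most
    ranked = [word for word, _ in doc_freq.most_common()]
    keep = max(1, int(len(ranked) * keep_ratio))
    return ranked[:keep]
```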
And step 205, calculating, according to the text samples contained in the first sample set, the classification contribution degree of the feature keywords to the text samples of each text type contained in the first sample set.
Specifically, the classification contribution degree is calculated according to the classification contribution degree calculation formula
P(Ci|Xj) = P(Xj|Ci) · P(Ci) / P(X)
where i = 1, 2, …, m, with m the number of text types contained in the first sample set and Ci the i-th text type; j = 1, 2, …, n, with n the number of feature keywords and Xj the j-th feature keyword; P(Xj|Ci) is the probability that feature keyword Xj appears in text samples of text type Ci; P(Ci) is the ratio of the number of text samples of type Ci in the first sample set to the number of all text samples in the first sample set; P(X) is a preset coefficient; and P(Ci|Xj) is the classification contribution degree of feature keyword Xj to the text samples of text type Ci.
For example, C1 may represent the defamation text type, C2 the money laundering text type, C3 the reselling-cultural-relics text type, and C4 the smuggling-weapons-and-ammunition text type.
Note that the calculation of P(Xj|Ci) can be illustrated as follows: suppose there are 10 training samples with class label Ci. If feature keyword Xj appears in 9 of those 10 samples, the probability is 9/10 = 90%; if it appears in 3 of the 10 samples, the probability is 3/10 = 30%.
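Combining the formula with this example, a sketch of the contribution calculation might look as follows; the function name and data layout are illustrative assumptions, and P(X) is fixed at 1.0 purely for demonstration:

```python
# Sketch of step 205: P(Ci|Xj) = P(Xj|Ci) * P(Ci) / P(X).
def classification_contributions(samples_by_type, keywords, p_x=1.0):
    """samples_by_type: dict mapping text type -> list of feature-word lists."""
    total = sum(len(docs) for docs in samples_by_type.values())
    contrib = {}  # (text type, keyword) -> P(Ci|Xj)
    for text_type, docs in samples_by_type.items():
        p_ci = len(docs) / total                              # class share of the first set
        for kw in keywords:
            p_xj_ci = sum(kw in d for d in docs) / len(docs)  # e.g. 9/10 = 90%
            contrib[(text_type, kw)] = p_xj_ci * p_ci / p_x
    return contrib
```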
And step 206, constructing a first text classifier according to the classification contribution degree.
Specifically, the first text classifier is constructed according to a first text classification formula
P(Ci|Y) = P(Ci|y1) + P(Ci|y2) + … + P(Ci|yl)
where k = 1, 2, …, l, with l the number of feature words in the sample Y to be predicted and yk the k-th feature word of sample Y; P(Ci|yk) is the classification contribution degree of feature word yk to the text samples of text type Ci. If feature word yk is identical to some feature keyword Xj, then P(Ci|yk) = P(Ci|Xj); if it is identical to no feature keyword Xj, then P(Ci|yk) = 0.
Specifically, suppose the first sample set contains 20 feature keywords in total. Then for any one text type Ci, P(Ci|Xj) denotes 20 values: the contribution probabilities of the 20 feature keywords to class Ci. The yk in P(Ci|yk) denotes a feature word of the sample to be predicted, extracted according to the preset phrase comparison table, so one or more feature words of the sample to be predicted may be absent from the feature keywords X corresponding to the first sample set; for those feature words that match no original feature keyword X, the classification contribution probability is taken as 0 in the calculation.
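Read this way, the first classifier's scoring reduces to a summation in which unmatched words contribute 0; a sketch, reusing the contribution table from the previous sketch:

```python
# Sketch of step 206: score each small-sample text type by summing the
# contributions of the sample's feature words; non-keywords contribute 0.
def first_classifier_scores(feature_words, contrib, text_types):
    return {
        t: sum(contrib.get((t, w), 0.0) for w in feature_words)  # 0.0 if w is no keyword
        for t in text_types
    }
# The predicted type is max(scores, key=scores.get).
```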
And step 207, performing word segmentation on the text samples in the second sample set to obtain the phrases corresponding to each text sample.
During word segmentation of a text sample, illegal characters can be removed and stop words filtered out. Suppose the phrases obtained after segmenting four texts are, respectively: text A ["a"]; text B ["b", "c", "b"]; text C ["a", "c"]; text D ["c", "d"].
And step 208, constructing the text vector corresponding to each text sample according to the phrases corresponding to that text sample.
The text samples selected in this embodiment are generally short texts, so each text vector is kept at a preset dimension: a text longer than the preset dimension is truncated, and one shorter than it is padded with 0 elements.
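A minimal sketch of this truncate-or-pad construction, using the toy vocabulary of texts A-D above; the preset dimension of 10 is an illustrative assumption:

```python
# Sketch of step 208: map phrases to integer ids, truncate to a preset
# dimension, and pad shorter texts with 0 elements.
def build_text_vector(phrases, vocab, dim=10):
    ids = [vocab.get(p, 0) for p in phrases][:dim]  # truncate if too long
    return ids + [0] * (dim - len(ids))             # zero-pad if too short

vocab = {"a": 1, "b": 2, "c": 3, "d": 4}
print(build_text_vector(["b", "c", "b"], vocab))    # [2, 3, 2, 0, 0, 0, 0, 0, 0, 0]
```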
And step 209, training the second text classification model by using the text vectors and the text types of the text samples corresponding to the text vectors.
The second text classification model may be a convolutional neural network model, an SVM (support vector machine) model, or another model commonly used for text classification. The architecture of the convolutional neural network is: convolution layer-pooling layer-fully connected layer. The convolution layer serves as the feature extraction layer: filters extract text features, and convolution kernel operations generate feature maps that are passed to the pooling layer. The pooling layer is a feature mapping layer that downsamples the feature maps produced by the convolution layer and outputs the best features. The fully connected softmax layer completes the classification task and outputs the classification probability for each text type contained in the second sample set.
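One possible concrete form of the convolution, pooling and fully connected architecture is sketched below in PyTorch; the patent does not fix any layer sizes, so every hyperparameter here is an assumption:

```python
# Sketch of step 209's convolutional text classifier:
# embedding -> convolution (feature extraction) -> max pooling -> softmax.
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=128, num_filters=64,
                 kernel_size=3, num_classes=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.conv = nn.Conv1d(embed_dim, num_filters, kernel_size)  # feature extraction layer
        self.pool = nn.AdaptiveMaxPool1d(1)                         # pooling layer
        self.fc = nn.Linear(num_filters, num_classes)               # fully connected layer

    def forward(self, x):                          # x: (batch, seq_len) word ids
        e = self.embed(x).transpose(1, 2)          # (batch, embed_dim, seq_len)
        h = self.pool(torch.relu(self.conv(e))).squeeze(-1)
        return torch.softmax(self.fc(h), dim=-1)   # per-type classification probabilities
```

In training one would normally return the raw logits and apply nn.CrossEntropyLoss, which includes the softmax; the explicit softmax above simply exposes the per-type probabilities described in the text.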
And step 210, performing word segmentation on the text to be recognized to obtain the phrases contained in the text to be recognized.
When recognizing the text type of a text to be recognized, the text is first segmented to obtain the phrases it contains, and the text type is then identified from these phrases.
And step 211, converting the phrases contained in the text to be recognized into the word vector corresponding to the text to be recognized.
Since the text types contained in the second sample set are the more common ones, the second text classification model is used first to judge whether the text to be recognized belongs to a text type contained in the second sample set; to this end, the phrases contained in the text to be recognized are converted into word vectors.
And step 212, inputting the word vector corresponding to the text to be recognized into the second text classification model.
The second text classification model outputs the probability that the text to be recognized belongs to each text type, where the text types in question are those contained in the text samples of the second sample set.
And step 213, if the maximum value among the probabilities is greater than or equal to a preset classification probability, determining the text type corresponding to that maximum value as the text type of the text to be recognized.
The maximum is found among the calculated probabilities that the text to be recognized belongs to each text type contained in the second sample set. If this maximum is greater than or equal to the preset classification probability, the text to be recognized very probably belongs to the corresponding text type, so the text type corresponding to the maximum probability value is determined as the text type of the text to be recognized.
And step 214, if the maximum value among the probabilities is smaller than the preset classification probability, determining the classification contribution degrees corresponding to the phrases contained in the text to be recognized according to the classification contribution degrees corresponding to the feature keywords.
If the maximum probability value is smaller than the preset classification probability, the text to be recognized is unlikely to belong to any text type corresponding to the second sample set, so the first text classifier is used to identify its text type. The classification contribution degree of each phrase extracted from the text to be recognized is determined from the previously calculated contribution degrees of the feature keywords: if the phrase is one of the feature keywords, the contribution degree of that feature keyword is taken as the contribution degree of the phrase; if it is not, the contribution degree of the phrase is 0.
And step 215, calculating, according to the classification contribution degrees corresponding to the phrases contained in the text to be recognized and the first text classifier, the probability that the text to be recognized belongs to each text type contained in the first sample set.
And step 216, determining the text type corresponding to the maximum value in the probability that the text to be recognized belongs to each text type contained in the first sample set as the text type of the text to be recognized.
The text type corresponding to the maximum probability calculated with the first text classifier is determined as the text type of the text to be recognized. It should be noted that this probability is not the actual probability that the text belongs to a given text type: a computed probability of 80% for text type X does not mean that the actual probability that the text belongs to type X is 80%. The value is used only for comparison with the probabilities of the other text types, so as to find the text type with the maximum probability and determine it as the text type of the text to be recognized.
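The whole two-stage decision of steps 210-216 then collapses into one short routine; a sketch assuming both models expose their outputs as type-to-score dictionaries, with the preset classification probability of 0.5 an assumption:

```python
# Sketch of the cascade: trust the large-sample model when it is confident,
# otherwise fall back to the small-sample keyword classifier.
def classify(second_model_probs, first_classifier_scores, p_min=0.5):
    text_type, prob = max(second_model_probs.items(), key=lambda kv: kv[1])
    if prob >= p_min:                    # step 213: confident large-sample type
        return text_type
    # Steps 214-216: the first text classifier decides among small-sample types.
    return max(first_classifier_scores.items(), key=lambda kv: kv[1])[0]
```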
With the above technical solution, a first sample set is built from the scarcer text samples and a second sample set from the ample ones, according to the number of text samples of the different text types. The first sample set is used to build a first text classifier suited to distinguishing the text types with insufficient samples, and the second sample set to train a second text classification model suited to distinguishing the text types with sufficient samples, avoiding the inaccuracy caused by unevenly distributed samples of different types. When classifying a text to be recognized, the second text classification model is applied first; if the text does not belong to a text type corresponding to the second sample set, the first text classifier identifies its text type. The text type of the text to be recognized is thereby determined, the difficulty of recognizing text types with few samples is overcome, and overall recognition accuracy is improved.
Further, as a specific implementation of the method of fig. 1, an embodiment of the present application provides a text classification apparatus, as shown in fig. 3, where the apparatus includes: the system comprises a sample acquisition module 31, a sample set construction module 32, a keyword extraction module 33, a classification contribution calculation module 34, a first classifier construction module 35, a second classification model training module 36 and a classification module 37.
A sample acquiring module 31, configured to acquire text samples of different text types;
a sample set construction module 32, configured to divide the text samples into a first sample set and a second sample set according to the number of text samples of each text type, where the number of text samples of any text type included in the first sample set is less than a preset threshold value, and the number of text samples of any text type included in the second sample set is greater than or equal to the preset threshold value;
a keyword extraction module 33, configured to extract feature keywords from text samples included in the first sample set;
a classification contribution calculation module 34, configured to calculate, according to the text samples included in the first sample set, a classification contribution of the feature keyword to the text samples of each text type included in the first sample set;
a first classifier construction module 35, configured to construct a first text classifier according to the classification contribution;
a second classification model training module 36 for training a second text classification model using the second sample set;
the classification module 37 is configured to classify the text to be identified according to the first text classifier and the second text classification model.
In a specific application scenario, as shown in fig. 4, the keyword extraction module 33 specifically includes: a feature word extraction unit 331 and a keyword extraction unit 332.
The feature word extracting unit 331 is configured to perform word segmentation processing on a text sample included in the first sample set according to a preset phrase comparison table to extract feature words;
the keyword extraction unit 332 is configured to count the number of each feature word, and determine the feature keywords according to the number of each feature word.
The classification contribution degree calculation module 34 is specifically configured to calculate the classification contribution degree according to the classification contribution degree calculation formula
P(Ci|Xj) = P(Xj|Ci) · P(Ci) / P(X)
where i = 1, 2, …, m, with m the number of text types contained in the first sample set and Ci the i-th text type; j = 1, 2, …, n, with n the number of feature keywords and Xj the j-th feature keyword; P(Xj|Ci) is the probability that feature keyword Xj appears in text samples of text type Ci; P(Ci) is the ratio of the number of text samples of type Ci in the first sample set to the number of all text samples in the first sample set; P(X) is a preset coefficient; and P(Ci|Xj) is the classification contribution degree of feature keyword Xj to the text samples of text type Ci.
The first classifier construction module 35 is specifically configured to construct the first text classifier according to the first text classification formula
P(Ci|Y) = P(Ci|y1) + P(Ci|y2) + … + P(Ci|yl)
where k = 1, 2, …, l, with l the number of feature words in the sample Y to be predicted and yk the k-th feature word; P(Ci|yk) is the classification contribution degree of feature word yk to the text samples of text type Ci; if feature word yk is identical to some feature keyword Xj, then P(Ci|yk) = P(Ci|Xj), and if it is identical to no feature keyword Xj, then P(Ci|yk) = 0.
In a specific application scenario, as shown in fig. 4, the second classification model training module 36 specifically includes: a first word segmentation unit 361, a first word vector construction unit 362, and a second classification model training unit 363.
A first word segmentation unit 361, configured to perform word segmentation on the text samples in the second sample set to obtain a phrase corresponding to each text sample;
a first word vector constructing unit 362, configured to construct a text vector corresponding to each text sample according to the phrase corresponding to each text sample;
a second classification model training unit 363 is configured to train the second text classification model by using the text vector and the text type of the text sample corresponding to the text vector, where the second text classification model is a convolutional neural network model.
In a specific application scenario, as shown in fig. 4, the classification module 37 specifically includes: a second word segmentation unit 371, a second word vector construction unit 372, a second text type recognition unit 373, a second text type determination unit 374, a classification contribution degree determination unit 375, a first text type recognition unit 376, and a first text type determination unit 377.
The second word segmentation unit 371 is used for performing word segmentation on the text to be recognized to obtain a phrase contained in the text to be recognized;
a second word vector construction unit 372, configured to convert a phrase included in the text to be recognized into a word vector corresponding to the text to be recognized;
a second text type recognition unit 373, configured to input a word vector corresponding to the text to be recognized into a second text classification model, so as to obtain a probability that the text to be recognized belongs to each text type included in the second sample set;
and a second text type determining unit 374, configured to determine, as the text type of the text to be recognized, the text type corresponding to the maximum value in the probabilities if the maximum value in the probabilities is greater than or equal to the preset classification probability.
The classification contribution degree determining unit 375 is configured to determine, if the maximum value of the probabilities is smaller than a preset classification probability, a classification contribution degree corresponding to a phrase included in the text to be identified according to the classification contribution degree corresponding to the feature keyword;
a first text type recognition unit 376, configured to calculate, according to a classification contribution corresponding to a phrase included in the text to be recognized and a first text classifier, a probability that the text to be recognized belongs to each text type included in the first sample set;
the first text type determining unit 377 is configured to determine, as the text type of the text to be recognized, the text type corresponding to the maximum value in the probabilities that the text to be recognized belongs to each text type included in the first sample set.
It should be noted that, for other corresponding descriptions of each functional unit related to the text classification device provided in the embodiment of the present application, reference may be made to corresponding descriptions in fig. 1 and fig. 2, and no further description is given here.
Based on the above-mentioned methods shown in fig. 1 and 2, correspondingly, the embodiments of the present application further provide a storage medium, on which a computer program is stored, which when executed by a processor, implements the above-mentioned text classification method shown in fig. 1 and 2.
Based on such understanding, the technical solution of the present application may be embodied in the form of a software product that can be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash drive, or a removable hard disk) and includes several instructions for causing a computer device (a personal computer, a server, a network device, or the like) to perform the methods described in the various implementation scenarios of the present application.
Based on the methods shown in fig. 1 and fig. 2 and the virtual device embodiments shown in fig. 3 and fig. 4, in order to achieve the above objects, the embodiments of the present application further provide a computer device, which may specifically be a personal computer, a server, a network device, etc., where the computer device includes a storage medium and a processor; a storage medium storing a computer program; a processor for executing a computer program to implement the text classification method as described above and shown in fig. 1 and 2.
Optionally, the computer device may also include a user interface, a network interface, a camera, radio frequency (Radio Frequency, RF) circuitry, sensors, audio circuitry, a WI-FI module, and the like. The user interface may include a display screen (Display) and an input unit such as a keyboard (Keyboard), and optionally may also include a USB interface, a card reader interface, and the like. The network interface may optionally include a standard wired interface, a wireless interface (such as a Bluetooth interface or WI-FI interface), and the like.
It will be appreciated by those skilled in the art that the computer device structure provided in this embodiment does not limit the computer device, which may include more or fewer components, combine certain components, or arrange the components differently.
The storage medium may also include an operating system and a network communication module. The operating system is a program that manages and maintains the hardware and software resources of the computer device, supporting the execution of the information processing program and other software and/or programs. The network communication module is used to implement communication among the components within the storage medium and with other hardware and software in the physical device.
Through the description of the above embodiments, those skilled in the art can clearly understand that the present application may be implemented by software plus a necessary general hardware platform, or by hardware. The text samples of each text type are first sorted by sample count: a first sample set is established from the samples of text types with few samples, and a second sample set from the samples of text types with sufficient samples. A first text classifier and a second text classification model are then established from the first and second sample sets respectively, for the small-sample and large-sample text types, wherein the first text classifier is determined by the classification contribution degrees of the feature keywords extracted from the first sample set to each text type. Finally, the two together identify the text type of the text to be recognized. Because a classifier or classification model is built separately for the different text types, the recognition accuracy is higher than in the prior art, where a single model recognizing all text types lets the types with insufficient samples be drowned out by the types with sufficient samples.
Those skilled in the art will appreciate that the drawings are merely schematic illustrations of a preferred implementation scenario, and that the modules or flows in the drawings are not necessarily required to practice the present application. Those skilled in the art will also appreciate that the modules of an apparatus in an implementation scenario may be distributed across the apparatus as described in the implementation scenario, or may be relocated, with corresponding changes, into one or more apparatuses different from the present scenario; the modules of the implementation scenario may be combined into one module or further split into a plurality of sub-modules.
The foregoing embodiment serial numbers are merely for description and do not indicate that one implementation scenario is better or worse than another. The foregoing disclosure is only a few specific implementations of the present application, but the present application is not limited thereto, and any variation that a person skilled in the art can conceive shall fall within the protection scope of the present application.
Claims (8)
1. A method of text classification, comprising:
acquiring text samples of different text types;
dividing the text samples into a first sample set and a second sample set according to the number of the text samples of each text type, wherein the number of the text samples of any text type contained in the first sample set is smaller than a preset threshold value, and the number of the text samples of any text type contained in the second sample set is larger than or equal to the preset threshold value;
according to a preset phrase comparison table, performing word segmentation processing on text samples contained in the first sample set to extract feature words; counting the number of each feature word, and determining the feature keywords according to the number of each feature word;
calculating the classification contribution degree of the characteristic keywords to the text samples of each text type contained in the first sample set according to the text samples contained in the first sample set;
constructing a first text classifier according to the classification contribution degree;
training a second text classification model using the second sample set;
classifying the text to be identified according to the first text classifier and the second text classification model;
wherein the classification contribution degree is calculated according to a classification contribution degree calculation formula, the classification contribution degree calculation formula being
P(Ci|Xj) = P(Xj|Ci) · P(Ci) / P(X)
wherein i = 1, 2, …, m, m being the number of text types contained in the first sample set, and Ci represents the i-th text type; j = 1, 2, …, n, n being the number of feature keywords, and Xj represents the j-th feature keyword; P(Xj|Ci) represents the probability that feature keyword Xj appears in text samples of text type Ci; P(Ci) represents the ratio of the number of text samples of text type Ci in the first sample set to the number of all text samples in the first sample set; P(X) represents a preset coefficient; and P(Ci|Xj) represents the classification contribution degree of feature keyword Xj to the text samples of text type Ci.
2. The method according to claim 1, wherein constructing a first text classifier according to the classification contribution comprises:
constructing the first text classifier according to a first text classification formula, the first text classification formula being
P(Ci|Y) = P(Ci|y1) + P(Ci|y2) + … + P(Ci|yl)
wherein k = 1, 2, …, l, l being the number of feature words in the sample Y to be predicted, and yk represents the k-th feature word of sample Y; P(Ci|yk) represents the classification contribution degree of feature word yk to the text samples of text type Ci; if feature word yk is identical to a feature keyword Xj, then P(Ci|yk) = P(Ci|Xj), and if feature word yk is identical to no feature keyword Xj, then P(Ci|yk) = 0.
3. The method according to claim 2, wherein training a second text classification model using the second sample set, in particular comprises:
word segmentation is carried out on the text samples in the second sample set, so that phrases corresponding to each text sample are obtained;
constructing a text vector corresponding to each text sample according to the phrase corresponding to each text sample;
and training the second text classification model by using the text vector and the text type of the text sample corresponding to the text vector, wherein the second text classification model is a convolutional neural network model.
4. A method according to claim 3, wherein classifying the text to be identified according to the first text classifier and the second text classification model comprises:
word segmentation is carried out on the text to be identified, so that a phrase contained in the text to be identified is obtained;
converting the word group contained in the text to be recognized into a word vector corresponding to the text to be recognized;
inputting word vectors corresponding to the texts to be recognized into the second text classification model to obtain the probability that the texts to be recognized belong to each text type contained in the second sample set;
if the maximum value in the probabilities is greater than or equal to a preset classification probability, determining the text type corresponding to the maximum value in the probabilities as the text type of the text to be identified.
5. The method of claim 4, wherein classifying the text to be identified according to the first text classifier and the second text classification model, in particular further comprises:
if the maximum value in the probabilities is smaller than the preset classification probability, determining the classification contribution degree corresponding to the phrase contained in the text to be identified according to the classification contribution degree corresponding to the characteristic key words;
calculating the probability that the text to be recognized belongs to each text type contained in the first sample set according to the classification contribution degree corresponding to the phrase contained in the text to be recognized and the first text classifier;
and determining the text type corresponding to the maximum value in the probability that the text to be identified belongs to each text type contained in the first sample set as the text type of the text to be identified.
6. A text classification device, comprising:
the sample acquisition module is used for acquiring text samples of different text types;
a sample set construction module, configured to divide the text samples into a first sample set and a second sample set according to the number of the text samples of each text type, where the number of the text samples of any text type included in the first sample set is less than a preset threshold value, and the number of the text samples of any text type included in the second sample set is greater than or equal to the preset threshold value;
the keyword extraction module is used for carrying out word segmentation processing on the text samples contained in the first sample set according to a preset phrase comparison table to extract feature words; counting the number of each feature word, and determining the feature keywords according to the number of each feature word;
the classification contribution degree calculation module is used for calculating the classification contribution degree of the characteristic keywords to the text samples of each text type contained in the first sample set according to the text samples contained in the first sample set;
the first classifier construction module is used for constructing a first text classifier according to the classification contribution degree;
the second classification model training module is used for training a second text classification model by using the second sample set;
the classification module is used for classifying the text to be identified according to the first text classifier and the second text classification model;
wherein the classification contribution degree is calculated according to a classification contribution degree calculation formula, the classification contribution degree calculation formula being
P(Ci|Xj) = P(Xj|Ci) · P(Ci) / P(X)
wherein i = 1, 2, …, m, m being the number of text types contained in the first sample set, and Ci represents the i-th text type; j = 1, 2, …, n, n being the number of feature keywords, and Xj represents the j-th feature keyword; P(Xj|Ci) represents the probability that feature keyword Xj appears in text samples of text type Ci; P(Ci) represents the ratio of the number of text samples of text type Ci in the first sample set to the number of all text samples in the first sample set; P(X) represents a preset coefficient; and P(Ci|Xj) represents the classification contribution degree of feature keyword Xj to the text samples of text type Ci.
7. A storage medium having stored thereon a computer program, wherein the program when executed by a processor implements the text classification method of any of claims 1 to 5.
8. A computer device comprising a storage medium, a processor and a computer program stored on the storage medium and executable on the processor, characterized in that the processor implements the text classification method of any of claims 1 to 5 when executing the program.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910390290.8A CN110287311B (en) | 2019-05-10 | 2019-05-10 | Text classification method and device, storage medium and computer equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910390290.8A CN110287311B (en) | 2019-05-10 | 2019-05-10 | Text classification method and device, storage medium and computer equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110287311A (en) | 2019-09-27
CN110287311B (en) | 2023-05-26
Family
ID=68001583
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910390290.8A Active CN110287311B (en) | 2019-05-10 | 2019-05-10 | Text classification method and device, storage medium and computer equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110287311B (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111400499A (en) * | 2020-03-24 | 2020-07-10 | 网易(杭州)网络有限公司 | Training method of document classification model, document classification method, device and equipment |
CN112329475B (en) * | 2020-11-03 | 2022-05-20 | 海信视像科技股份有限公司 | Statement processing method and device |
CN113011503B (en) * | 2021-03-17 | 2021-11-23 | 彭黎文 | Data evidence obtaining method of electronic equipment, storage medium and terminal |
CN113051385B (en) * | 2021-04-28 | 2023-05-26 | 杭州网易再顾科技有限公司 | Method, medium, device and computing equipment for intention recognition |
CN113779257A (en) * | 2021-09-28 | 2021-12-10 | 京东城市(北京)数字科技有限公司 | Method, device, equipment, medium and product for analyzing text classification model |
CN116226382B (en) * | 2023-02-28 | 2023-08-01 | 北京数美时代科技有限公司 | Text classification method and device for given keywords, electronic equipment and medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2009078096A1 (en) * | 2007-12-18 | 2009-06-25 | Fujitsu Limited | Generating method of two class classification prediction model, program for generating classification prediction model and generating device of two class classification prediction model |
CN102081627A (en) * | 2009-11-27 | 2011-06-01 | 北京金山软件有限公司 | Method and system for determining contribution degree of word in text |
CN106294466A (en) * | 2015-06-02 | 2017-01-04 | 富士通株式会社 | Disaggregated model construction method, disaggregated model build equipment and sorting technique |
CN106610949A (en) * | 2016-09-29 | 2017-05-03 | 四川用联信息技术有限公司 | Text feature extraction method based on semantic analysis |
CN109492026A (en) * | 2018-11-02 | 2019-03-19 | 国家计算机网络与信息安全管理中心 | A kind of Telecoms Fraud classification and Detection method based on improved active learning techniques |
CN109583474A (en) * | 2018-11-01 | 2019-04-05 | 华中科技大学 | A kind of training sample generation method for the processing of industrial big data |
- 2019-05-10: CN application CN201910390290.8A filed; patent CN110287311B (en), status Active
Also Published As
Publication number | Publication date |
---|---|
CN110287311A (en) | 2019-09-27 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||