CN110287311B - Text classification method and device, storage medium and computer equipment - Google Patents

Text classification method and device, storage medium and computer equipment Download PDF

Info

Publication number
CN110287311B
Authority
CN
China
Prior art keywords
text
sample set
classification
samples
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910390290.8A
Other languages
Chinese (zh)
Other versions
CN110287311A (en)
Inventor
钱柏丞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201910390290.8A priority Critical patent/CN110287311B/en
Publication of CN110287311A publication Critical patent/CN110287311A/en
Application granted granted Critical
Publication of CN110287311B publication Critical patent/CN110287311B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3347 Query execution using vector based model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques

Abstract

The application discloses a text classification method and device, a storage medium and computer equipment, wherein the method comprises the following steps: acquiring text samples of different text types; dividing the text samples into a first sample set and a second sample set according to the number of text samples of each text type, wherein the number of text samples of any text type contained in the first sample set is smaller than a preset threshold value and the number of text samples of any text type contained in the second sample set is greater than or equal to the preset threshold value; extracting feature keywords from the text samples contained in the first sample set; calculating, from the text samples contained in the first sample set, the classification contribution degree of the feature keywords to the text samples of each text type contained in the first sample set; constructing a first text classifier according to the classification contribution degree; training a second text classification model using the second sample set; and classifying the text to be identified according to the first text classifier and the second text classification model.

Description

Text classification method and device, storage medium and computer equipment
Technical Field
The present disclosure relates to the field of text classification technologies, and in particular, to a text classification method and apparatus, a storage medium, and a computer device.
Background
For the text classification problem in the field of natural language processing, skewed (imbalanced) training sample data is often encountered when training a machine learning or deep learning model: the number of training samples is sufficient for some text types but small for others. This uneven distribution of training samples biases model training, making the text types with fewer training samples difficult to predict and thus degrading the overall prediction performance of the model.
Prior-art text classification training methods often ignore this problem, either treating all samples alike or supplementing small-sample types with an oversampling strategy. Ignoring the problem yields a model that classifies poorly, while the right degree of oversampling is difficult to judge: too much easily causes overfitting, and the strategy likewise fails to effectively improve classification performance.
Disclosure of Invention
In view of the above, the present application provides a text classification method and apparatus, a storage medium, and a computer device, which respectively establish a classifier or a classification model for different text types with non-uniform sample number distribution, so that the recognition accuracy is higher.
According to one aspect of the present application, there is provided a text classification method, including:
acquiring text samples of different text types;
dividing the text samples into a first sample set and a second sample set according to the number of the text samples of each text type, wherein the number of the text samples of any text type contained in the first sample set is smaller than a preset threshold value, and the number of the text samples of any text type contained in the second sample set is larger than or equal to the preset threshold value;
extracting feature keywords from the text samples contained in the first sample set;
calculating the classification contribution degree of the feature keywords to the text samples of each text type contained in the first sample set according to the text samples contained in the first sample set;
constructing a first text classifier according to the classification contribution degree;
training a second text classification model using the second sample set;
and classifying the text to be recognized according to the first text classifier and the second text classification model.
According to another aspect of the present application, there is provided a text classification apparatus, including:
the sample acquisition module is used for acquiring text samples of different text types;
a sample set construction module, configured to divide the text samples into a first sample set and a second sample set according to the number of the text samples of each text type, where the number of the text samples of any text type included in the first sample set is smaller than a preset threshold value, and the number of the text samples of any text type included in the second sample set is greater than or equal to the preset threshold value;
the keyword extraction module is used for extracting feature keywords from the text samples contained in the first sample set;
the classification contribution degree calculation module is used for calculating the classification contribution degree of the feature keywords to the text samples of each text type contained in the first sample set according to the text samples contained in the first sample set;
the first classifier construction module is used for constructing a first text classifier according to the classification contribution degree;
the second classification model training module is used for training a second text classification model by using the second sample set;
and the classification module is used for classifying the text to be identified according to the first text classifier and the second text classification model.
According to still another aspect of the present application, there is provided a storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described text classification method.
According to still another aspect of the present application, there is provided a computer apparatus including a storage medium, a processor, and a computer program stored on the storage medium and executable on the processor, the processor implementing the above text classification method when executing the program.
By means of the above technical scheme, the text classification method and device, storage medium and computer equipment provided by the present application first divide the text samples of each text type by sample count: a first sample set is built from the samples of text types with fewer samples, and a second sample set is built from the samples of text types with sufficient samples. Then a first text classifier and a second text classification model are established for the small-sample and large-sample text types using the first and second sample sets respectively, where the first text classifier is determined by the classification contribution degree of the feature keywords extracted from the first sample set to each text type. Finally, the text type of the text to be identified is recognized using the first text classifier and the second text classification model together. By building a first text classifier suited to recognizing small-sample text types and a second text classification model suited to recognizing large-sample text types for sample texts of different text types, the present application avoids the prior-art defect of building a single model for all text types, in which text types with insufficient samples are drowned out by text types with sufficient samples, and thus achieves higher recognition accuracy.
The foregoing description is only an overview of the technical solutions of the present application, and may be implemented according to the content of the specification in order to make the technical means of the present application more clearly understood, and in order to make the above-mentioned and other objects, features and advantages of the present application more clearly understood, the following detailed description of the present application will be given.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:
fig. 1 shows a schematic flow chart of a text classification method according to an embodiment of the present application;
FIG. 2 is a flow chart illustrating another text classification method according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a text classification device according to an embodiment of the present application;
fig. 4 shows a schematic structural diagram of another text classification device according to an embodiment of the present application.
Detailed Description
The present application will be described in detail hereinafter with reference to the accompanying drawings in conjunction with embodiments. It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other.
In this embodiment, a text classification method is provided, as shown in fig. 1, and the method includes:
and step 101, acquiring text samples of different text types.
In this embodiment, taking the identification of certain text types in the legal domain as an example, text samples of different law-related text types are obtained, where a text sample may be a passage of text, and the text types may specifically include the following: theft, dangerous driving, harboring others to take drugs, drug trafficking, fraud, defamation, money laundering, reselling cultural relics, smuggling, and so on.
Step 102, dividing the text samples into a first sample set and a second sample set according to the number of the text samples of each text type, wherein the number of the text samples of any text type contained in the first sample set is smaller than a preset threshold value, and the number of the text samples of any text type contained in the second sample set is larger than or equal to the preset threshold value.
Among the text samples obtained above, common crime types such as theft, dangerous driving, and harboring others to take drugs have quite sufficient sample counts, while unusual crime types such as defamation, money laundering, reselling cultural relics, and smuggling weapons and ammunition have few samples. This results in skewed sample data; text types with a sufficient number of samples are marked as large samples and those with an insufficient number as small samples. The second sample set is built from all large samples and the first sample set from all small samples. Large and small samples are marked against a preset threshold: if the number of text samples of a type is smaller than the preset threshold, the type is marked as a small sample; if it is greater than or equal to the preset threshold, it is marked as a large sample. The first and second sample sets are thereby established, so that recognition models can be built separately for the text types contained in each set, as sketched below.
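As an illustration of this division, a minimal Python sketch follows (not part of the patent text; the function name, the (text, label) tuple format, and the use of Python are assumptions):

```python
from collections import Counter

def split_sample_sets(samples, threshold):
    """Divide (text, label) pairs into the first (small-sample) set and the
    second (large-sample) set by comparing each type's sample count with
    the preset threshold, as in step 102."""
    counts = Counter(label for _, label in samples)
    first_set = [(t, l) for t, l in samples if counts[l] < threshold]
    second_set = [(t, l) for t, l in samples if counts[l] >= threshold]
    return first_set, second_set
```

With the threshold of 1000 used later in step 202, a type with 30 samples would go entirely into the first set and a type with 5000 samples into the second.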
Step 103, extracting feature keywords from the text samples contained in the first sample set.
For text types with a small number of samples, to build a classifier that identifies them, feature keywords should first be extracted from all the text samples in the first sample set; feature keywords are typically phrases that occur frequently in those text samples.
Step 104, calculating the classification contribution degree of the feature keywords to the text samples of each text type contained in the first sample set according to the text samples contained in the first sample set.
According to the extracted feature keywords, the classification contribution degree of each feature keyword to each text type contained in the first sample set is calculated. The classification contribution degree can be described as how much the probability that a text belongs to a given text type increases when a given feature keyword appears in that text. For example, assuming the classification contribution degree of "heroin" to the "drug trafficking" text type is 5% and its classification contribution degree to the "harboring others to take drugs" text type is 4%, then when the phrase "heroin" appears in a piece of text, the probability that the text belongs to the drug-trafficking type increases by 5% and the probability that it belongs to the harboring type increases by 4%.
Step 105, constructing a first text classifier according to the classification contribution degree.
After the classification contribution degree of each feature keyword to the different text types has been calculated, a first text classifier can be established from these contribution degrees. The first text classifier is used to calculate the probability that the text to be identified belongs to each text type contained in the first sample set; for example, if the first sample set contains the text types defamation, money laundering, reselling cultural relics, and smuggling weapons and ammunition, the first text classifier can calculate the probability that the text to be identified belongs to any one of these text types.
Step 106, training a second text classification model by using the second sample set.
For text types with a sufficient number of samples, a second text classification model may be trained using the second sample set. The second text classification model is used to calculate the probability that the text to be identified belongs to each text type contained in the second sample set; for example, if the second sample set contains the text types theft, dangerous driving, and harboring others to take drugs, the trained second text classification model can calculate the probability that the text to be identified belongs to any one of these text types.
Step 107, classifying the text to be recognized according to the first text classifier and the second text classification model.
After the first text classifier and the second text classification model are obtained, the text to be recognized can be classified and its text type identified; the text to be recognized may belong either to a text type contained in the first sample set or to one contained in the second sample set.
By applying the technical scheme of this embodiment, the text samples of each text type are first divided by sample count: a first sample set is built from the samples of text types with fewer samples, and a second sample set from the samples of text types with sufficient samples. Then a first text classifier and a second text classification model are established for the small-sample and large-sample text types using the first and second sample sets respectively, where the first text classifier is determined by the classification contribution degree of the feature keywords extracted from the first sample set to each text type. Finally, the text type of the text to be recognized is identified using the first text classifier and the second text classification model together. By building a first text classifier suited to recognizing small-sample text types and a second text classification model suited to recognizing large-sample text types, this embodiment avoids the prior-art defect of building a single model for all text types, in which text types with insufficient samples are drowned out by those with sufficient samples, and thus achieves higher recognition accuracy.
Further, as a refinement and extension of the foregoing embodiment, in order to fully describe the implementation procedure of this embodiment, another text classification method is provided, as shown in fig. 2, where the method includes:
step 201, obtaining text samples of different text types;
step 202, dividing the text samples into a first sample set and a second sample set according to the number of the text samples of each text type, wherein the number of the text samples of any text type contained in the first sample set is smaller than a preset threshold value, and the number of the text samples of any text type contained in the second sample set is larger than or equal to the preset threshold value.
According to the number of text samples corresponding to each text type, the text samples are divided into a first sample set and a second sample set; for example, the text samples of text types with fewer than 1000 samples are placed in the first sample set, and the text samples of text types with 1000 or more samples are placed in the second sample set.
Step 203, performing word segmentation processing on the text samples contained in the first sample set according to a preset phrase comparison table to extract feature words.
According to a preset phrase comparison table, the text samples in the first sample set are segmented. The preset phrase comparison table can be regarded as a dictionary containing a number of preset phrases that are helpful for text classification; word segmentation extracts from the text samples the phrases that match entries in this dictionary, and the extracted phrases serve as the feature words of the text samples.
Step 204, counting the number of each feature word, and determining feature keywords according to the number of each feature word.
After the feature words are extracted, they need to be further screened. Specifically, the number of occurrences of each feature word across the text samples, i.e., its count, is tallied, and the most frequent feature words are kept as feature keywords; for example, the top 60% of feature words by count may be taken as feature keywords, and this 60% can be adjusted to another value, which is not limited here.
In addition, a feature word that occurs repeatedly in one or a few text samples should not outrank other feature words solely because of those few samples. For example, suppose phrase A appears 50 times in one text sample but only 10 times across the other samples, while phrase B appears 30 times spread across all text samples; if only one of A and B could be kept as a feature keyword, choosing A while ignoring how the occurrences are distributed across samples would clearly be unreasonable. Therefore, in the embodiment of the present application, each feature word is counted at most once per text sample when tallying its count; that is, the count of a feature word is the number of text samples containing it, as sketched below.
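The counting rule can be sketched as follows (an assumed implementation: "lexicon" stands in for the preset phrase comparison table, substring matching stands in for dictionary-based word segmentation, and the 0.6 ratio mirrors the adjustable 60% above):

```python
from collections import Counter

def select_feature_keywords(first_set, lexicon, keep_ratio=0.6):
    """Tally each feature word at most once per text sample (document
    frequency) and keep the most frequent fraction as feature keywords."""
    doc_freq = Counter()
    for text, _ in first_set:
        present = {w for w in lexicon if w in text}  # dedupe within a sample
        doc_freq.update(present)
    ranked = [w for w, _ in doc_freq.most_common()]
    return ranked[: max(1, int(len(ranked) * keep_ratio))]
```

Under this rule a phrase's count is its sample coverage, so phrase A's repetitions within a single sample can no longer inflate its rank over phrase B.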
Step 205, calculating the classification contribution degree of the feature keywords to the text samples of each text type contained in the first sample set according to the text samples contained in the first sample set.
Specifically, the classification contribution degree is calculated according to a classification contribution calculation formula:

P(C_i|X_j) = P(X_j|C_i) · P(C_i) / P(X)

where i = (1, 2, …, m), m being the number of text types contained in the first sample set, and C_i represents the i-th text type; j = (1, 2, …, n), n being the number of feature keywords, and X_j represents the j-th feature keyword; P(X_j|C_i) represents the probability that feature keyword X_j occurs in text samples of text type C_i; P(C_i) represents the ratio of the number of text samples of text type C_i in the first sample set to the number of all text samples in the first sample set; P(X) represents a preset coefficient; and P(C_i|X_j) represents the classification contribution degree of feature keyword X_j to text samples of text type C_i.
For example, C1 represents the defamation text type, C2 the money laundering text type, C3 the reselling-cultural-relics text type, and C4 the smuggling-weapons-and-ammunition text type.
Note that the specific calculation of P(X_j|C_i) is exemplified as follows: assuming there are 10 training samples with class label C_i, if feature keyword X_j appears in 9 of those 10 samples, the probability is 9/10 = 90%; if X_j appears in 3 of the 10 samples, the probability is 3/10 = 30%.
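Putting the formula and the worked example together, the contribution computation might be sketched as follows (function and variable names are assumptions; P(X) is passed in as the preset coefficient):

```python
from collections import defaultdict

def classification_contributions(first_set, keywords, p_x=1.0):
    """Compute P(C_i|X_j) = P(X_j|C_i) * P(C_i) / P(X) for every text type
    C_i in the first sample set and every feature keyword X_j."""
    by_type = defaultdict(list)
    for text, label in first_set:
        by_type[label].append(text)
    total = len(first_set)
    contrib = defaultdict(dict)  # contrib[C_i][X_j] -> P(C_i|X_j)
    for label, texts in by_type.items():
        p_ci = len(texts) / total                    # prior P(C_i)
        for kw in keywords:
            hits = sum(1 for t in texts if kw in t)  # samples of C_i containing X_j
            contrib[label][kw] = (hits / len(texts)) * p_ci / p_x
    return contrib
```

For the 10-sample example above, hits = 9 gives P(X_j|C_i) = 0.9 before the prior and the preset coefficient are applied.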
Step 206, constructing a first text classifier according to the classification contribution degree.
Specifically, a first text classifier is constructed according to a first text classification formula:

P(C_i|Y) = P(C_i|y_1) + P(C_i|y_2) + … + P(C_i|y_l)

where k = (1, 2, …, l), l is the number of feature words in the sample Y to be predicted, y_k is the k-th feature word, and P(C_i|y_k) represents the classification contribution degree of feature word y_k to text samples of text type C_i: if feature word y_k is identical to some feature keyword X_j, then P(C_i|y_k) = P(C_i|X_j); if feature word y_k is not identical to any feature keyword X_j, then P(C_i|y_k) = 0.
Specifically, assuming that the first sample set contains 20 feature keywords in total, then for any one text type C_i, P(C_i|X_j) denotes 20 values, representing the classification contribution degrees of the 20 feature keywords to category C_i. The y_k in P(C_i|y_k) denotes a feature word of the sample to be predicted, extracted according to the preset phrase comparison table; it can therefore happen that one or more feature words of the sample to be predicted are not among the feature keywords X corresponding to the first sample set, and for any feature word of the sample to be predicted that matches no original feature keyword X, its classification contribution degree is taken as 0 in the calculation.
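The scoring itself then reduces to a lookup-and-sum over the contribution table from the earlier sketch (the additive combination follows the reading of the first text classification formula reconstructed above; an unmatched feature word contributes 0 via the dictionary default):

```python
def score_with_first_classifier(feature_words, contrib):
    """Score a segmented sample against each small-sample text type C_i by
    summing P(C_i|y_k) over its feature words, as in step 206; a word that
    is not a feature keyword contributes 0."""
    return {
        label: sum(table.get(w, 0.0) for w in feature_words)
        for label, table in contrib.items()
    }
```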
Step 207, word segmentation processing is performed on the text samples in the second sample set, so as to obtain phrases corresponding to each text sample.
Word segmentation is performed on the text samples; invalid characters can be removed and stop words filtered out. Assume four texts whose phrases after segmentation are, respectively: text A ["a"]; text B ["b", "c", "b"]; text C ["a", "c"]; text D ["c", "d"].
Step 208, constructing a text vector corresponding to each text sample according to the phrase corresponding to each text sample.
A text vector is constructed from the phrases corresponding to each text sample. Since the text samples selected in this embodiment are generally short texts, the text vector is kept at a preset dimension: a text longer than the preset dimension is truncated, and one shorter is padded with 0 elements.
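A sketch of this fixed-dimension construction (the vocabulary index, the padding id 0, and the dimension of 128 are assumptions):

```python
def to_fixed_vector(phrases, vocab_index, dim=128):
    """Map a phrase sequence to a fixed-length id vector: truncate beyond
    'dim', pad with 0 otherwise (0 is reserved for padding)."""
    ids = [vocab_index.get(p, 0) for p in phrases][:dim]
    return ids + [0] * (dim - len(ids))
```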
Step 209, training a second text classification model by using the text vector and the text type of the text sample corresponding to the text vector, wherein the second text classification model is a convolutional neural network model.
The second text classification model may be a convolutional neural network model, or a support vector machine (SVM) model or another model commonly used for text classification. The architecture of the convolutional neural network comprises a convolution layer, a pooling layer, and a fully connected layer. The convolution layer acts as the feature extraction layer: filters extract text features, and the feature maps generated by the convolution kernels are output to the pooling layer. The pooling layer is the feature mapping layer; it downsamples the feature maps generated by the convolution layer and outputs the strongest features. The fully connected softmax layer completes the classification task and outputs the classification probability for each text type contained in the second sample set.
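A minimal convolution-pooling-fully-connected sketch of such a model in PyTorch (all hyperparameters here, embedding size, filter count, and kernel width, are illustrative assumptions, not values from the patent):

```python
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    def __init__(self, vocab_size, num_types, embed_dim=64, filters=100, kernel=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.conv = nn.Conv1d(embed_dim, filters, kernel)  # feature extraction layer
        self.fc = nn.Linear(filters, num_types)            # fully connected layer

    def forward(self, x):                          # x: (batch, seq_len) word ids
        e = self.embed(x).transpose(1, 2)          # (batch, embed_dim, seq_len)
        h = F.relu(self.conv(e))                   # feature maps
        p = F.max_pool1d(h, h.size(2)).squeeze(2)  # pooling: keep strongest features
        return F.softmax(self.fc(p), dim=1)        # per-type classification probabilities
```

For training with a cross-entropy loss the softmax would normally be folded into the loss; it is kept explicit here to match the description of the softmax layer outputting classification probabilities.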
Step 210, word segmentation is performed on the text to be recognized, so as to obtain a phrase contained in the text to be recognized.
When the text type of the text to be recognized is recognized, firstly, word segmentation processing is carried out on the text to be recognized to obtain a plurality of phrases contained in the text to be recognized, and then the recognition of the text type is realized according to the phrases.
Step 211, converting the phrase contained in the text to be recognized into a word vector corresponding to the text to be recognized.
Since the text types contained in the second sample set are the more common ones, the second text classification model is used first to judge whether the text to be recognized belongs to a text type contained in the second sample set; specifically, the phrases contained in the text to be recognized are converted into word vectors.
Step 212, inputting the word vector corresponding to the text to be recognized into the second text classification model to obtain the probability that the text to be recognized belongs to each text type contained in the second sample set.
The word vector corresponding to the text to be recognized is input into the second text classification model, which outputs the probability that the text to be recognized belongs to each of the text types contained in the text samples of the second sample set.
Step 213, if the maximum value in the probabilities is greater than or equal to the preset classification probability, determining the text type corresponding to the maximum value in the probabilities as the text type of the text to be recognized.
Among the calculated probabilities that the text to be recognized belongs to each text type contained in the second sample set, the maximum value is found. If this maximum is greater than or equal to the preset classification probability, the text to be recognized is very likely to belong to the corresponding text type, and that text type is determined as the text type of the text to be recognized.
Step 214, if the maximum value in the probabilities is smaller than the preset classification probability, determining the classification contribution degree corresponding to the phrase included in the text to be recognized according to the classification contribution degree corresponding to the feature keyword.
If the maximum probability value is smaller than the preset classification probability, the text to be recognized is unlikely to belong to any text type of the second sample set, so its text type is identified with the first text classifier: the classification contribution degree corresponding to each phrase extracted from the text to be recognized is determined from the previously calculated classification contribution degrees of the feature keywords. Specifically, if a phrase is one of the feature keywords, the classification contribution degree of that feature keyword is taken as the classification contribution degree of the phrase; if it is not, the classification contribution degree of the phrase is 0.
Step 215, calculating the probability that the text to be recognized belongs to each text type contained in the first sample set according to the classification contribution degree corresponding to the phrase contained in the text to be recognized and the first text classifier.
Using the first text classifier, the probability that the text to be recognized belongs to each of the text types contained in the first sample set is calculated.
Step 216, determining the text type corresponding to the maximum value in the probability that the text to be recognized belongs to each text type contained in the first sample set as the text type of the text to be recognized.
The text type corresponding to the maximum probability calculated with the first text classifier is determined as the text type of the text to be recognized. It should be noted that this probability is not the actual probability that the text belongs to a given text type: if the calculated probability for text type X is 80%, this does not mean the true probability that the text to be predicted belongs to type X is 80%. The value is used only for comparison against the probabilities for the other text types, so as to find the text type with the maximum value and determine it as the text type of the text to be recognized.
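Steps 210 to 216 combine into a two-stage cascade. The sketch below reuses the helpers outlined earlier (to_fixed_vector, score_with_first_classifier); "segment" is an assumed word-segmentation function and 0.5 merely stands in for the preset classification probability:

```python
import torch

def classify(text, segment, vocab_index, cnn_model, contrib, preset_prob=0.5):
    """Try the second text classification model first; fall back to the
    first text classifier when its best probability is not high enough."""
    words = segment(text)
    vec = torch.tensor([to_fixed_vector(words, vocab_index)])
    probs = cnn_model(vec)[0]                      # steps 211-212
    best = int(probs.argmax())
    if float(probs[best]) >= preset_prob:          # step 213
        return ("second_sample_set", best)
    scores = score_with_first_classifier(words, contrib)  # steps 214-215
    return ("first_sample_set", max(scores, key=scores.get))  # step 216
```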
In the technical scheme of this embodiment, according to the number of text samples of the different text types, a first sample set is built from the scarcer text samples and a second sample set from the sufficient ones. The first sample set is used to build a first text classifier suited to distinguishing the text types with insufficient samples, and the second sample set is used to train a second text classification model suited to distinguishing the text types with sufficient samples, avoiding the inaccurate modeling caused by the uneven distribution of text samples across types. When classifying a text to be recognized, the second text classification model is applied first to determine its text type; if the text does not belong to a text type corresponding to the second sample set, the first text classifier is used to identify it. The text type of the text to be recognized is thus determined, the difficulty of recognizing text types with few samples is overcome, and the overall recognition accuracy is improved.
Further, as a specific implementation of the method of fig. 1, an embodiment of the present application provides a text classification apparatus, as shown in fig. 3, where the apparatus includes: the system comprises a sample acquisition module 31, a sample set construction module 32, a keyword extraction module 33, a classification contribution calculation module 34, a first classifier construction module 35, a second classification model training module 36 and a classification module 37.
A sample acquiring module 31, configured to acquire text samples of different text types;
a sample set construction module 32, configured to divide the text samples into a first sample set and a second sample set according to the number of text samples of each text type, where the number of text samples of any text type included in the first sample set is less than a preset threshold value, and the number of text samples of any text type included in the second sample set is greater than or equal to the preset threshold value;
a keyword extraction module 33, configured to extract feature keywords from text samples included in the first sample set;
a classification contribution calculation module 34, configured to calculate, according to the text samples included in the first sample set, a classification contribution of the feature keyword to the text samples of each text type included in the first sample set;
a first classifier construction module 35, configured to construct a first text classifier according to the classification contribution;
a second classification model training module 36 for training a second text classification model using the second sample set;
the classification module 37 is configured to classify the text to be identified according to the first text classifier and the second text classification model.
In a specific application scenario, as shown in fig. 4, the keyword extraction module 33 specifically includes: a feature word extraction unit 331 and a keyword extraction unit 332.
The feature word extracting unit 331 is configured to perform word segmentation processing on a text sample included in the first sample set according to a preset phrase comparison table to extract feature words;
the keyword extraction unit 332 is configured to count the number of each feature word, and determine the feature keywords according to the number of each feature word.
The classification contribution calculating module 34 is specifically configured to calculate the classification contribution degree according to a classification contribution calculation formula:

P(C_i|X_j) = P(X_j|C_i) · P(C_i) / P(X)

where i = (1, 2, …, m), m being the number of text types contained in the first sample set, and C_i represents the i-th text type; j = (1, 2, …, n), n being the number of feature keywords, and X_j represents the j-th feature keyword; P(X_j|C_i) represents the probability that feature keyword X_j occurs in text samples of text type C_i; P(C_i) represents the ratio of the number of text samples of text type C_i in the first sample set to the number of all text samples in the first sample set; P(X) represents a preset coefficient; and P(C_i|X_j) represents the classification contribution degree of feature keyword X_j to text samples of text type C_i.
The first classifier construction module 35 is specifically configured to construct the first text classifier according to a first text classification formula:

P(C_i|Y) = P(C_i|y_1) + P(C_i|y_2) + … + P(C_i|y_l)

where k = (1, 2, …, l), l is the number of feature words in the sample Y to be predicted, y_k is the k-th feature word, and P(C_i|y_k) represents the classification contribution degree of feature word y_k to text samples of text type C_i: if feature word y_k is identical to some feature keyword X_j, then P(C_i|y_k) = P(C_i|X_j); otherwise P(C_i|y_k) = 0.
In a specific application scenario, as shown in fig. 4, the second classification model training module 36 specifically includes: a first word segmentation unit 361, a first word vector construction unit 362, and a second classification model training unit 363.
A first word segmentation unit 361, configured to perform word segmentation on the text samples in the second sample set to obtain a phrase corresponding to each text sample;
a first word vector constructing unit 362, configured to construct a text vector corresponding to each text sample according to the phrase corresponding to each text sample;
a second classification model training unit 363 is configured to train the second text classification model by using the text vector and the text type of the text sample corresponding to the text vector, where the second text classification model is a convolutional neural network model.
In a specific application scenario, as shown in fig. 4, the classification module 37 specifically includes: a second word segmentation unit 371, a second word vector construction unit 372, a second text type recognition unit 373, a second text type determination unit 374, a classification contribution degree determination unit 375, a first text type recognition unit 376, and a first text type determination unit 377.
The second word segmentation unit 371 is used for performing word segmentation on the text to be recognized to obtain a phrase contained in the text to be recognized;
a second word vector construction unit 372, configured to convert a phrase included in the text to be recognized into a word vector corresponding to the text to be recognized;
a second text type recognition unit 373, configured to input a word vector corresponding to the text to be recognized into a second text classification model, so as to obtain a probability that the text to be recognized belongs to each text type included in the second sample set;
and a second text type determining unit 374, configured to determine, as the text type of the text to be recognized, the text type corresponding to the maximum value in the probabilities if the maximum value in the probabilities is greater than or equal to the preset classification probability.
The classification contribution degree determining unit 375 is configured to determine, if the maximum value of the probabilities is smaller than a preset classification probability, a classification contribution degree corresponding to a phrase included in the text to be identified according to the classification contribution degree corresponding to the feature keyword;
a first text type recognition unit 376, configured to calculate, according to a classification contribution corresponding to a phrase included in the text to be recognized and a first text classifier, a probability that the text to be recognized belongs to each text type included in the first sample set;
the first text type determining unit 377 is configured to determine, as the text type of the text to be recognized, the text type corresponding to the maximum value in the probabilities that the text to be recognized belongs to each text type included in the first sample set.
It should be noted that, for other corresponding descriptions of each functional unit related to the text classification device provided in the embodiment of the present application, reference may be made to corresponding descriptions in fig. 1 and fig. 2, and no further description is given here.
Based on the above-mentioned methods shown in fig. 1 and 2, correspondingly, the embodiments of the present application further provide a storage medium, on which a computer program is stored, which when executed by a processor, implements the above-mentioned text classification method shown in fig. 1 and 2.
Based on such understanding, the technical solution of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash drive, or a removable hard disk) and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the various implementation scenarios of the present application.
Based on the methods shown in fig. 1 and fig. 2 and the virtual device embodiments shown in fig. 3 and fig. 4, in order to achieve the above objects, the embodiments of the present application further provide a computer device, which may specifically be a personal computer, a server, a network device, etc., where the computer device includes a storage medium and a processor; a storage medium storing a computer program; a processor for executing a computer program to implement the text classification method as described above and shown in fig. 1 and 2.
Optionally, the computer device may also include a user interface, a network interface, a camera, radio frequency (RF) circuitry, sensors, audio circuitry, a Wi-Fi module, and the like. The user interface may include a display screen and an input unit such as a keyboard, and optionally may also include a USB interface, a card reader interface, and the like. The network interface may optionally include a standard wired interface, a wireless interface (e.g., a Bluetooth or Wi-Fi interface), and the like.
It will be appreciated by those skilled in the art that the computer device structure provided in this embodiment does not limit the computer device, which may include more or fewer components, combine certain components, or arrange the components differently.
The storage medium may also include an operating system and a network communication module. The operating system is a program that manages and maintains the hardware and software resources of the computer device, supporting the execution of the information processing program and other software and/or programs. The network communication module is used to implement communication among the components inside the storage medium, as well as communication with other hardware and software in the physical device.
Through the description of the above embodiments, those skilled in the art can clearly understand that the present application may be implemented by means of software plus a necessary general hardware platform, or by hardware. The text samples of each text type are first divided by sample count: a first sample set is built from the samples of text types with fewer samples, and a second sample set from the samples of text types with sufficient samples. Then a first text classifier and a second text classification model are established for the small-sample and large-sample text types using the first and second sample sets respectively, where the first text classifier is determined by the classification contribution degree of the feature keywords extracted from the first sample set to each text type. Finally, the text type of the text to be identified is recognized using the first text classifier and the second text classification model together. By building a first text classifier suited to recognizing small-sample text types and a second text classification model suited to recognizing large-sample text types for sample texts of different text types, the present application avoids the prior-art defect of building a single model for all text types, in which text types with insufficient samples are drowned out by those with sufficient samples, and achieves higher recognition accuracy.
Those skilled in the art will appreciate that the drawings are merely schematic illustrations of one preferred implementation scenario, and that the modules or flows in the drawings are not necessarily required to practice the present application. Those skilled in the art will appreciate that modules in an apparatus in an implementation scenario may be distributed in an apparatus in an implementation scenario according to an implementation scenario description, or that corresponding changes may be located in one or more apparatuses different from the implementation scenario. The modules of the implementation scenario may be combined into one module, or may be further split into a plurality of sub-modules.
The foregoing embodiment numbers are merely for description and do not represent the superiority or inferiority of the implementation scenarios. The foregoing disclosure covers only a few specific implementations of the present application, but the present application is not limited thereto, and any variation conceivable to a person skilled in the art shall fall within the protection scope of the present application.

Claims (8)

1. A method of text classification, comprising:
acquiring text samples of different text types;
dividing the text samples into a first sample set and a second sample set according to the number of the text samples of each text type, wherein the number of the text samples of any text type contained in the first sample set is smaller than a preset threshold value, and the number of the text samples of any text type contained in the second sample set is larger than or equal to the preset threshold value;
according to a preset phrase comparison table, performing word segmentation processing on text samples contained in the first sample set to extract feature words; counting the number of each feature word, and determining the feature keywords according to the number of each feature word;
calculating the classification contribution degree of the feature keywords to the text samples of each text type contained in the first sample set according to the text samples contained in the first sample set;
constructing a first text classifier according to the classification contribution degree;
training a second text classification model using the second sample set;
classifying the text to be identified according to the first text classifier and the second text classification model;
wherein the classification contribution degree is calculated according to a classification contribution calculation formula:

P(C_i|X_j) = P(X_j|C_i) · P(C_i) / P(X)

where i = (1, 2, …, m), m is the number of text types contained in the first sample set, and C_i represents the i-th text type; j = (1, 2, …, n), n is the number of feature keywords, and X_j represents the j-th feature keyword; P(X_j|C_i) represents the probability that feature keyword X_j occurs in text samples of text type C_i; P(C_i) represents the ratio of the number of text samples of text type C_i in the first sample set to the number of all text samples in the first sample set; P(X) represents a preset coefficient; and P(C_i|X_j) represents the classification contribution degree of feature keyword X_j to text samples of text type C_i.
2. The method according to claim 1, wherein constructing a first text classifier according to the classification contribution comprises:
constructing the first text classifier according to a first text classification formula:

P(C_i|Y) = P(C_i|y_1) + P(C_i|y_2) + … + P(C_i|y_l)

wherein k = (1, 2, …, l), l is the number of feature words in the sample Y to be predicted, y_k represents the k-th feature word of the sample Y to be predicted, and P(C_i|y_k) represents the classification contribution degree of feature word y_k to text samples of text type C_i: if feature word y_k is identical to feature keyword X_j, then P(C_i|y_k) = P(C_i|X_j); if feature word y_k is not identical to any feature keyword X_j, then P(C_i|y_k) = 0.
3. The method according to claim 2, wherein training a second text classification model using the second sample set, in particular comprises:
word segmentation is carried out on the text samples in the second sample set, so that phrases corresponding to each text sample are obtained;
constructing a text vector corresponding to each text sample according to the phrase corresponding to each text sample;
and training the second text classification model by using the text vector and the text type of the text sample corresponding to the text vector, wherein the second text classification model is a convolutional neural network model.
4. A method according to claim 3, wherein classifying the text to be identified according to the first text classifier and the second text classification model comprises:
word segmentation is carried out on the text to be identified, so that a phrase contained in the text to be identified is obtained;
converting the word group contained in the text to be recognized into a word vector corresponding to the text to be recognized;
inputting word vectors corresponding to the texts to be recognized into the second text classification model to obtain the probability that the texts to be recognized belong to each text type contained in the second sample set;
if the maximum value in the probabilities is greater than or equal to a preset classification probability, determining the text type corresponding to the maximum value in the probabilities as the text type of the text to be identified.
5. The method of claim 4, wherein classifying the text to be identified according to the first text classifier and the second text classification model, in particular further comprises:
if the maximum value in the probabilities is smaller than the preset classification probability, determining the classification contribution degree corresponding to the phrase contained in the text to be identified according to the classification contribution degree corresponding to the feature keywords;
calculating the probability that the text to be recognized belongs to each text type contained in the first sample set according to the classification contribution degree corresponding to the phrase contained in the text to be recognized and the first text classifier;
and determining the text type corresponding to the maximum value in the probability that the text to be identified belongs to each text type contained in the first sample set as the text type of the text to be identified.
6. A text classification device, comprising:
the sample acquisition module is used for acquiring text samples of different text types;
a sample set construction module, configured to divide the text samples into a first sample set and a second sample set according to the number of the text samples of each text type, where the number of the text samples of any text type included in the first sample set is less than a preset threshold value, and the number of the text samples of any text type included in the second sample set is greater than or equal to the preset threshold value;
the keyword extraction module is used for carrying out word segmentation processing on the text samples contained in the first sample set according to a preset phrase comparison table to extract feature words; counting the number of each feature word, and determining the feature keywords according to the number of each feature word;
the classification contribution degree calculation module is used for calculating the classification contribution degree of the feature keywords to the text samples of each text type contained in the first sample set according to the text samples contained in the first sample set;
the first classifier construction module is used for constructing a first text classifier according to the classification contribution degree;
the second classification model training module is used for training a second text classification model by using the second sample set;
the classification module is used for classifying the text to be identified according to the first text classifier and the second text classification model;
wherein the classification contribution degree is calculated according to a classification contribution calculation formula:

P(C_i|X_j) = P(X_j|C_i) · P(C_i) / P(X)

where i = (1, 2, …, m), m is the number of text types contained in the first sample set, and C_i represents the i-th text type; j = (1, 2, …, n), n is the number of feature keywords, and X_j represents the j-th feature keyword; P(X_j|C_i) represents the probability that feature keyword X_j occurs in text samples of text type C_i; P(C_i) represents the ratio of the number of text samples of text type C_i in the first sample set to the number of all text samples in the first sample set; P(X) represents a preset coefficient; and P(C_i|X_j) represents the classification contribution degree of feature keyword X_j to text samples of text type C_i.
7. A storage medium having stored thereon a computer program, wherein the program when executed by a processor implements the text classification method of any of claims 1 to 5.
8. A computer device comprising a storage medium, a processor and a computer program stored on the storage medium and executable on the processor, characterized in that the processor implements the text classification method of any of claims 1 to 5 when executing the program.
CN201910390290.8A 2019-05-10 2019-05-10 Text classification method and device, storage medium and computer equipment Active CN110287311B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910390290.8A CN110287311B (en) 2019-05-10 2019-05-10 Text classification method and device, storage medium and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910390290.8A CN110287311B (en) 2019-05-10 2019-05-10 Text classification method and device, storage medium and computer equipment

Publications (2)

Publication Number Publication Date
CN110287311A CN110287311A (en) 2019-09-27
CN110287311B true CN110287311B (en) 2023-05-26

Family

ID=68001583

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910390290.8A Active CN110287311B (en) 2019-05-10 2019-05-10 Text classification method and device, storage medium and computer equipment

Country Status (1)

Country Link
CN (1) CN110287311B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111400499A (en) * 2020-03-24 2020-07-10 网易(杭州)网络有限公司 Training method of document classification model, document classification method, device and equipment
CN112329475B (en) * 2020-11-03 2022-05-20 海信视像科技股份有限公司 Statement processing method and device
CN113011503B (en) * 2021-03-17 2021-11-23 彭黎文 Data evidence obtaining method of electronic equipment, storage medium and terminal
CN113051385B (en) * 2021-04-28 2023-05-26 杭州网易再顾科技有限公司 Method, medium, device and computing equipment for intention recognition
CN116226382B (en) * 2023-02-28 2023-08-01 北京数美时代科技有限公司 Text classification method and device for given keywords, electronic equipment and medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009078096A1 (en) * 2007-12-18 2009-06-25 Fujitsu Limited Generating method of two class classification prediction model, program for generating classification prediction model and generating device of two class classification prediction model
CN102081627A (en) * 2009-11-27 2011-06-01 北京金山软件有限公司 Method and system for determining contribution degree of word in text
CN106294466A (en) * 2015-06-02 2017-01-04 富士通株式会社 Disaggregated model construction method, disaggregated model build equipment and sorting technique
CN106610949A (en) * 2016-09-29 2017-05-03 四川用联信息技术有限公司 Text feature extraction method based on semantic analysis
CN109492026A (en) * 2018-11-02 2019-03-19 国家计算机网络与信息安全管理中心 A kind of Telecoms Fraud classification and Detection method based on improved active learning techniques
CN109583474A (en) * 2018-11-01 2019-04-05 华中科技大学 A kind of training sample generation method for the processing of industrial big data


Also Published As

Publication number Publication date
CN110287311A (en) 2019-09-27

Similar Documents

Publication Publication Date Title
CN110287311B (en) Text classification method and device, storage medium and computer equipment
CN110287312B (en) Text similarity calculation method, device, computer equipment and computer storage medium
Goodfellow et al. Multi-digit number recognition from street view imagery using deep convolutional neural networks
CN110362677B (en) Text data category identification method and device, storage medium and computer equipment
WO2019179403A1 (en) Fraud transaction detection method based on sequence width depth learning
US8041120B2 (en) Unified digital ink recognition
CN107391760A (en) User interest recognition methods, device and computer-readable recording medium
CN108228845B (en) Mobile phone game classification method
CN110717554B (en) Image recognition method, electronic device, and storage medium
CN110457677B (en) Entity relationship identification method and device, storage medium and computer equipment
CN113254643B (en) Text classification method and device, electronic equipment and text classification program
US11720789B2 (en) Fast nearest neighbor search for output generation of convolutional neural networks
CN107180084A (en) Word library updating method and device
US10423817B2 (en) Latent fingerprint ridge flow map improvement
CN111260568B (en) Peak binarization background noise removing method based on multi-discriminator countermeasure network
CN106991312B (en) Internet anti-fraud authentication method based on voiceprint recognition
CN111694954B (en) Image classification method and device and electronic equipment
CN111062440B (en) Sample selection method, device, equipment and storage medium
CN106203508A (en) A kind of image classification method based on Hadoop platform
CN113094478B (en) Expression reply method, device, equipment and storage medium
CN110717407A (en) Human face recognition method, device and storage medium based on lip language password
CN114579743A (en) Attention-based text classification method and device and computer readable medium
CN114169439A (en) Abnormal communication number identification method and device, electronic equipment and readable medium
CN115795394A (en) Biological feature fusion identity recognition method for hierarchical multi-modal and advanced incremental learning
CN115713669A (en) Image classification method and device based on inter-class relation, storage medium and terminal

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant