CN110287311A - Text classification method and device, storage medium, computer equipment - Google Patents


Info

Publication number
CN110287311A
Authority
CN
China
Prior art keywords
text
samples
classification
sample set
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910390290.8A
Other languages
Chinese (zh)
Other versions
CN110287311B (en)
Inventor
钱柏丞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201910390290.8A priority Critical patent/CN110287311B/en
Publication of CN110287311A publication Critical patent/CN110287311A/en
Application granted granted Critical
Publication of CN110287311B publication Critical patent/CN110287311B/en
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3347Query execution using vector based model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques

Abstract

This application discloses a text classification method and apparatus, a storage medium, and a computer device. The method comprises: obtaining text samples of different text types; dividing the text samples into a first sample set and a second sample set according to the number of text samples of each text type, wherein the number of text samples of any text type contained in the first sample set is smaller than a preset threshold and the number of text samples of any text type contained in the second sample set is greater than or equal to the preset threshold; extracting feature keywords from the text samples contained in the first sample set; calculating, according to the text samples contained in the first sample set, the classification contribution degree of the feature keywords to the text samples of each text type contained in the first sample set; constructing a first text classifier according to the classification contribution degree; training a second text classification model using the second sample set; and classifying the text to be recognized according to the first text classifier and the second text classification model.

Description

Text classification method and device, storage medium and computer equipment
Technical Field
The present application relates to the field of text classification technologies, and in particular, to a text classification method and apparatus, a storage medium, and a computer device.
Background
For text classification in the field of natural language processing, training a machine learning or deep learning model often runs into skewed training data: the text types in one part have sufficient training samples, while those in another part have very few. This uneven distribution of training samples biases model training, makes the text types with few training samples hard to predict, and lowers the overall prediction quality of the model.
Text classification training methods in the prior art often ignore this problem and treat all samples identically, or supplement the small-sample classes with an oversampling strategy. Ignoring the problem leads to poor classification performance, while the scale of oversampling is hard to control, easily causes overfitting, and likewise fails to improve the model's classification performance.
Disclosure of Invention
In view of this, the present application provides a text classification method and apparatus, a storage medium, and a computer device, which establish separate classifiers or classification models for text types whose sample counts are unevenly distributed, achieving higher recognition accuracy.
According to an aspect of the present application, there is provided a text classification method, including:
acquiring text samples of different text types;
dividing the text samples into a first sample set and a second sample set according to the number of the text samples of each text type, wherein the number of the text samples of any text type contained in the first sample set is smaller than a preset threshold value, and the number of the text samples of any text type contained in the second sample set is larger than or equal to the preset threshold value;
extracting feature keywords from the text samples contained in the first sample set;
calculating the classification contribution degree of the feature keywords to the text samples of each text type contained in the first sample set according to the text samples contained in the first sample set;
constructing a first text classifier according to the classification contribution degree;
training a second text classification model by using the second sample set;
and classifying the texts to be recognized according to the first text classifier and the second text classification model.
According to another aspect of the present application, there is provided a text classification apparatus, comprising:
the sample acquisition module is used for acquiring text samples of different text types;
a sample set construction module, configured to divide the text samples into a first sample set and a second sample set according to the number of the text samples of each text type, where the number of the text samples of any text type included in the first sample set is smaller than a preset threshold, and the number of the text samples of any text type included in the second sample set is greater than or equal to the preset threshold;
a keyword extraction module, configured to extract feature keywords from the text samples included in the first sample set;
a classification contribution degree calculation module, configured to calculate, according to the text samples included in the first sample set, a classification contribution degree of the feature keyword to a text sample of each text type included in the first sample set;
the first classifier building module is used for building a first text classifier according to the classification contribution degree;
the second classification model training module is used for training a second text classification model by utilizing the second sample set;
and the classification module is used for classifying the texts to be recognized according to the first text classifier and the second text classification model.
According to yet another aspect of the application, a storage medium is provided, on which a computer program is stored, which program, when being executed by a processor, carries out the above-mentioned text classification method.
According to yet another aspect of the present application, there is provided a computer device comprising a storage medium, a processor, and a computer program stored on the storage medium and executable on the processor, the processor implementing the text classification method when executing the program.
By means of the above technical solution, the text classification method and apparatus, storage medium, and computer device provided by the present application first group the text samples of each text type according to sample count: the samples of the text types with few samples establish a first sample set, and the samples of the text types with sufficient samples establish a second sample set. Then, the first sample set and the second sample set are used to establish, respectively, a first text classifier for the small-sample text types and a second text classification model for the large-sample text types, where the first text classifier is determined according to the classification contribution degree of the feature keywords extracted from the first sample set to each text type. Finally, the text type of the text to be recognized is identified using the first text classifier and the second text classification model. For the sample texts of different text types, the present application thus establishes a first text classifier suited to recognizing the small-sample text types and a second text classification model suited to recognizing the large-sample text types. Compared with the prior art, which builds a single model for all text types so that the text types with insufficient samples are drowned out by the text types with sufficient samples, building separate classifiers or classification models for the different text types achieves higher recognition accuracy.
The foregoing is only an overview of the technical solutions of the present application. To make the technical means of the present application clearer, so that it can be implemented according to the content of the description, and to make the above and other objects, features, and advantages of the present application more readily understandable, detailed embodiments of the present application are set forth below.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1 is a flowchart illustrating a text classification method provided in an embodiment of the present application;
FIG. 2 is a flow chart of another text classification method provided in the embodiments of the present application;
fig. 3 is a schematic structural diagram illustrating a text classification apparatus according to an embodiment of the present application;
fig. 4 shows a schematic structural diagram of another text classification apparatus provided in the embodiment of the present application.
Detailed Description
The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
In this embodiment, a text classification method is provided, as shown in fig. 1, the method includes:
Step 101, obtaining text samples of different text types.
The embodiment of the present application is described by taking the identification of certain legal text types as an example: text samples of different law-related text types are obtained, where a text sample may be a passage of text, and the text types may specifically include the following: theft, dangerous driving, sheltering others to take drugs, drug trafficking, fraud, defamation, money laundering, reselling cultural relics, smuggling weapons and ammunition, and the like.
Step 102, dividing the text samples into a first sample set and a second sample set according to the number of the text samples of each text type, wherein the number of the text samples of any text type contained in the first sample set is smaller than a preset threshold, and the number of the text samples of any text type contained in the second sample set is larger than or equal to the preset threshold.
Among the obtained text samples, some common crime types, such as theft, dangerous driving, and sheltering others to take drugs, have sufficient samples, while some unusual crimes, such as defamation, money laundering, reselling cultural relics, and smuggling ammunition, have few samples. This results in sample data skew: text types with a sufficient number of samples are marked as large samples, and those with an insufficient number are marked as small samples. The second sample set is created from all of the large samples, and the first sample set from all of the small samples. The basis for marking large and small samples is a preset threshold: if the number of text samples of a type is smaller than the preset threshold, they are marked as small samples; if the number is greater than or equal to the preset threshold, they are marked as large samples. The first sample set and the second sample set are thus established, and recognition models are then built separately for the text types contained in each.
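As a minimal sketch of this split (the function name, the (text, label) pair format, and the default threshold of 1000, taken from the example in the detailed embodiment below, are illustrative assumptions rather than requirements of the patent):

```python
from collections import defaultdict

def split_samples(samples, threshold=1000):
    """Split (text, label) pairs into a first (small-sample) set and a
    second (large-sample) set by per-type sample count."""
    by_type = defaultdict(list)
    for text, label in samples:
        by_type[label].append((text, label))

    first_set, second_set = [], []
    for items in by_type.values():
        # Types with fewer than `threshold` samples go to the first set.
        (first_set if len(items) < threshold else second_set).extend(items)
    return first_set, second_set
```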
Step 103, extracting feature keywords from the text samples contained in the first sample set.
For text types with a small number of samples, in order to establish a classifier for identifying these types, feature keywords are first extracted from all text samples in the first sample set; the feature keywords are generally phrases that appear frequently in the text samples contained in the first sample set.
Step 104, calculating the classification contribution degree of the feature keywords to the text samples of each text type contained in the first sample set according to the text samples contained in the first sample set.
The classification contribution degree of each feature keyword to each text type contained in the first sample set is calculated from the extracted feature keywords. The meaning of the classification contribution degree can be described as: if a certain feature keyword appears in a text, by how much does the probability that the text belongs to a certain text type increase. For example, suppose the contribution degree of "heroin" to the text type "drug use" is 5% and its contribution degree to the text type "drug trafficking" is 4%; then, when the phrase "heroin" appears in a text, the probability that the text belongs to the drug-use type increases by 5%, and the probability that it belongs to the drug-trafficking type increases by 4%.
Step 105, constructing a first text classifier according to the classification contribution degree.
After the classification contribution degree of each feature keyword to the different text types is calculated, a first text classifier can be established from these contribution degrees. The first text classifier is used to calculate the probability that the text to be recognized belongs to each text type contained in the first sample set. For example, if the first sample set contains text types such as defamation, money laundering, reselling cultural relics, and smuggling ammunition, the first text classifier can be used to calculate the probability that the text to be recognized belongs to any one of these types.
Step 106, training a second text classification model by using the second sample set.
For text types with sufficient samples, a second text classification model can be trained using the second sample set. The second text classification model is used to calculate the probability that the text to be recognized belongs to each text type contained in the second sample set. For example, if the second sample set contains text types such as theft, dangerous driving, and sheltering others to take drugs, the trained second text classification model can be used to calculate the probability that the text to be recognized belongs to any one of these types.
Step 107, classifying the text to be recognized according to the first text classifier and the second text classification model.
After the first text classifier and the second text classification model are obtained, the text to be recognized can be classified, and the text type of the text to be recognized is recognized, wherein the text to be recognized can belong to the text type contained in the first sample set or the text type contained in the second sample set.
By applying the technical solution of this embodiment, the text samples of each text type are first grouped according to sample count: the samples of the text types with few samples establish a first sample set, and the samples of the text types with sufficient samples establish a second sample set. Then, the first sample set and the second sample set are used to establish, respectively, a first text classifier for the small-sample text types and a second text classification model for the large-sample text types, where the first text classifier is determined according to the classification contribution degree of the feature keywords extracted from the first sample set to each text type. Finally, the text type of the text to be recognized is identified using the first text classifier and the second text classification model. For the sample texts of different text types, this embodiment thus establishes a first text classifier suited to recognizing the small-sample text types and a second text classification model suited to recognizing the large-sample text types. Compared with the prior art, which builds a single model for all text types so that the text types with insufficient samples are drowned out by the text types with sufficient samples, building separate classifiers or classification models for the different text types achieves higher recognition accuracy.
Further, as a refinement and an extension of the specific implementation of the above embodiment, in order to fully illustrate the specific implementation process of the embodiment, another text classification method is provided, as shown in fig. 2, and the method includes:
Step 201, acquiring text samples of different text types;
Step 202, dividing the text samples into a first sample set and a second sample set according to the number of the text samples of each text type, wherein the number of the text samples of any text type contained in the first sample set is smaller than a preset threshold, and the number of the text samples of any text type contained in the second sample set is greater than or equal to the preset threshold.
The text samples are divided into a first sample set and a second sample set according to the number of text samples corresponding to each text type. For example, the text samples of types with fewer than 1000 samples are placed in the first sample set, and the text samples of types with 1000 or more samples are placed in the second sample set.
Step 203, performing word segmentation processing on the text samples contained in the first sample set according to a preset phrase comparison table to extract feature words.
Word segmentation is performed on the text samples in the first sample set according to a preset phrase comparison table, which can be regarded as a dictionary containing a number of preset phrases useful for text classification. The word segmentation extracts the phrases that the text samples have in common with the dictionary, and the extracted phrases serve as the feature words of the text samples.
Step 204, counting the number of each feature word, and determining the feature keywords according to the number of each feature word.
After the feature words are extracted, they need to be screened further. Specifically, the number of occurrences of each feature word in the text samples, i.e., the count of each feature word, is tallied, and the feature words with the largest counts are selected as the feature keywords. For example, the feature words ranking in the top 60% by count are taken as feature keywords, where 60% can be adjusted to another value and is not limited here.
In addition, the same feature word may occur repeatedly in one or a few text samples, inflating its count relative to other feature words because of those samples alone. For example, suppose phrase A appears 50 times in one text sample and 10 times in all other samples combined, while phrase B appears 30 times across all text samples; if only one of A and B can be selected as a feature keyword and per-sample occurrence counts are not considered, A would be selected, which is clearly unreasonable. Therefore, when counting feature words in this embodiment, the same feature word is counted at most once per text sample; that is, the count of each feature word is the number of text samples that contain it.
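A sketch of this document-frequency screening under the same assumptions; `vocabulary` stands in for the preset phrase comparison table, and simple substring matching stands in for a real Chinese word segmenter:

```python
from collections import Counter

def extract_feature_keywords(first_set, vocabulary, keep_ratio=0.6):
    """Select feature keywords by document frequency: each feature word is
    counted at most once per text sample."""
    doc_freq = Counter()
    for text, _label in first_set:
        # Substring matching against the preset phrase table replaces a
        # proper word segmenter in this sketch.
        doc_freq.update({w for w in vocabulary if w in text})

    ranked = [w for w, _ in doc_freq.most_common()]
    cutoff = max(1, int(len(ranked) * keep_ratio))
    return ranked[:cutoff]   # e.g. the top 60% of feature words by count
```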
Step 205, calculating the classification contribution degree of the feature keyword to the text sample of each text type contained in the first sample set according to the text samples contained in the first sample set.
Specifically, the classification contribution degree is calculated according to the classification contribution degree calculation formula
P(Ci|Xj) = P(Xj|Ci) × P(Ci) / P(X)
where i = 1, 2, ..., m, and m is the number of text types contained in the first sample set; Ci denotes the i-th text type; j = 1, 2, ..., n, and n is the number of feature keywords; Xj denotes the j-th feature keyword; P(Xj|Ci) represents the probability that the feature keyword Xj occurs in the text samples of text type Ci; P(Ci) represents the ratio of the number of text samples of text type Ci in the first sample set to the number of all text samples in the first sample set; P(X) represents a preset coefficient; and P(Ci|Xj) represents the classification contribution degree of the feature keyword Xj to the text samples of text type Ci.
For example, C1 represents the defamation text type, C2 the money laundering text type, C3 the reselling-cultural-relics text type, and C4 the smuggling-ammunition text type.
In addition, P(Xj|Ci) is calculated as follows: assuming there are 10 training samples with class label Ci, count in how many of these 10 samples the feature keyword Xj appears. If Xj appears in 9 of the samples, the probability is 9/10 = 90%; if it appears in 3 of the samples, the probability is 3/10 = 30%.
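Combining the formula with the worked example above, a sketch of the contribution-degree table follows; P(X) is treated as a preset coefficient with an assumed default of 1.0, and all names are illustrative:

```python
from collections import defaultdict

def contribution_table(first_set, keywords, p_x=1.0):
    """Compute P(Ci|Xj) = P(Xj|Ci) * P(Ci) / P(X) for every type/keyword pair."""
    texts_by_type = defaultdict(list)
    for text, label in first_set:
        texts_by_type[label].append(text)
    total = len(first_set)

    table = {}  # (text type, feature keyword) -> classification contribution
    for label, texts in texts_by_type.items():
        p_ci = len(texts) / total  # share of type Ci in the first sample set
        for kw in keywords:
            # P(Xj|Ci): fraction of type-Ci samples containing the keyword,
            # as in the 9/10 = 90% example above.
            p_xj_given_ci = sum(kw in t for t in texts) / len(texts)
            table[(label, kw)] = p_xj_given_ci * p_ci / p_x
    return table
```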
Step 206, constructing a first text classifier according to the classification contribution degree.
Specifically, the first text classifier is constructed according to the first text classification formula
P(Ci|Y) = P(Ci|y1) + P(Ci|y2) + ... + P(Ci|yl)
where k = 1, 2, ..., l, and l is the number of feature words in the sample Y to be predicted; yk denotes the k-th feature word of the sample Y to be predicted; and P(Ci|yk) denotes the classification contribution degree of the feature word yk to the text samples of text type Ci. If the feature word yk is the same as a feature keyword Xj, then P(Ci|yk) = P(Ci|Xj); if the feature word yk matches no feature keyword Xj, then P(Ci|yk) = 0.
In particular, suppose the first sample set contains 20 feature keywords in total. Then, for any text type Ci, P(Ci|Xj) takes 20 values, each representing the classification contribution of one of the 20 feature keywords to type Ci. The yk in P(Ci|yk) denotes a feature word of the sample to be predicted, which can be extracted according to the preset phrase comparison table. It can therefore happen that one or more feature words of the sample to be predicted are not among the feature keywords X corresponding to the first sample set; for such feature words, which match none of the original feature keywords X, the classification contribution is recorded as 0 in the calculation.
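A sketch of the resulting classifier, assuming, per the additive reading of the contribution degrees described earlier, that per-word contributions are summed and that feature words matching no feature keyword contribute 0:

```python
def classify_small_sample(feature_words, table, labels):
    """Score each small-sample text type; the score is only comparative,
    not a calibrated probability (see the remarks in step 216)."""
    scores = {
        label: sum(table.get((label, w), 0.0) for w in feature_words)
        for label in labels
    }
    best = max(scores, key=scores.get)
    return best, scores
```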
Step 207, performing word segmentation processing on the text samples in the second sample set to obtain the phrases corresponding to each text sample.
Word segmentation is performed on the text samples; in addition, illegal characters are removed and stop words are filtered out. Suppose the phrases obtained after segmenting four texts are: text A ["a"]; text B ["b", "c", "b"]; text C ["a", "c"]; text D ["c", "d"].
Step 208, constructing a text vector corresponding to each text sample according to the phrases corresponding to each text sample.
A text vector corresponding to each text sample is constructed from the phrases corresponding to that sample. The text samples selected in this embodiment are generally short texts, so the text vector is controlled to a preset dimension: if the text is longer than the preset dimension, the vector is truncated, and if the text is shorter than the preset dimension, the vector is padded with 0 elements.
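For example, a sketch of this truncate-or-pad step; the preset dimension of 128 and the padding id 0 are assumed values:

```python
def to_fixed_length(token_ids, dim=128, pad_id=0):
    """Truncate or zero-pad a token-id sequence to the preset dimension."""
    if len(token_ids) >= dim:
        return token_ids[:dim]                             # truncate long texts
    return token_ids + [pad_id] * (dim - len(token_ids))  # pad short texts with 0
```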
Step 209, training a second text classification model by using the text vectors and the text types of the text samples corresponding to the text vectors, wherein the second text classification model is a convolutional neural network model.
The second text classification model may be a convolutional neural network model, an SVM (support vector machine) model, or another model commonly used for text classification. The architecture of the convolutional neural network comprises: convolutional layer - pooling layer - fully connected layer. The convolutional layer serves as the feature extraction layer: filters extract text features, and convolution kernel operations generate feature maps that are output to the pooling layer. The pooling layer is a feature mapping layer that down-samples the feature maps generated by the convolutional layer to output the best features. The fully connected layer, together with the softmax layer, completes the classification task and outputs the classification probability of each text type contained in the second sample set.
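A minimal PyTorch sketch of such a convolutional layer / pooling layer / fully connected architecture follows; the hyperparameters (embedding size, kernel sizes, channel count) are illustrative assumptions, not values specified by the patent:

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    """Convolution -> max-pooling -> fully connected -> softmax, per step 209."""

    def __init__(self, vocab_size, num_classes, embed_dim=128,
                 kernel_sizes=(2, 3, 4), channels=100):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, channels, k) for k in kernel_sizes)
        self.fc = nn.Linear(channels * len(kernel_sizes), num_classes)

    def forward(self, x):                      # x: (batch, seq_len) token ids
        e = self.embedding(x).transpose(1, 2)  # (batch, embed_dim, seq_len)
        # Max-pooling keeps the best feature produced by each filter.
        pooled = [torch.relu(c(e)).max(dim=2).values for c in self.convs]
        features = torch.cat(pooled, dim=1)
        return torch.softmax(self.fc(features), dim=1)  # class probabilities
```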
Step 210, performing word segmentation processing on the text to be recognized to obtain the phrases contained in the text to be recognized.
When recognizing the text type of the text to be recognized, word segmentation is first performed on the text to obtain the phrases it contains, and the text type is then recognized from these phrases.
Step 211, converting the phrases contained in the text to be recognized into word vectors corresponding to the text to be recognized.
Since the text types contained in the second sample set are the relatively common ones, the second text classification model is used first to judge whether the text to be recognized belongs to a text type contained in the second sample set. Specifically, the phrases contained in the text to be recognized are first converted into word vectors.
Step 212, inputting the word vector corresponding to the text to be recognized into the second text classification model, and obtaining the probability that the text to be recognized belongs to each text type contained in the second sample set.
The word vector corresponding to the text to be recognized is input into the second text classification model, which outputs the probability that the text to be recognized belongs to each of the text types contained in the text samples of the second sample set.
Step 213, if the maximum value in the probabilities is greater than or equal to the preset classification probability, determining the text type corresponding to the maximum value in the probabilities as the text type of the text to be recognized.
Among the calculated probabilities that the text to be recognized belongs to each text type contained in the second sample set, the maximum value is found. If this maximum probability is greater than or equal to the preset classification probability, the text to be recognized very likely belongs to the corresponding text type, and that text type can be determined as the text type of the text to be recognized.
Step 214, if the maximum value in the probabilities is smaller than the preset classification probability, determining the classification contribution degrees corresponding to the phrases contained in the text to be recognized according to the classification contribution degrees corresponding to the feature keywords.
If the maximum probability value is smaller than the preset classification probability, the text to be recognized is unlikely to belong to any text type corresponding to the second sample set, so the first text classifier is used to recognize its text type: the classification contribution degrees of the phrases extracted from the text to be recognized are determined from the previously calculated contribution degrees of the feature keywords. Specifically, if a phrase is one of the feature keywords described above, the classification contribution degree of that feature keyword is taken as the classification contribution degree of the phrase; if it is not, the classification contribution degree of the phrase is set to 0.
Step 215, calculating the probability that the text to be recognized belongs to each text type contained in the first sample set according to the classification contribution degree corresponding to the word group contained in the text to be recognized and the first text classifier.
The first text classifier is used to calculate the probability that the text to be recognized belongs to each of the text types corresponding to the first sample set.
Step 216, determining the text type corresponding to the maximum value in the probability that the text to be recognized belongs to each text type contained in the first sample set as the text type of the text to be recognized.
The text type corresponding to the maximum probability calculated by the first text classifier is determined as the text type of the text to be recognized. It should be noted that this probability is not the actual probability that the text belongs to a given text type; for example, a value of 80% for text type X does not mean that the actual probability of the text belonging to type X is 80%. The value is only used for comparison with the values for the other text types, so as to find the text type with the maximum value and determine it as the text type of the text to be recognized.
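Steps 210 to 216 combine into the following two-stage decision sketch, reusing the helpers from the earlier sketches; `segment` is an assumed word-segmentation function, and 0.5 is an assumed value for the preset classification probability:

```python
import torch

def classify(text, segment, cnn_model, vocab, cnn_labels,
             table, small_labels, preset_prob=0.5):
    """Try the second (CNN) model first; fall back to the first (keyword)
    classifier when the CNN's best probability is below the preset value."""
    words = segment(text)                                    # step 210
    ids = to_fixed_length([vocab.get(w, 0) for w in words])  # step 211
    with torch.no_grad():
        probs = cnn_model(torch.tensor([ids]))[0]            # step 212
    best = int(probs.argmax())
    if probs[best].item() >= preset_prob:                    # step 213
        return cnn_labels[best]
    # Steps 214-216: score against the small-sample text types instead.
    label, _scores = classify_small_sample(words, table, small_labels)
    return label
```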
By applying the technical solution of this embodiment, a first sample set is established from the text types with few samples and a second sample set from the text types with sufficient samples, according to the number of text samples of each type. The first sample set is used to build a first text classifier suited to distinguishing the under-sampled text types, and the second sample set is used to train a second text classification model suited to distinguishing the well-sampled text types, which avoids inaccurate modeling caused by the uneven distribution of text samples across types. Further, when classifying a text to be recognized, the second text classification model is applied first; if the text does not belong to a text type corresponding to the second sample set, the first text classifier is used to recognize its type. This solves the difficulty of recognizing text types with few samples and improves overall recognition accuracy.
Further, as a specific implementation of the method in fig. 1, an embodiment of the present application provides a text classification apparatus, as shown in fig. 3, the apparatus includes: the system comprises a sample acquisition module 31, a sample set construction module 32, a keyword extraction module 33, a classification contribution degree calculation module 34, a first classifier construction module 35, a second classification model training module 36 and a classification module 37.
A sample obtaining module 31, configured to obtain text samples of different text types;
the sample set constructing module 32 is configured to divide the text samples into a first sample set and a second sample set according to the number of the text samples of each text type, where the number of the text samples of any text type included in the first sample set is smaller than a preset threshold, and the number of the text samples of any text type included in the second sample set is greater than or equal to the preset threshold;
a keyword extraction module 33, configured to extract feature keywords from the text samples included in the first sample set;
a classification contribution degree calculation module 34, configured to calculate, according to the text samples included in the first sample set, a classification contribution degree of the feature keyword to the text sample of each text type included in the first sample set;
a first classifier building module 35, configured to build a first text classifier according to the classification contribution degree;
a second classification model training module 36, configured to train a second text classification model using a second sample set;
and the classification module 37 is configured to classify the text to be recognized according to the first text classifier and the second text classification model.
In a specific application scenario, as shown in fig. 4, the keyword extraction module 33 specifically includes: a feature word extraction unit 331 and a keyword extraction unit 332.
The feature word extraction unit 331 is configured to perform word segmentation processing on the text samples included in the first sample set according to a preset phrase comparison table to extract feature words;
the keyword extraction unit 332 is configured to count the number of each feature word, and determine a feature keyword according to the number of each feature word.
The classification contribution degree calculation module 34 is specifically configured to calculate the classification contribution degree according to the classification contribution degree calculation formula
P(Ci|Xj) = P(Xj|Ci) × P(Ci) / P(X)
where i = 1, 2, ..., m, and m is the number of text types contained in the first sample set; Ci denotes the i-th text type; j = 1, 2, ..., n, and n is the number of feature keywords; Xj denotes the j-th feature keyword; P(Xj|Ci) represents the probability that the feature keyword Xj occurs in the text samples of text type Ci; P(Ci) represents the ratio of the number of text samples of text type Ci in the first sample set to the number of all text samples in the first sample set; P(X) represents a preset coefficient; and P(Ci|Xj) represents the classification contribution degree of the feature keyword Xj to the text samples of text type Ci.
The first classifier building module 35 is specifically configured to build the first text classifier according to the first text classification formula
P(Ci|Y) = P(Ci|y1) + P(Ci|y2) + ... + P(Ci|yl)
where k = 1, 2, ..., l, and l is the number of feature words in the sample Y to be predicted; yk denotes the k-th feature word of the sample Y to be predicted; and P(Ci|yk) denotes the classification contribution degree of the feature word yk to the text samples of text type Ci. If the feature word yk is the same as a feature keyword Xj, then P(Ci|yk) = P(Ci|Xj); if the feature word yk matches no feature keyword Xj, then P(Ci|yk) = 0.
In a specific application scenario, as shown in fig. 4, the second classification model training module 36 specifically includes: a first word segmentation unit 361, a first word vector construction unit 362 and a second classification model training unit 363.
A first segmentation unit 361, configured to perform segmentation processing on the text samples in the second sample set to obtain a phrase corresponding to each text sample;
a first word vector constructing unit 362, configured to construct a text vector corresponding to each text sample according to the phrase corresponding to each text sample;
and the second classification model training unit 363 is configured to train the second text classification model by using the text vector and the text type of the text sample corresponding to the text vector, where the second text classification model is a convolutional neural network model.
In a specific application scenario, as shown in fig. 4, the classification module 37 specifically includes: a second word segmentation unit 371, a second word vector construction unit 372, a second text type identification unit 373, a second text type determination unit 374, a classification contribution degree determination unit 375, a first text type identification unit 376 and a first text type determination unit 377.
The second word segmentation unit 371 is configured to perform word segmentation on the text to be recognized, so as to obtain a word group included in the text to be recognized;
a second word vector construction unit 372, configured to convert a word group included in the text to be recognized into a word vector corresponding to the text to be recognized;
the second text type identification unit 373 is configured to input the word vector corresponding to the text to be identified into the second text classification model, so as to obtain a probability that the text to be identified belongs to each text type included in the second sample set;
and a second text type determining unit 374, configured to determine, if the maximum value in the probabilities is greater than or equal to the preset classification probability, the text type corresponding to the maximum value in the probabilities as the text type of the text to be recognized.
The classification contribution degree determining unit 375 is configured to determine, according to the classification contribution degree corresponding to the feature keyword, a classification contribution degree corresponding to a phrase included in the text to be recognized, if a maximum value in the probabilities is smaller than a preset classification probability;
a first text type identifying unit 376, configured to calculate, according to the classification contribution degree corresponding to the word group included in the text to be identified and the first text classifier, a probability that the text to be identified belongs to each text type included in the first sample set;
the first text type determining unit 377 is configured to determine the text type corresponding to the maximum value of the probabilities that the text to be recognized belongs to each text type included in the first sample set, as the text type of the text to be recognized.
It should be noted that other corresponding descriptions of the functional units related to the text classification device provided in the embodiment of the present application may refer to the corresponding descriptions in fig. 1 and fig. 2, and are not described herein again.
Based on the method shown in fig. 1 and fig. 2, correspondingly, the embodiment of the present application further provides a storage medium, on which a computer program is stored, and the program, when executed by a processor, implements the text classification method shown in fig. 1 and fig. 2.
Based on such understanding, the technical solution of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash drive, or a removable hard disk) and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the method of the implementation scenarios of the present application.
Based on the method shown in fig. 1 and fig. 2 and the virtual device embodiment shown in fig. 3 and fig. 4, in order to achieve the above object, an embodiment of the present application further provides a computer device, which may specifically be a personal computer, a server, a network device, and the like, where the computer device includes a storage medium and a processor; a storage medium for storing a computer program; a processor for executing a computer program to implement the text classification method as described above with reference to fig. 1 and 2.
Optionally, the computer device may also include a user interface, a network interface, a camera, Radio Frequency (RF) circuitry, sensors, audio circuitry, a WI-FI module, and so forth. The user interface may include a Display screen (Display), an input unit such as a keypad (Keyboard), etc., and the optional user interface may also include a USB interface, a card reader interface, etc. The network interface may optionally include a standard wired interface, a wireless interface (e.g., a bluetooth interface, WI-FI interface), etc.
It will be appreciated by those skilled in the art that the present embodiment provides a computer device architecture that is not limiting of the computer device, and that may include more or fewer components, or some components in combination, or a different arrangement of components.
The storage medium may further include an operating system and a network communication module. An operating system is a program that manages and maintains the hardware and software resources of a computer device, supporting the operation of information handling programs, as well as other software and/or programs. The network communication module is used for realizing communication among components in the storage medium and other hardware and software in the entity device.
Through the description of the above embodiments, those skilled in the art can clearly understand that the present application can be implemented by software plus a necessary general-purpose hardware platform, or by hardware. First, according to the number of samples, the text samples of each text type are grouped: the samples corresponding to the text types with few samples establish a first sample set, and the samples corresponding to the text types with sufficient samples establish a second sample set. Then, the first sample set and the second sample set are used to establish, respectively, a first text classifier for the small-sample text types and a second text classification model for the large-sample text types, where the first text classifier is determined according to the classification contribution degree of the feature keywords extracted from the first sample set to each text type. Finally, the text type of the text to be recognized is identified using the first text classifier and the second text classification model. For the sample texts of different text types, the present application establishes a first text classifier suited to recognizing the small-sample text types and a second text classification model suited to recognizing the large-sample text types. Compared with the prior art, which builds a single model for all text types so that the text types with insufficient samples are drowned out by the text types with sufficient samples, building separate classifiers or classification models for the different text types achieves higher recognition accuracy.
Those skilled in the art will appreciate that the figures are merely schematic representations of one preferred implementation scenario and that the blocks or flow diagrams in the figures are not necessarily required to practice the present application. Those skilled in the art will appreciate that the modules in the devices in the implementation scenario may be distributed in the devices in the implementation scenario according to the description of the implementation scenario, or may be located in one or more devices different from the present implementation scenario with corresponding changes. The modules of the implementation scenario may be combined into one module, or may be further split into a plurality of sub-modules.
The above application serial numbers are for description purposes only and do not represent the superiority or inferiority of the implementation scenarios. The above disclosure is only a few specific implementation scenarios of the present application, but the present application is not limited thereto, and any variations that can be made by those skilled in the art are intended to fall within the scope of the present application.

Claims (10)

1. A method of text classification, comprising:
acquiring text samples of different text types;
dividing the text samples into a first sample set and a second sample set according to the number of the text samples of each text type, wherein the number of the text samples of any text type contained in the first sample set is smaller than a preset threshold value, and the number of the text samples of any text type contained in the second sample set is larger than or equal to the preset threshold value;
extracting feature keywords from the text samples contained in the first sample set;
calculating the classification contribution degree of the feature keywords to the text samples of each text type contained in the first sample set according to the text samples contained in the first sample set;
constructing a first text classifier according to the classification contribution degree;
training a second text classification model by using the second sample set;
and classifying the texts to be recognized according to the first text classifier and the second text classification model.
2. The method according to claim 1, wherein the extracting feature keywords from the text samples included in the first sample set specifically comprises:
performing word segmentation processing on the text samples contained in the first sample set according to a preset word group comparison table to extract feature words;
and counting the number of each feature word, and determining the feature keywords according to the number of each feature word.
3. The method according to claim 2, wherein said calculating, according to the text samples included in the first sample set, a classification contribution of the feature keyword to the text sample of each text type included in the first sample set specifically comprises:
calculating the classification contribution degree according to a classification contribution degree calculation formula, wherein the calculation formula of the classification contribution degree is
P(Ci|Xj) = P(Xj|Ci) × P(Ci) / P(X)
wherein i = 1, 2, ..., m, and m is the number of text types contained in the first sample set; Ci denotes the i-th text type; j = 1, 2, ..., n, and n is the number of the feature keywords; Xj denotes the j-th feature keyword; P(Xj|Ci) represents the probability that the feature keyword Xj occurs in the text samples of text type Ci; P(Ci) represents the ratio of the number of text samples of the text type Ci in the first sample set to the number of all text samples in the first sample set; P(X) represents a preset coefficient; and P(Ci|Xj) represents the classification contribution degree of the feature keyword Xj to the text samples of the text type Ci.
4. The method according to claim 3, wherein constructing a first text classifier according to the classification contribution degree comprises:
constructing the first text classifier according to a first text classification formula, wherein the first text classification formula is
P(Ci|Y) = P(Ci|y1) + P(Ci|y2) + ... + P(Ci|yl)
wherein k = 1, 2, ..., l, and l is the number of feature words in the sample Y to be predicted; yk denotes the k-th feature word of the sample Y to be predicted; P(Ci|yk) denotes the classification contribution degree of the feature word yk to the text samples of text type Ci; if the feature word yk is the same as a feature keyword Xj, then P(Ci|yk) = P(Ci|Xj); and if the feature word yk matches no feature keyword Xj, then P(Ci|yk) = 0.
5. The method of claim 4, wherein training a second text classification model using the second sample set comprises:
performing word segmentation processing on the text samples in the second sample set to obtain a word group corresponding to each text sample;
constructing a text vector corresponding to each text sample according to the phrase corresponding to each text sample;
and training the second text classification model by using the text vector and the text type of the text sample corresponding to the text vector, wherein the second text classification model is a convolutional neural network model.
6. The method according to claim 5, wherein the classifying the text to be recognized according to the first text classifier and the second text classification model specifically includes:
performing word segmentation processing on the text to be recognized to obtain word groups contained in the text to be recognized;
converting phrases contained in the text to be recognized into word vectors corresponding to the text to be recognized;
inputting the word vector corresponding to the text to be recognized into the second text classification model to obtain the probability that the text to be recognized belongs to each text type contained in the second sample set;
and if the maximum value in the probabilities is larger than or equal to the preset classification probability, determining the text type corresponding to the maximum value in the probabilities as the text type of the text to be recognized.
7. The method according to claim 6, wherein the classifying the text to be recognized according to the first text classifier and the second text classification model further comprises:
if the maximum value in the probabilities is smaller than the preset classification probability, determining the classification contribution degree corresponding to the phrase contained in the text to be recognized according to the classification contribution degree corresponding to the feature key words;
calculating the probability that the text to be recognized belongs to each text type contained in the first sample set according to the classification contribution degree corresponding to the word group contained in the text to be recognized and the first text classifier;
and determining the text type corresponding to the maximum value in the probability that the text to be recognized belongs to each text type contained in the first sample set as the text type of the text to be recognized.
8. A text classification apparatus, comprising:
the sample acquisition module is used for acquiring text samples of different text types;
a sample set construction module, configured to divide the text samples into a first sample set and a second sample set according to the number of the text samples of each text type, where the number of the text samples of any text type included in the first sample set is smaller than a preset threshold, and the number of the text samples of any text type included in the second sample set is greater than or equal to the preset threshold;
a keyword extraction module, configured to extract feature keywords from the text samples included in the first sample set;
a classification contribution degree calculation module, configured to calculate, according to the text samples included in the first sample set, a classification contribution degree of the feature keyword to a text sample of each text type included in the first sample set;
the first classifier building module is used for building a first text classifier according to the classification contribution degree;
the second classification model training module is used for training a second text classification model by utilizing the second sample set;
and the classification module is used for classifying the texts to be recognized according to the first text classifier and the second text classification model.
9. A storage medium on which a computer program is stored, which program, when being executed by a processor, is adapted to carry out the text classification method of any one of claims 1 to 7.
10. A computer device comprising a storage medium, a processor and a computer program stored on the storage medium and executable on the processor, wherein the processor implements the text classification method of any one of claims 1 to 7 when executing the program.
CN201910390290.8A 2019-05-10 2019-05-10 Text classification method and device, storage medium and computer equipment Active CN110287311B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910390290.8A CN110287311B (en) 2019-05-10 2019-05-10 Text classification method and device, storage medium and computer equipment


Publications (2)

Publication Number Publication Date
CN110287311A (en) 2019-09-27
CN110287311B (en) 2023-05-26

Family

ID=68001583

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910390290.8A Active CN110287311B (en) 2019-05-10 2019-05-10 Text classification method and device, storage medium and computer equipment

Country Status (1)

Country Link
CN (1) CN110287311B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009078096A1 (en) * 2007-12-18 2009-06-25 Fujitsu Limited Generating method of two class classification prediction model, program for generating classification prediction model and generating device of two class classification prediction model
CN102081627A (en) * 2009-11-27 2011-06-01 北京金山软件有限公司 Method and system for determining contribution degree of word in text
CN106294466A (en) * 2015-06-02 2017-01-04 富士通株式会社 Disaggregated model construction method, disaggregated model build equipment and sorting technique
CN106610949A (en) * 2016-09-29 2017-05-03 四川用联信息技术有限公司 Text feature extraction method based on semantic analysis
CN109583474A (en) * 2018-11-01 2019-04-05 华中科技大学 A kind of training sample generation method for the processing of industrial big data
CN109492026A (en) * 2018-11-02 2019-03-19 国家计算机网络与信息安全管理中心 A kind of Telecoms Fraud classification and Detection method based on improved active learning techniques

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111400499A (en) * 2020-03-24 2020-07-10 网易(杭州)网络有限公司 Training method of document classification model, document classification method, device and equipment
CN112329475A (en) * 2020-11-03 2021-02-05 海信视像科技股份有限公司 Statement processing method and device
CN112329475B (en) * 2020-11-03 2022-05-20 海信视像科技股份有限公司 Statement processing method and device
CN113011503A (en) * 2021-03-17 2021-06-22 彭黎文 Data evidence obtaining method of electronic equipment, storage medium and terminal
CN113011503B (en) * 2021-03-17 2021-11-23 彭黎文 Data evidence obtaining method of electronic equipment, storage medium and terminal
CN113051385A (en) * 2021-04-28 2021-06-29 杭州网易再顾科技有限公司 Intention recognition method, medium, device and computing equipment
CN116226382A (en) * 2023-02-28 2023-06-06 北京数美时代科技有限公司 Text classification method and device for given keywords, electronic equipment and medium
CN116226382B (en) * 2023-02-28 2023-08-01 北京数美时代科技有限公司 Text classification method and device for given keywords, electronic equipment and medium

Also Published As

Publication number Publication date
CN110287311B (en) 2023-05-26

Similar Documents

Publication Publication Date Title
CN108628971B (en) Text classification method, text classifier and storage medium for unbalanced data set
CN110287311A (en) File classification method and device, storage medium, computer equipment
CN109446517B (en) Reference resolution method, electronic device and computer readable storage medium
CN109471944B (en) Training method and device of text classification model and readable storage medium
CN110362677B (en) Text data category identification method and device, storage medium and computer equipment
CN107391760A (en) User interest recognition methods, device and computer-readable recording medium
CN107704495A (en) Training method, device and the computer-readable recording medium of subject classification device
US20200065573A1 (en) Generating variations of a known shred
CN107180084B (en) Word bank updating method and device
CN110807314A (en) Text emotion analysis model training method, device and equipment and readable storage medium
JP6897749B2 (en) Learning methods, learning systems, and learning programs
US20170076152A1 (en) Determining a text string based on visual features of a shred
CN113254643B (en) Text classification method and device, electronic equipment and text classification program
CN110046648B (en) Method and device for classifying business based on at least one business classification model
WO2022121163A1 (en) User behavior tendency identification method, apparatus, and device, and storage medium
CN108133224B (en) Method for evaluating complexity of classification task
CN111694954B (en) Image classification method and device and electronic equipment
CN114579743A (en) Attention-based text classification method and device and computer readable medium
CN106203508A (en) A kind of image classification method based on Hadoop platform
CN110717407A (en) Human face recognition method, device and storage medium based on lip language password
CN104966109A (en) Medical laboratory report image classification method and apparatus
CN115713669B (en) Image classification method and device based on inter-class relationship, storage medium and terminal
CN109657710A (en) Data screening method, apparatus, server and storage medium
CN111930885B (en) Text topic extraction method and device and computer equipment
CN111783088B (en) Malicious code family clustering method and device and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant