CN110287311B

CN110287311B - Text classification method and device, storage medium and computer equipment

Info

Publication number: CN110287311B
Application number: CN201910390290.8A
Authority: CN
Inventors: 钱柏丞
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2019-05-10
Filing date: 2019-05-10
Publication date: 2023-05-26
Anticipated expiration: 2039-05-10
Also published as: CN110287311A

Abstract

The application discloses a text classification method and device, a storage medium and computer equipment, wherein the method comprises the following steps: acquiring text samples of different text types; dividing the text samples into a first sample set and a second sample set according to the number of the text samples of each text type, wherein the number of the text samples of any text type contained in the first sample set is smaller than a preset threshold value, and the number of the text samples of any text type contained in the second sample set is larger than or equal to the preset threshold value; extracting characteristic keywords from text samples contained in the first sample set; calculating the classification contribution degree of the characteristic keywords to the text samples of each text type contained in the first sample set according to the text samples contained in the first sample set; constructing a first text classifier according to the classification contribution degree; training a second text classification model using the second sample set; and classifying the text to be identified according to the first text classifier and the second text classification model.

Description

Text classification method and device, storage medium and computer equipment

Technical Field

The present disclosure relates to the field of text classification technologies, and in particular, to a text classification method and apparatus, a storage medium, and a computer device.

Background

For the text classification problem in the field of natural language processing, the problem of inclination of training sample data is often encountered when training a machine learning or deep learning model, namely, the number of training samples of part of text types is sufficient, and the number of training samples of the other part of text types is smaller. Uneven distribution of training samples can cause model training bias, and the prediction difficulty of text types with fewer training samples is high, so that the overall prediction effect of the model is reduced.

The text classification training method in the prior art often ignores the problem, and performs a same-view on all samples or performs an oversampling strategy on small samples to supplement the samples. Neglecting this problem can lead to poor model classification, while the oversampling strategy scale is difficult to grasp, which easily results in overfitting and also cannot effectively promote model classification.

Disclosure of Invention

In view of the above, the present application provides a text classification method and apparatus, a storage medium, and a computer device, which respectively establish a classifier or a classification model for different text types with non-uniform sample number distribution, so that the recognition accuracy is higher.

According to one aspect of the present application, there is provided a text classification method, including:

acquiring text samples of different text types;

dividing the text samples into a first sample set and a second sample set according to the number of the text samples of each text type, wherein the number of the text samples of any text type contained in the first sample set is smaller than a preset threshold value, and the number of the text samples of any text type contained in the second sample set is larger than or equal to the preset threshold value;

extracting feature keywords from the text samples contained in the first sample set;

calculating the classification contribution degree of the characteristic keywords to the text samples of each text type contained in the first sample set according to the text samples contained in the first sample set;

constructing a first text classifier according to the classification contribution degree;

training a second text classification model using the second sample set;

and classifying the text to be recognized according to the first text classifier and the second text classification model.

According to another aspect of the present application, there is provided a text classification apparatus, including:

the sample acquisition module is used for acquiring text samples of different text types;

a sample set construction module, configured to divide the text samples into a first sample set and a second sample set according to the number of the text samples of each text type, where the number of the text samples of any text type included in the first sample set is smaller than a preset threshold value, and the number of the text samples of any text type included in the second sample set is greater than or equal to the preset threshold value;

the keyword extraction module is used for extracting characteristic keywords from the text samples contained in the first sample set;

the classification contribution degree calculation module is used for calculating the classification contribution degree of the characteristic keywords to the text samples of each text type contained in the first sample set according to the text samples contained in the first sample set;

the first classifier construction module is used for constructing a first text classifier according to the classification contribution degree;

the second classification model training module is used for training a second text classification model by using the second sample set;

and the classification module is used for classifying the text to be identified according to the first text classifier and the second text classification model.

According to still another aspect of the present application, there is provided a storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described text classification method.

According to still another aspect of the present application, there is provided a computer apparatus including a storage medium, a processor, and a computer program stored on the storage medium and executable on the processor, the processor implementing the above text classification method when executing the program.

By means of the technical scheme, the text classification method, the text classification device, the storage medium and the computer equipment provided by the application are characterized in that firstly, text samples of all text types are subjected to sample number size, a first sample set is built by using samples corresponding to text types with fewer sample numbers, and a second sample set is built by using samples corresponding to text types with sufficient sample numbers; then, a first text classifier and a second text classification model are established by utilizing a first sample set and a second sample set respectively aiming at the text type of a small sample and the text type of a large sample, wherein the first text classifier determines the classification contribution degree of the extracted characteristic keywords in the first sample set to each text type; and finally, realizing the text type identification of the text to be identified by using the first text classifier and the second text classification model. According to the method and the device, a first text classifier suitable for recognizing small sample text types and a second text classification model suitable for recognizing large sample text types are respectively built aiming at sample texts of different text types, and compared with the defect that in the prior art, only one model for recognizing all the text types is built, the text types with insufficient sample numbers are submerged in the text types with sufficient sample numbers, the method and the device are higher in recognition accuracy when being used for respectively building the classifier or the classification model aiming at different text types.

The foregoing description is only an overview of the technical solutions of the present application, and may be implemented according to the content of the specification in order to make the technical means of the present application more clearly understood, and in order to make the above-mentioned and other objects, features and advantages of the present application more clearly understood, the following detailed description of the present application will be given.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:

fig. 1 shows a schematic flow chart of a text classification method according to an embodiment of the present application;

FIG. 2 is a flow chart illustrating another text classification method according to an embodiment of the present application;

fig. 3 is a schematic structural diagram of a text classification device according to an embodiment of the present application;

fig. 4 shows a schematic structural diagram of another text classification device according to an embodiment of the present application.

Detailed Description

The present application will be described in detail hereinafter with reference to the accompanying drawings in conjunction with embodiments. It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other.

In this embodiment, a text classification method is provided, as shown in fig. 1, and the method includes:

and step 101, acquiring text samples of different text types.

In this embodiment, taking some text types in identifying legal aspects as an example, text samples of different text types related to legal are obtained, where the text samples may be a segment of text, and the text types may specifically include the following types: theft, dangerous driving, holding others for drug, vending, fraud, slurs, money laundering, resale, smuggling, etc.

Step 102, dividing the text samples into a first sample set and a second sample set according to the number of the text samples of each text type, wherein the number of the text samples of any text type contained in the first sample set is smaller than a preset threshold value, and the number of the text samples of any text type contained in the second sample set is larger than or equal to the preset threshold value.

In the above obtained text samples, like some common crime types: the sample number of theft, dangerous driving, taking other people to take the poison is quite sufficient, and some unusual criminals: the sample number of defamation, money laundering, resale of cultural relics, smuggling weapon ammunition and the like is small. This results in sample data tilting, where a sufficient number of samples are marked as large samples and an insufficient number of samples are marked as small samples. The second sample set is established with all large samples and the first sample set is established with all small samples. The method comprises the steps of marking a large sample and a small sample according to a preset threshold, marking the large sample and the small sample as the small sample if the number of the text samples of a certain type is smaller than the preset threshold, and marking the large sample as the large sample if the number of the text samples of a certain type is larger than or equal to the preset threshold, so that a first sample set and a second sample set are established, and identification models of the text types contained in the first sample set and the text types contained in the second sample set are established respectively.

And step 103, extracting feature keywords from the text samples contained in the first sample set.

For text types with a small number of samples, to build a classifier that identifies these text types, feature keywords, which are typically phrases that often occur in the text samples contained in the first sample set, should first be extracted from all the text samples in the first sample set.

Step 104, calculating the classification contribution degree of the feature keywords to the text samples of each text type contained in the first sample set according to the text samples contained in the first sample set.

According to the extracted feature keywords, the classification contribution degree of each feature keyword to each text type contained in the first sample set is calculated, and the meaning of the classification contribution degree can be described as how much the probability that a certain feature keyword belongs to a certain text type is increased if the certain feature keyword appears in the certain text. For example, when a "heroin" appears in a piece of text, assuming that the classification contribution degree of the "heroin" to the text type of "drug" is 5% and the classification contribution degree of the "heroin" to the text type of "drug" is 4%, when the phrase of "heroin" appears in a piece of text, the probability that the piece of text belongs to the drug type increases by 5%, and the probability that the piece of text belongs to the drug type increases by 4%.

And 105, constructing a first text classifier according to the classification contribution degree.

After the classification contribution degree of each feature keyword to different text types is calculated, a first text classifier can be established according to the classification contribution degree, wherein the first text classifier is used for calculating the probability that the text to be identified belongs to each text type contained in a first sample set, for example, the first sample set contains text types such as defamation, money laundering, resale, smuggling weapon ammunition and the like, and the first text classifier can be used for calculating the probability that the text to be identified belongs to any one of the text types such as defaults, money laundering, resale, smuggling weapon ammunition and the like.

And step 106, training a second text classification model by using the second sample set.

For text types with sufficient sample number, a second text classification model may be trained by using a second sample set, where the second text classification model is used to calculate a probability that the text to be identified belongs to each text type included in the second sample set, for example, the second sample set includes text types such as theft, dangerous driving, and drug absorption by other people, and the probability that the text to be identified belongs to any one of the text types such as theft, dangerous driving, and drug absorption by other people may be calculated by using the trained second text classification model.

And step 107, classifying the text to be recognized according to the first text classifier and the second text classification model.

After the first text classifier and the second text classification model are obtained, the text to be identified can be classified, and the text type of the text to be identified can be identified, wherein the text to be identified can belong to the text type contained in the first sample set or the text type contained in the second sample set.

By applying the technical scheme of the embodiment, firstly, according to the size of the sample number, the text samples of each text type are used for establishing a first sample set by using the samples corresponding to the text types with the smaller sample number and establishing a second sample set by using the samples corresponding to the text types with sufficient sample number; then, a first text classifier and a second text classification model are established by utilizing a first sample set and a second sample set respectively aiming at the text type of a small sample and the text type of a large sample, wherein the first text classifier determines the classification contribution degree of the extracted characteristic keywords in the first sample set to each text type; and finally, realizing the text type identification of the text to be identified by using the first text classifier and the second text classification model. According to the method and the device, a first text classifier suitable for recognizing small sample text types and a second text classification model suitable for recognizing large sample text types are respectively built aiming at sample texts of different text types, and compared with the defect that in the prior art, only one model for recognizing all the text types is built, the text types with insufficient sample numbers are submerged in the text types with sufficient sample numbers, the method and the device are higher in recognition accuracy when being used for respectively building the classifier or the classification model aiming at different text types.

Further, as a refinement and extension of the foregoing embodiment, in order to fully describe the implementation procedure of this embodiment, another text classification method is provided, as shown in fig. 2, where the method includes:

step 201, obtaining text samples of different text types;

step 202, dividing the text samples into a first sample set and a second sample set according to the number of the text samples of each text type, wherein the number of the text samples of any text type contained in the first sample set is smaller than a preset threshold value, and the number of the text samples of any text type contained in the second sample set is larger than or equal to the preset threshold value.

According to the size of the text sample number corresponding to each text type, the text samples are divided into a first sample set and a second sample set, for example, the text samples corresponding to the text types with the text sample number smaller than 1000 are divided into the first sample set, and the text samples corresponding to the text types with the text sample number larger than or equal to 1000 are divided into the second sample set.

And 203, performing word segmentation processing on the text samples contained in the first sample set according to a preset phrase comparison table to extract feature words.

According to a preset phrase comparison table, the text samples in the first sample set are segmented, the preset phrase comparison table can be regarded as a dictionary, the dictionary contains a plurality of preset phrases which are beneficial to text classification, the word segmentation process is to extract phrases which are the same as the phrases in the dictionary from the text samples, and the extracted phrases are used as characteristic words of the text samples.

Step 204, counting the number of each feature word, and determining feature keywords according to the number of each feature word.

After extracting the feature words, the feature words need to be further screened, specifically, the number of occurrence times of each feature word in a text sample, namely the number of each feature word, is counted, then the feature words with more occurrence times are screened out to be used as feature keywords, for example, the feature words with 60% of the occurrence times of the feature words are used as the key feature words, 60% of the feature words can be adjusted to other values, and the method is not limited.

In addition, in order to avoid that the same feature word repeatedly appears in one or several text samples, the number of the feature word is higher than that of other feature words because of the one or several text samples, for example, phrase a appears 50 times in one text sample, appears only 10 times in other text samples, and phrase B appears 30 times in all text samples, it is obviously not reasonable if phrase a is selected as a key feature word if only one seat feature keyword can be selected in phrase a and phrase B, and the number of times of appearance of the feature word in each text sample is not considered. Therefore, in the embodiment of the present application, when the number of each feature word is counted, the same feature word in each text sample is counted only once, that is, the number of each feature word should be the number of the text samples containing the feature word.

Step 205, calculating the classification contribution degree of the feature keywords to the text samples of each text type contained in the first sample set according to the text samples contained in the first sample set.

Specifically, the classification contribution is calculated according to a classification contribution calculation formula, wherein the classification contribution calculation formula is that

Where i= (1, 2,) m, m being the number of text types contained in the first sample set, C _i Represents the i-th text type, j= (1, 2,) and n, n being the number of feature keywords, X _j Represents the j-th feature keyword, P (X _j |C _i ) Representing characteristic keyword X _j In text type C _i Probability of occurrence in text samples of (C), P (C) _i ) Representing text type C _i The ratio of the number of text samples in the first sample set to the number of all text samples in the first sample set, P (X) represents a predetermined coefficient, P (C) _i |X _j ) Representing characteristic keyword X _j For text typesC _i The classification contribution of the text sample of (c).

For example, C1 represents a defamation text type, C2 represents a money laundering text type, C3 represents a resale text type, and C4 represents a smuggling weapon ammunition text type.

Note that P (X _j |C _i ) The specific calculation method is exemplified as follows: assuming that there are 10 training samples of class labels Ci, the probability is 9/10=90% if the feature keyword Xj appears in 9 samples among the 10 samples, and is 3/10=30% if the feature keyword Xj appears in 3 samples among the 10 samples.

And 206, constructing a first text classifier according to the classification contribution degree.

Specifically, a first text classifier is constructed according to a first text classification formula, wherein the first text classification formula is that

Where k= (1, 2,) i, i is the number of feature words in the sample Y to be predicted, Y _k The kth feature word, P (C _i |y _k ) Representing the characteristic word y _k For text type C _i Classification contribution of text sample of (a) if feature word y _k And feature keyword X _j Identical, then P (C _i |y _k )＝P(C _i |X _j ) If the characteristic word y _k And feature keyword X _j Identical, then P (C _i |y _k )＝0。

Specifically, assuming that the first sample set contains 20 feature keywords in total, then for any one text type Ci, P (C _i |X _j ) Then 20 values are represented, which represent the probability of the 20 feature keywords contributing to the classification of category Ci, and P (C _i |y _k ) Y in (a) _k Representing characteristic words in a sample to be predicted, specifically according to presetExtracting the phrase comparison table, so that the condition that one or more of the feature words in the sample to be predicted are not in the feature keywords X corresponding to the first sample set can occur, and for the feature words in the sample to be predicted, which are inconsistent with the original feature keywords X, marking the classification tribute word donation probability of the feature words as 0 in the calculation.

Step 207, word segmentation processing is performed on the text samples in the second sample set, so as to obtain phrases corresponding to each text sample.

The text sample is subjected to word segmentation, illegal characters can be removed, stop word processing is performed, and the phrase corresponding to the four texts obtained after the word segmentation of the four texts is assumed to be respectively: text a [ "a" ]; text B [ "B", "c", "B" ]; text C [ "a", "C" ]; text D [ "c", "D" ].

Step 208, constructing a text vector corresponding to each text sample according to the phrase corresponding to each text sample.

The phrase corresponding to each text sample is constructed, the text sample selected in the embodiment of the application is generally short text, the text vector is generally controlled in a preset dimension, interception is performed if the text is longer than the preset dimension, and 0 element is used for supplementing if the text is shorter than the preset dimension.

Step 209, training a second text classification model by using the text vector and the text type of the text sample corresponding to the text vector, wherein the second text classification model is a convolutional neural network model.

The second text classification model may be a convolutional neural network model, or may be an SVM support vector machine model or other models that are currently used for text classification. Wherein, the architecture of convolutional neural network includes: convolution layer-pooling layer-full connection layer. The convolution layer is used as a feature extraction layer, text features are extracted through a filter, a feature map is generated through convolution kernel function operation and is output to a pooling layer, the pooling layer belongs to a feature mapping layer, downsampling is carried out on the feature map generated by the convolution layer, optimal features are output, the full-connection softmax layer is used for completing classification tasks, and classification probability of each text feature contained in a second sample set is output.

Step 210, word segmentation is performed on the text to be recognized, so as to obtain a phrase contained in the text to be recognized.

When the text type of the text to be recognized is recognized, firstly, word segmentation processing is carried out on the text to be recognized to obtain a plurality of phrases contained in the text to be recognized, and then the recognition of the text type is realized according to the phrases.

Step 211, converting the phrase contained in the text to be recognized into a word vector corresponding to the text to be recognized.

Since the text types contained in the second sample set are more common text types, whether the text to be recognized belongs to the text types contained in the second sample set or not is judged by utilizing the second text classification model, and particularly, word vector conversion is carried out on the word groups contained in the text to be recognized.

Step 212, inputting the word vector corresponding to the text to be recognized into the second text classification model to obtain the probability that the text to be recognized belongs to each text type contained in the second sample set.

And inputting the word vector corresponding to the text to be recognized into a second text classification model, wherein the second text classification model outputs the probability that the text to be recognized belongs to certain text types, and the certain text types refer to the text types contained in the text samples in the second sample set.

And 213, if the maximum value in the probabilities is greater than or equal to the preset classification probability, determining the text type corresponding to the maximum value in the probabilities as the text type of the text to be recognized.

And finding out the maximum probability value according to the calculated probability that the text to be recognized belongs to each text type contained in the second sample set, and if the maximum probability is greater than or equal to the preset classification probability, indicating that the probability that the text to be recognized belongs to the text type corresponding to the maximum probability value is very high, determining the text type corresponding to the maximum probability value as the text type of the text to be recognized.

Step 214, if the maximum value in the probabilities is smaller than the preset classification probability, determining the classification contribution degree corresponding to the phrase included in the text to be recognized according to the classification contribution degree corresponding to the feature keyword.

If the maximum probability value is smaller than the preset classification probability, the probability that the text to be identified belongs to the text type corresponding to the second sample set is lower, the text type of the text to be identified is identified by the first text classifier, and the classification contribution degree corresponding to the phrase extracted from the text to be identified is determined according to the calculated classification contribution degree corresponding to the characteristic key words. Specifically, if the phrase belongs to the above feature keyword, determining the classification contribution degree corresponding to the feature keyword as the classification contribution degree of the phrase, and if the phrase does not belong to the above feature keyword, determining the classification contribution degree of the phrase as 0.

Step 215, calculating the probability that the text to be recognized belongs to each text type contained in the first sample set according to the classification contribution degree corresponding to the phrase contained in the text to be recognized and the first text classifier.

Using a first text classifier, a probability is calculated that the text to be identified belongs to certain text types, where certain text types refer to text types corresponding to the first text set.

And step 216, determining the text type corresponding to the maximum value in the probability that the text to be recognized belongs to each text type contained in the first sample set as the text type of the text to be recognized.

And determining the text type corresponding to the maximum probability calculated by using the first text classifier as the text type of the text to be recognized. It should be noted that, the probability here is not an actual probability that the text to be recognized belongs to a certain text type, for example, if the probability that the text to be recognized belongs to the text type X is calculated to be 80%, and it does not represent that the actual probability that the text to be predicted belongs to the text type X is 80%, the probability is only used for comparing the probability with the probabilities that the text to be predicted belongs to other text types, so as to find the text type corresponding to the maximum probability and determine the text type as the text type of the text to be recognized.

According to the technical scheme, according to the number of text samples of different text types, a first sample set is built by using fewer text samples, a second sample set is built by using enough text samples, so that a first text classifier suitable for distinguishing the text types with insufficient sample numbers is built by using the first sample set, a second text classification model suitable for distinguishing the text types with sufficient sample numbers is trained by using the second sample set, the problem that the model is built inaccurately due to uneven distribution of different types of text samples is avoided, when the text to be recognized is classified, the text type of the text to be recognized is determined by using the second text classification model, and if the text to be recognized does not belong to the text type corresponding to the second sample set, the text type of the text to be recognized is recognized by using the first text classifier, so that the determination of the text type of the text to be recognized is realized, the problem that the recognition difficulty of the text type with fewer text samples is high is solved, and the overall recognition accuracy is improved.

Further, as a specific implementation of the method of fig. 1, an embodiment of the present application provides a text classification apparatus, as shown in fig. 3, where the apparatus includes: the system comprises a sample acquisition module 31, a sample set construction module 32, a keyword extraction module 33, a classification contribution calculation module 34, a first classifier construction module 35, a second classification model training module 36 and a classification module 37.

A sample acquiring module 31, configured to acquire text samples of different text types;

a sample set construction module 32, configured to divide the text samples into a first sample set and a second sample set according to the number of text samples of each text type, where the number of text samples of any text type included in the first sample set is less than a preset threshold value, and the number of text samples of any text type included in the second sample set is greater than or equal to the preset threshold value;

a keyword extraction module 33, configured to extract feature keywords from text samples included in the first sample set;

a classification contribution calculation module 34, configured to calculate, according to the text samples included in the first sample set, a classification contribution of the feature keyword to the text samples of each text type included in the first sample set;

a first classifier construction module 35, configured to construct a first text classifier according to the classification contribution;

a second classification model training module 36 for training a second text classification model using the second sample set;

the classification module 37 is configured to classify the text to be identified according to the first text classifier and the second text classification model.

In a specific application scenario, as shown in fig. 4, the keyword extraction module 33 specifically includes: a feature word extraction unit 331 and a keyword extraction unit 332.

The feature word extracting unit 331 is configured to perform word segmentation processing on a text sample included in the first sample set according to a preset phrase comparison table to extract feature words;

the keyword extraction unit 332 is configured to count the number of each feature word, and determine the feature keywords according to the number of each feature word.

The classification contribution calculating module 34 is specifically configured to calculate a classification contribution according to a classification contribution calculating formula, where the classification contribution calculating formula is

Where i= (1, 2,) m, m being the number of text types contained in the first sample set, C _i Represents the i-th text type, j= (1, 2,) and n, n being the number of feature keywords, X _j Represents the j-th feature keyword, P (X _j |C _i ) Representing characteristic keyword X _j In text type C _i Probability of occurrence in text samples of (C), P (C) _i ) Representing text type C _i The ratio of the number of text samples in the first sample set to the number of all text samples in the first sample set, P (X) represents a predetermined coefficient, P (C) _i |X _j ) Representing characteristic keyword X _j For text type C _i The classification contribution of the text sample of (c).

The first classifier construction module 35 is specifically configured to construct a first text classifier according to a first text classification formula, where the first text classification formula is

In a specific application scenario, as shown in fig. 4, the second classification model training module 36 specifically includes: a first word segmentation unit 361, a first word vector construction unit 362, and a second classification model training unit 363.

A first word segmentation unit 361, configured to perform word segmentation on the text samples in the second sample set to obtain a phrase corresponding to each text sample;

a first word vector constructing unit 362, configured to construct a text vector corresponding to each text sample according to the phrase corresponding to each text sample;

a second classification model training unit 363 is configured to train the second text classification model by using the text vector and the text type of the text sample corresponding to the text vector, where the second text classification model is a convolutional neural network model.

In a specific application scenario, as shown in fig. 4, the classification module 37 specifically includes: a second word segmentation unit 371, a second word vector construction unit 372, a second text type recognition unit 373, a second text type determination unit 374, a classification contribution degree determination unit 375, a first text type recognition unit 376, and a first text type determination unit 377.

The second word segmentation unit 371 is used for performing word segmentation on the text to be recognized to obtain a phrase contained in the text to be recognized;

a second word vector construction unit 372, configured to convert a phrase included in the text to be recognized into a word vector corresponding to the text to be recognized;

a second text type recognition unit 373, configured to input a word vector corresponding to the text to be recognized into a second text classification model, so as to obtain a probability that the text to be recognized belongs to each text type included in the second sample set;

and a second text type determining unit 374, configured to determine, as the text type of the text to be recognized, the text type corresponding to the maximum value in the probabilities if the maximum value in the probabilities is greater than or equal to the preset classification probability.

The classification contribution degree determining unit 375 is configured to determine, if the maximum value of the probabilities is smaller than a preset classification probability, a classification contribution degree corresponding to a phrase included in the text to be identified according to the classification contribution degree corresponding to the feature keyword;

a first text type recognition unit 376, configured to calculate, according to a classification contribution corresponding to a phrase included in the text to be recognized and a first text classifier, a probability that the text to be recognized belongs to each text type included in the first sample set;

the first text type determining unit 377 is configured to determine, as the text type of the text to be recognized, the text type corresponding to the maximum value in the probabilities that the text to be recognized belongs to each text type included in the first sample set.

It should be noted that, for other corresponding descriptions of each functional unit related to the text classification device provided in the embodiment of the present application, reference may be made to corresponding descriptions in fig. 1 and fig. 2, and no further description is given here.

Based on the above-mentioned methods shown in fig. 1 and 2, correspondingly, the embodiments of the present application further provide a storage medium, on which a computer program is stored, which when executed by a processor, implements the above-mentioned text classification method shown in fig. 1 and 2.

Based on such understanding, the technical solution of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.), and includes several instructions for causing a computer device (may be a personal computer, a server, or a network device, etc.) to perform the methods described in various implementation scenarios of the present application.

Based on the methods shown in fig. 1 and fig. 2 and the virtual device embodiments shown in fig. 3 and fig. 4, in order to achieve the above objects, the embodiments of the present application further provide a computer device, which may specifically be a personal computer, a server, a network device, etc., where the computer device includes a storage medium and a processor; a storage medium storing a computer program; a processor for executing a computer program to implement the text classification method as described above and shown in fig. 1 and 2.

Optionally, the computer device may also include a user interface, a network interface, a camera, radio frequency (RadioFrequency, RF) circuitry, sensors, audio circuitry, WI-FI modules, and the like. The user interface may include a Display screen (Display), an input unit such as a Keyboard (Keyboard), etc., and the optional user interface may also include a USB interface, a card reader interface, etc. The network interface may optionally include a standard wired interface, a wireless interface (e.g., bluetooth interface, WI-FI interface), etc.

It will be appreciated by those skilled in the art that the architecture of a computer device provided in the present embodiment is not limited to the computer device, and may include more or fewer components, or may combine certain components, or may be arranged in different components.

The storage medium may also include an operating system, a network communication module. An operating system is a program that manages and saves computer device hardware and software resources, supporting the execution of information handling programs and other software and/or programs. The network communication module is used for realizing communication among all components in the storage medium and communication with other hardware and software in the entity equipment.

Through the description of the above embodiments, it can be clearly understood by those skilled in the art that the present application may be implemented by means of software plus a necessary general hardware platform, or may be implemented by hardware, where first text samples of each text type are according to the size of the number of samples, a first sample set is established by using samples corresponding to text types with a smaller number of samples, and a second sample set is established by using samples corresponding to text types with a sufficient number of samples; then, a first text classifier and a second text classification model are established by utilizing a first sample set and a second sample set respectively aiming at the text type of a small sample and the text type of a large sample, wherein the first text classifier determines the classification contribution degree of the extracted characteristic keywords in the first sample set to each text type; and finally, realizing the text type identification of the text to be identified by using the first text classifier and the second text classification model. According to the method and the device, a first text classifier suitable for recognizing small sample text types and a second text classification model suitable for recognizing large sample text types are respectively built aiming at sample texts of different text types, and compared with the defect that in the prior art, only one model for recognizing all the text types is built, the text types with insufficient sample numbers are submerged in the text types with sufficient sample numbers, the method and the device are higher in recognition accuracy when being used for respectively building the classifier or the classification model aiming at different text types.

Those skilled in the art will appreciate that the drawings are merely schematic illustrations of one preferred implementation scenario, and that the modules or flows in the drawings are not necessarily required to practice the present application. Those skilled in the art will appreciate that modules in an apparatus in an implementation scenario may be distributed in an apparatus in an implementation scenario according to an implementation scenario description, or that corresponding changes may be located in one or more apparatuses different from the implementation scenario. The modules of the implementation scenario may be combined into one module, or may be further split into a plurality of sub-modules.

The foregoing application serial numbers are merely for description, and do not represent advantages or disadvantages of the implementation scenario. The foregoing disclosure is merely a few specific implementations of the present application, but the present application is not limited thereto and any variations that can be considered by a person skilled in the art shall fall within the protection scope of the present application.

Claims

1. A method of text classification, comprising:

acquiring text samples of different text types;

according to a preset phrase comparison table, performing word segmentation processing on text samples contained in the first sample set to extract feature words; counting the number of each feature word, and determining the feature keywords according to the number of each feature word;

training a second text classification model using the second sample set;

classifying the text to be identified according to the first text classifier and the second text classification model;

wherein the classification contribution is calculated according to a classification contribution calculation formula, and the classification contribution calculation formula is that

，

M is the number of text types contained in said first sample set,/o>

Represents the i text type,/->

N is the number of feature keywords, < ->

Represents the j-th feature keyword, ++>

Representing feature keywords +.>

In text type->

Probability of occurrence in text samples of +.>

Representing the text type +.>

The ratio of the number of text samples in said first sample set to the number of all text samples in said first sample set,/for>

The preset coefficient is indicated to be the same as the preset coefficient,

representing the characteristic keyword ++>

For the text type->

The classification contribution of the text sample of (c).

2. The method according to claim 1, wherein constructing a first text classifier according to the classification contribution comprises:

constructing a first text classifier according to a first text classification formula, wherein the first text classification formula is that

，

wherein ,

l is the number of feature words in the sample Y to be predicted, < +.>

K-th feature word representing sample Y to be predicted, < ->

Representing characteristic words ++>

For text type->

Classification contribution of the text sample of (2) if the feature word +.>

And the characteristic keyword->

Identical->

=/>

If the characteristic word->

And the characteristic key words

Identical->

=0。

3. The method according to claim 2, wherein training a second text classification model using the second sample set, in particular comprises:

word segmentation is carried out on the text samples in the second sample set, so that phrases corresponding to each text sample are obtained;

constructing a text vector corresponding to each text sample according to the phrase corresponding to each text sample;

and training the second text classification model by using the text vector and the text type of the text sample corresponding to the text vector, wherein the second text classification model is a convolutional neural network model.

4. A method according to claim 3, wherein classifying the text to be identified according to the first text classifier and the second text classification model comprises:

word segmentation is carried out on the text to be identified, so that a phrase contained in the text to be identified is obtained;

converting the word group contained in the text to be recognized into a word vector corresponding to the text to be recognized;

inputting word vectors corresponding to the texts to be recognized into the second text classification model to obtain the probability that the texts to be recognized belong to each text type contained in the second sample set;

if the maximum value in the probabilities is greater than or equal to a preset classification probability, determining the text type corresponding to the maximum value in the probabilities as the text type of the text to be identified.

5. The method of claim 4, wherein classifying the text to be identified according to the first text classifier and the second text classification model, in particular further comprises:

if the maximum value in the probabilities is smaller than the preset classification probability, determining the classification contribution degree corresponding to the phrase contained in the text to be identified according to the classification contribution degree corresponding to the characteristic key words;

calculating the probability that the text to be recognized belongs to each text type contained in the first sample set according to the classification contribution degree corresponding to the phrase contained in the text to be recognized and the first text classifier;

and determining the text type corresponding to the maximum value in the probability that the text to be identified belongs to each text type contained in the first sample set as the text type of the text to be identified.

6. A text classification device, comprising:

a sample set construction module, configured to divide the text samples into a first sample set and a second sample set according to the number of the text samples of each text type, where the number of the text samples of any text type included in the first sample set is less than a preset threshold value, and the number of the text samples of any text type included in the second sample set is greater than or equal to the preset threshold value;

the keyword extraction module is used for carrying out word segmentation processing on the text samples contained in the first sample set according to a preset phrase comparison table to extract feature words; counting the number of each feature word, and determining the feature keywords according to the number of each feature word;

the classification module is used for classifying the text to be identified according to the first text classifier and the second text classification model;

，