CN111209399A - Text classification method and device and electronic equipment


Info

Publication number
CN111209399A
Authority
CN
China
Prior art keywords
text
classifiers
category
categories
error correction
Prior art date
Legal status
Pending
Application number
CN202010001393.3A
Other languages
Chinese (zh)
Inventor
甄建静
王悦林
Current Assignee
Lenovo Beijing Ltd
Original Assignee
Lenovo Beijing Ltd
Priority date
Filing date
Publication date
Application filed by Lenovo Beijing Ltd filed Critical Lenovo Beijing Ltd
Priority to CN202010001393.3A
Publication of CN111209399A

Classifications

    • G06F 16/35: Clustering; Classification (under G: Physics; G06: Computing, calculating or counting; G06F: Electric digital data processing; G06F 16/00: Information retrieval, database and file system structures therefor; G06F 16/30: of unstructured textual data)
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting (under G06F 18/00: Pattern recognition; G06F 18/20: Analysing; G06F 18/21: Design or setup of recognition systems or techniques, extraction of features in feature space, blind source separation)


Abstract

The application discloses a text classification method, a text classification apparatus and an electronic device, wherein the method comprises the following steps: obtaining a text to be classified; inputting the text into a trained text classification model to obtain probability values output by a plurality of binary classifiers in the text classification model, wherein the probability value output by each binary classifier represents the probability that the text belongs to the positive-example text categories corresponding to that binary classifier; obtaining a constructed error correction coding table, wherein the error correction coding table at least comprises the code bit values between the text categories and the binary classifiers, each code bit value indicating whether a text category belongs to the positive-example text categories corresponding to a binary classifier; determining a target text category satisfying a matching condition from the plurality of text categories according to the probability values output by the plurality of binary classifiers and the code bit values between the text categories and the binary classifiers in the error correction coding table; and classifying the text into the target text category. The scheme of the application can improve text classification accuracy.

Description

Text classification method and device and electronic equipment
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a text classification method and apparatus, and an electronic device.
Background
Text classification refers to classifying and labeling texts according to a certain classification system or standard so as to determine the category to which each text belongs. Text classification has been applied to spam filtering, sentiment analysis and other fields.
However, existing text classification methods generally suffer from poor classification accuracy and fail to achieve a good classification effect.
Disclosure of Invention
The application aims to provide a text classification method and apparatus, and an electronic device, so as to improve the accuracy of text classification.
To achieve this aim, the application provides the following technical solutions:
a method of text classification, comprising:
obtaining a text to be classified;
inputting the text into a trained text classification model to obtain probability values output by a plurality of binary classifiers in the text classification model, wherein the probability value output by each binary classifier represents the probability that the text belongs to the positive-example text categories corresponding to that binary classifier;
obtaining a constructed error correction coding table, wherein the error correction coding table at least comprises: the correspondence of code bit values between a plurality of text categories and the plurality of binary classifiers, each code bit value indicating whether a text category belongs to the positive-example text categories corresponding to a binary classifier;
determining a target text category satisfying a matching condition from the plurality of text categories according to the probability values output by the plurality of binary classifiers and the code bit values between the text categories and the binary classifiers in the error correction coding table;
classifying the text into the target text category.
Preferably, the determining a target text category satisfying a matching condition from the plurality of text categories according to the probability values output by the binary classifiers and the code bit values between the text categories and the binary classifiers in the error correction coding table comprises:
determining, according to the probability values output by the plurality of binary classifiers, a first distribution feature of the text belonging to the positive-example text categories corresponding to the respective binary classifiers;
and determining, from the plurality of text categories, a target text category whose second distribution feature has a similarity to the first distribution feature that satisfies a condition, according to the first distribution feature and the second distribution feature corresponding to each text category in the error correction coding table, wherein the second distribution feature corresponding to a text category is the distribution feature of the code bit values between that text category and the plurality of binary classifiers.
Preferably, the determining, according to the probability values output by the plurality of binary classifiers, a first distribution feature of the text belonging to the positive-example text categories corresponding to the respective binary classifiers comprises:
converting the probability values output by the binary classifiers into coded values one by one according to the conversion relation between probability values and coded values, to obtain a first encoding vector composed of the coded values converted from the probability values output by the plurality of binary classifiers;
the determining, from the plurality of text categories, a target text category whose second distribution feature has a similarity to the first distribution feature that satisfies a condition, according to the first distribution feature and the second distribution feature corresponding to each text category in the error correction coding table, comprises:
constructing a second encoding vector corresponding to each text category according to the error correction coding table, wherein the second encoding vector corresponding to a text category is the vector formed by the code bit values between that text category and the plurality of binary classifiers;
and selecting, from the plurality of text categories, a target text category with the smallest Hamming distance between its second encoding vector and the first encoding vector.
Preferably, the selecting, from the plurality of text categories, a target text category with the smallest Hamming distance between its second encoding vector and the first encoding vector comprises:
if there is exactly one candidate text category among the plurality of text categories whose second encoding vector has the smallest Hamming distance to the first encoding vector, determining that candidate text category as the target text category;
if there are multiple such candidate text categories, selecting the target text category from the multiple candidate text categories in any one of the following ways:
randomly selecting one candidate text category from the candidate text categories as the target text category;
or,
determining from the error correction coding table a preset number of binary classifiers for which each candidate text category belongs to the positive-example text categories, summing the probability values output by those binary classifiers in the text classification model to obtain a positive-example probability sum corresponding to the candidate text category, and determining the candidate text category with the largest positive-example probability sum as the target text category;
or,
subtracting, for each candidate text category, the probability values output by the plurality of binary classifiers from the code bit values between the candidate text category and the plurality of binary classifiers to obtain a plurality of differences, calculating the sum of the absolute values of the differences, and determining the candidate text category with the smallest sum of absolute values as the target text category.
Preferably, the text classification model is a Transformer-based bidirectional encoder (BERT) model.
Preferably, the error correction coding table is constructed during the training of the text classification model, and the text classification model and the error correction coding table are obtained as follows:
acquiring a text training set, wherein the text training set comprises a plurality of text training samples labeled with categories;
constructing an error correction coding table according to the construction rules of error-correcting output codes and the categories labeled on the text training samples in the text training set, wherein, for each binary classifier in the error correction coding table, the ratio of the number of text training samples in the training set belonging to the positive-example text categories of that binary classifier to the number belonging to its negative-example text categories satisfies a first set ratio range;
inputting the text training samples into the text classification model to be trained, to obtain the probability values corresponding to the text training samples output by the plurality of binary classifiers of the text classification model;
determining predicted text categories corresponding to the text training samples from the plurality of text categories according to the probability values corresponding to the text training samples and the code bit values between the text categories and the binary classifiers in the error correction coding table;
detecting whether the prediction accuracy of the text classification model meets the requirement based on the predicted text categories and the actually labeled categories of the plurality of text training samples;
if the prediction accuracy of the text classification model does not meet the requirement, adjusting the internal parameters of the text classification model according to a set loss function, and adjusting the error correction coding table so that the ratio of the number of text training samples belonging to positive-example categories to the number belonging to negative-example categories in the text training set satisfies a second ratio range;
and if the prediction accuracy of the text classification model meets the requirement, obtaining the trained text classification model and the constructed error correction coding table.
In another aspect, the present application further provides a text classification apparatus, comprising:
a text obtaining unit, configured to obtain a text to be classified;
a text prediction unit, configured to input the text into a trained text classification model and obtain probability values output by a plurality of binary classifiers in the text classification model, wherein the probability value output by each binary classifier represents the probability that the text belongs to the positive-example text categories corresponding to that binary classifier;
a table obtaining unit, configured to obtain a constructed error correction coding table, wherein the error correction coding table at least comprises: the correspondence of code bit values between a plurality of text categories and the plurality of binary classifiers, each code bit value indicating whether a text category belongs to the positive-example text categories corresponding to a binary classifier;
a category matching unit, configured to determine a target text category satisfying a matching condition from the plurality of text categories according to the probability values output by the plurality of binary classifiers and the code bit values between the text categories and the binary classifiers in the error correction coding table;
a text classification unit, configured to classify the text into the target text category.
In another aspect, the present application further provides an electronic device, comprising:
a processor and a memory;
the processor is configured to obtain a text to be classified; input the text into a trained text classification model to obtain probability values output by a plurality of binary classifiers in the text classification model, wherein the probability value output by each binary classifier represents the probability that the text belongs to the positive-example text categories corresponding to that binary classifier; obtain a constructed error correction coding table, wherein the error correction coding table at least comprises: the correspondence of code bit values between a plurality of text categories and the plurality of binary classifiers, each code bit value indicating whether a text category belongs to the positive-example text categories corresponding to a binary classifier; determine a target text category satisfying a matching condition from the plurality of text categories according to the probability values output by the plurality of binary classifiers and the code bit values between the text categories and the binary classifiers in the error correction coding table; and classify the text into the target text category;
the memory is configured to store the programs needed by the processor to perform the above operations.
Preferably, when determining a target text category satisfying a matching condition from the plurality of text categories according to the probability values output by the plurality of binary classifiers and the code bit values between the text categories and the binary classifiers in the error correction coding table, the processor is specifically configured to:
determine, according to the probability values output by the plurality of binary classifiers, a first distribution feature of the text belonging to the positive-example text categories corresponding to the respective binary classifiers;
and determine, from the plurality of text categories, a target text category whose second distribution feature has a similarity to the first distribution feature that satisfies a condition, according to the first distribution feature and the second distribution feature corresponding to each text category in the error correction coding table, wherein the second distribution feature corresponding to a text category is the distribution feature of the code bit values between that text category and the plurality of binary classifiers.
Preferably, when determining, according to the probability values output by the plurality of binary classifiers, the first distribution feature of the text belonging to the positive-example text categories corresponding to the respective binary classifiers, the processor is specifically configured to: convert the probability values output by the binary classifiers into coded values one by one according to the conversion relation between probability values and coded values, to obtain a first encoding vector composed of the coded values converted from the probability values output by the plurality of binary classifiers;
when determining, from the plurality of text categories, a target text category whose second distribution feature has a similarity to the first distribution feature that satisfies a condition, according to the first distribution feature and the second distribution feature corresponding to each text category in the error correction coding table, the processor is specifically configured to:
construct a second encoding vector corresponding to each text category according to the error correction coding table, wherein the second encoding vector corresponding to a text category is the vector formed by the code bit values between that text category and the plurality of binary classifiers;
and select, from the plurality of text categories, a target text category with the smallest Hamming distance between its second encoding vector and the first encoding vector.
According to the above scheme, predicting the text to be classified with the trained text classification model yields the probability values predicted by the plurality of binary classifiers in the model. Since the probability value predicted by each binary classifier reflects the probability that the text belongs to the positive-example text categories corresponding to that classifier, combining these values with the pre-constructed error correction coding table, which records whether each text category belongs to the positive-example text categories of each binary classifier, allows the text category whose distribution over the positive-example text categories matches that of the text, i.e. the category to which the text belongs, to be determined from the plurality of text categories. When classifying texts, the application therefore not only uses the trained model but also exploits how each text category in the pre-constructed error correction coding table relates to the positive-example text categories of each binary classifier in the model, which helps classify texts more accurately and improves the text classification effect.
Drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and those skilled in the art can obtain other drawings based on these drawings without creative effort.
Fig. 1 is a schematic flowchart of a text classification method according to an embodiment of the present application;
fig. 2 is a schematic diagram illustrating a composition of an error correction coding table according to an embodiment of the present application;
fig. 3 is a schematic flowchart of another text classification method according to an embodiment of the present application;
FIG. 4 is a schematic diagram illustrating an implementation principle of determining a target text category according to an embodiment of the present application;
FIG. 5 is a schematic flow chart of the construction and training of a text classification model according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a composition of the text classification apparatus provided in the present application;
fig. 7 is a schematic structural diagram of an electronic device provided in the present application.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings described above, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used may be interchanged under appropriate circumstances such that embodiments of the application described herein may be practiced otherwise than as specifically illustrated.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without inventive step, are within the scope of the present disclosure.
As shown in fig. 1, which shows a flowchart of the text classification method of the present application, the method of the present embodiment can be applied to any electronic device with data processing capability.
The method of the embodiment comprises the following steps:
s101, obtaining a text to be classified.
The text to be classified refers to a text that needs to be classified, i.e., whose category is yet to be determined.
The text may include at least one character, for example, the text may be a sentence, a paragraph, or a chapter.
And S102, inputting the text into the trained text classification model to obtain probability values output by a plurality of binary classifiers in the text classification model.
The text classification model is a classification model trained in advance that comprises a plurality of binary classifiers. It is obtained by training based on the constructed error correction coding table and a plurality of text samples labeled with text categories.
The error correction coding table characterizes at least one positive-example text category corresponding to each binary classifier in the text classification model. Of course, it may also characterize at least one negative-example text category corresponding to each binary classifier. Once the error correction coding table is constructed, the positive-example text categories of each binary classifier are determined. The table is described in detail in the following steps.
The text categories corresponding to each binary classifier fall into two main types: positive-example text categories and negative-example text categories. Correspondingly, all the positive-example and negative-example text categories of a binary classifier together make up the full set of text categories, where the total number and kinds of text categories can be set as required.
For example, assuming 10 different text categories are set, namely text category 1 to text category 10, then for one binary classifier the error correction coding table may be constructed to represent that text categories 1, 3, 4 and 5 belong to the positive-example text categories of that binary classifier, while the remaining 6 text categories are its negative-example text categories.
In this embodiment, the probability value output by each binary classifier for the text represents the probability that the text belongs to the positive-example text categories corresponding to that binary classifier. For example, if the positive-example text categories of a binary classifier are represented in the error correction coding table as text categories 1, 2 and 3, and that binary classifier predicts a probability of 0.8 for the text, then the probability that the text belongs to one of those three positive-example text categories is taken to be 0.8.
Optionally, in order to improve the accuracy of text classification, the classification model of the present application may be a Bidirectional Encoder Representations from Transformers (BERT) model, where the Transformer is a model built on an attention mechanism.
Compared with a Recurrent Neural Network (RNN) model and the like, the BERT model can better extract text features, improves parallel computing capability, and captures the long-distance features needed in natural language processing (NLP), which is conducive to more accurate text classification prediction. On the other hand, since BERT can predict words from their context, its classification predictions comprehensively consider the semantics of the text, which also improves the accuracy of text classification.
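As an illustration only, and not an implementation prescribed by the application, such a model might be sketched as a BERT encoder feeding n sigmoid output heads, one per binary classifier. The model name, the libraries (PyTorch and Hugging Face transformers) and the head design below are all assumptions:

```python
import torch
from transformers import BertModel  # assumed dependency, not named in the application

class EcocTextClassifier(torch.nn.Module):
    """BERT encoder with n binary-classifier heads (a sketch, not the patented design)."""
    def __init__(self, n_classifiers: int, model_name: str = "bert-base-chinese"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        # One shared linear layer producing one logit per binary classifier.
        self.heads = torch.nn.Linear(self.bert.config.hidden_size, n_classifiers)

    def forward(self, input_ids, attention_mask):
        pooled = self.bert(input_ids=input_ids, attention_mask=attention_mask).pooler_output
        # Sigmoid turns each logit into the probability that the text belongs
        # to that classifier's positive-example text categories.
        return torch.sigmoid(self.heads(pooled))
```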
S103, obtaining the constructed error correction coding table.
Wherein the error correction coding table at least comprises: the correspondence of code bit values between a plurality of text categories and the plurality of binary classifiers. The code bit value between a text category and a binary classifier indicates whether that text category belongs to the positive-example text categories corresponding to that binary classifier. That is, the code bit value between a text category and a binary classifier represents whether the text category is a positive-example or a negative-example text category of that binary classifier.
Fig. 2 shows a schematic diagram of an error correction coding table.
In fig. 2, the text categories include m categories and the text classification model has n binary classifiers, where m and n are both natural numbers greater than 1. As shown in fig. 2, the m text categories are category 1 to category m, and the n binary classifiers are classifier 1 to classifier n.
In the error correction coding table of fig. 2, the code bit values are either 1 or -1. In the table, each text category has a code bit value corresponding to each classifier.
The code bit value between a text category and a binary classifier indicates whether the text category is a positive-example or a negative-example text category of that classifier: if the code bit value is 1, the text category is a positive-example text category of the binary classifier; if the code bit value is -1, the text category is a negative-example text category of the binary classifier.
In the error correction coding table of fig. 2, the row of each category holds the code bit values of that category for the respective classifiers. For example, the code bit value between category 1 and classifier 1 is -1, so category 1 belongs to the negative-example text categories of classifier 1; and the code bit value between category 1 and classifier 2 is +1, indicating that category 1 belongs to the positive-example text categories of classifier 2.
In the embodiment of the present application, the error correction coding table must be constructed to satisfy set table construction conditions. The table construction conditions may include one or more of the following three conditions (a scoring sketch follows the list):
Condition 1: no code inversion. That is, no two columns of the error correction coding table may be complements of each other. For example, with n binary classifiers (hereinafter "classifiers") and m text categories ("categories"), and taking the columns of classifier 1 and classifier 2 as an example: the m code bit values of classifier 1 for the m categories must not be the bitwise complement of the m code bit values of classifier 2. Suppose m is 3 and the 3 code bit values of classifier 1 for the 3 categories are -1, +1, -1, while those of classifier 2 are +1, -1, +1; then the two columns are complementary and the no-code-inversion condition is violated.
Condition 2: row separation. That is, the Hamming distance between the code bit values of each category and those of the other categories should be as large as possible. For example, the code bit values of each category for the n classifiers form a row vector of length n; the Hamming distance between the row vectors of every pair of the m categories is calculated, and then the mean and standard deviation of these Hamming distances are computed. While constructing the error correction coding table, the code bit values between the categories and the classifiers are continuously adjusted until convergence, aiming for a standard deviation as small as possible and a mean as large as possible.
Condition 3: column separation. The code bit values of each classifier for the m categories form a column vector of length m; the Hamming distance between the column vectors of every pair of the n classifiers is calculated, and then the mean and standard deviation of these Hamming distances are computed. While constructing the error correction coding table, the code bit values between the categories and the classifiers are continuously adjusted until convergence, aiming for a standard deviation as small as possible and a mean as large as possible.
The three conditions above are only the basic requirements that a constructed error correction coding table needs to satisfy.
In the process of training the text classification model together with the error correction coding table, the table is continuously adjusted according to the prediction accuracy of the model, the text categories of the training samples, the set ratio of positive-example to negative-example text categories, and so on, so that the error correction coding table more accurately reflects the relationship between each classifier and the text categories, which in turn allows the category of a text to be classified to be determined accurately later. The training process is described in detail below.
And S104, determining a target text category satisfying the matching condition from the plurality of text categories according to the probability values output by the plurality of binary classifiers and the code bit values between each text category and the plurality of binary classifiers in the error correction coding table.
The matching condition may be set as needed.
For example, the matching condition may be that the distribution of the target text category over the positive-example text categories of the binary classifiers and the distribution represented by the probability values predicted by the binary classifiers in the text classification model satisfy a set relationship, e.g., that the similarity between the two distributions satisfies a set condition, such as being the highest.
It can be understood that there are multiple binary classifiers in the text classification model, and each outputs the probability that the text belongs to its positive-example text categories; the probability values output by the plurality of binary classifiers therefore represent the probability distribution of the text over the positive-example text categories of the respective binary classifiers, which in turn reflects how the text is distributed over those positive-example text categories.
Likewise, the code bit value between each text category and each binary classifier in the error correction coding table indicates whether that text category belongs to the positive-example text categories of that binary classifier, so the code bit values of a text category reflect how that category is distributed over the positive-example text categories of the binary classifiers.
From the above analysis, if the distribution of the text over the positive-example text categories of the binary classifiers is similar to the distribution of some text category over those positive-example text categories, it can be determined that the text belongs to that text category.
For example:
Assume there are 5 binary classifiers in the text classification model, classifier 1 to classifier 5.
Text category 1 belongs to the positive-example text categories of binary classifiers 1 and 2, and to the negative-example text categories of binary classifiers 3, 4 and 5.
Text category 2 belongs to the positive-example text categories of binary classifiers 1 and 3, and to the negative-example text categories of binary classifiers 2, 4 and 5.
Assume that, for a text to be classified, the probability values predicted by the 5 binary classifiers are, in order: 0.6, 0.8, 0.4, 0.2, 0.3. A probability value greater than 0.5 indicates that the binary classifier considers the text highly likely to belong to its positive-example text categories. The probability values predicted by the 5 binary classifiers therefore indicate: binary classifier 1 considers the text to belong to its positive-example text categories; binary classifier 2 likewise; and the other three classifiers predict that the text belongs to their negative-example text categories. Comparing this with how text category 1 and text category 2 are distributed over the positive-example text categories of the classifiers, the prediction result matches the distribution of text category 1 in the error correction coding table, and therefore text category 1 is the target text category.
As an optional manner, the application may determine, according to the probability values output by the binary classifiers, a first distribution feature of the text over the positive-example text categories of the respective binary classifiers. As described above, the probability values output by the binary classifiers represent the probability distribution of the text over the positive-example text categories of the binary classifiers, so a distribution feature reflecting this probability distribution can be constructed from the probability values.
For example, following the rule that a probability value greater than a set value (for example, 0.5) output by a binary classifier means that the text belongs to the positive-example text categories of that classifier, the first distribution feature of the text over the positive-example text categories of the respective binary classifiers can be obtained.
On this basis, according to the first distribution feature and the second distribution feature corresponding to each text category in the error correction coding table, a target text category whose second distribution feature has a similarity to the first distribution feature satisfying a condition can be determined from the plurality of text categories.
The second distribution feature corresponding to a text category is the distribution feature of the code bit values between that text category and the plurality of binary classifiers, and therefore reflects how that text category is distributed over the positive-example text categories of the binary classifiers.
Accordingly, the condition on the similarity between the second distribution feature and the first distribution feature may be that the similarity is the highest, or that it exceeds a set degree or threshold, and so on.
S105, classifying the text into the target text category.
Since the distribution of the target text category over the positive-example text categories of the binary classifiers in the text classification model matches the distribution of the text to be classified over those positive-example text categories, the probability that the text belongs to the target text category is the largest, and the text can therefore be classified into the target text category.
It can thus be seen that predicting the text to be classified with the trained text classification model yields the probability values predicted by the binary classifiers in the model. Since each probability value reflects the probability that the text belongs to the positive-example text categories of the corresponding binary classifier, combining these values with the pre-constructed error correction coding table, which records whether each text category belongs to the positive-example text categories of each binary classifier, allows the text category whose distribution matches that of the text, i.e. the category to which the text belongs, to be determined from the plurality of text categories. When classifying texts, the application therefore not only uses the trained model but also the error correction coding table, which helps classify texts more accurately and improves the classification effect.
For ease of understanding, the process of determining the target category of a text is illustrated for the case in which a first distribution feature of the text over the positive-example text categories of the respective binary classifiers is determined from the probability values output by the binary classifiers, and the target text category is determined by combining the first distribution feature with the second distribution feature corresponding to each text category in the error correction coding table. One way of constructing the first and second distribution features is described below.
As shown in fig. 3, which shows a schematic flow chart of another embodiment of the text classification method of the present application, the method of the present embodiment may include:
s301, obtaining the text to be classified.
S302, inputting the text into the trained text classification model to obtain the probability values output by the plurality of binary classifiers in the text classification model.
S303, obtaining the constructed error correction coding table.
Wherein the error correction coding table at least comprises: the correspondence of code bit values between a plurality of text categories and the plurality of binary classifiers, the code bit value between a text category and a binary classifier indicating whether that text category belongs to the positive-example text categories of that binary classifier.
The above steps S301 to S303 can refer to the related description of the previous embodiment and are not repeated here.
And S304, converting the probability values output by the binary classifiers into coded values one by one according to the conversion relation between probability values and coded values, to obtain a first encoding vector composed of the coded values converted from the probability values output by the binary classifiers.
It can be understood that the probability value output by a classifier reflects the probability that the text belongs to the positive-example text categories of that classifier, and the coded value of a text category in the error correction coding table reflects whether that category is a positive-example or negative-example text category of a given binary classifier; the probability value can therefore be converted into a coded value according to its magnitude relative to a set value.
For example, the conversion relation may be: if the probability value is greater than the set value, it is converted into the coded value representing positive-example text categories in the error correction coding table; if the probability value is not greater than the set value, it is converted into the coded value representing negative-example text categories.
It can be understood that if the probability value output by a binary classifier is greater than the set value, the probability that the text belongs to the positive-example text categories of that classifier is high, so the coded value representing positive-example text categories in the error correction coding table can be used.
For example, based on the coded values in the error correction coding table of fig. 2 and a set value of 0.5: if the probability value is greater than 0.5, the converted coded value is +1; if the probability value is not greater than 0.5, the converted coded value is -1.
Accordingly, the first encoding vector is composed of the coded values converted from the probability values output by the binary classifiers. For example, if the text classification model includes 4 binary classifiers whose output probability values are, in order, 0.8, 0.6, 0.2 and 0.7, then, with the meaning of the coded values in fig. 2 and a set value of 0.5, the first encoding vector is (+1, +1, -1, +1).
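As a one-line sketch of this conversion, using the example probabilities and the 0.5 set value above:

```python
probs = [0.8, 0.6, 0.2, 0.7]                            # outputs of the 4 binary classifiers
first_vector = [+1 if p > 0.5 else -1 for p in probs]   # threshold each probability
print(first_vector)  # [1, 1, -1, 1], i.e. (+1, +1, -1, +1)
```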
S305, constructing a second encoding vector corresponding to each text category according to the error correction coding table.
The second encoding vector corresponding to a text category is the vector formed by the code bit values between that text category and the binary classifiers; that is, it is formed by the code bit values of the row of that text category in the error correction coding table. For example, in fig. 2, assuming there are 4 binary classifiers, i.e. n is 4, and the coded values of text category 1 for the four binary classifiers are, in order, -1, +1, +1, -1, the second encoding vector is (-1, +1, +1, -1).
S306, selecting, from the plurality of text categories, the target text category with the minimum Hamming distance between its second encoding vector and the first encoding vector.
That is, for each text category, the hamming distance between the second encoded vector and the first encoded vector of the text category is calculated separately. Then, the minimum hamming distance is found, and the text category corresponding to the minimum hamming distance is determined as the target text category.
For ease of understanding steps S304 to S306, reference may be made to fig. 4, which shows a schematic diagram of determining the target text category from the error correction coding table and the probability values predicted for the text by each binary classifier in the text classification model.
In fig. 4, the text classification model includes 5 binary classifiers, and there are 4 text categories, category 1 to category 4.
The error correction coding table in fig. 4 shows the coded values of each category for the binary classifiers. The coded values in the row of each category in the table constitute the second encoding vector of that category. For example, the coded values of category 1 for the 5 binary classifiers are +1, +1, -1, +1, +1, so the second encoding vector of category 1 is (+1, +1, -1, +1, +1); the remaining categories are similar.
In fig. 4, the probability values predicted by the 5 classifiers are, in order, 0.4, 0.3, 0.8, 0.2 and 0.6. To determine the first encoding vector, the rule that a probability value greater than 0.5 converts to +1 and otherwise to -1 is applied to the predicted probability values of the 5 classifiers; the resulting coded values are, in order, -1, -1, +1, -1, +1. Accordingly, the first encoding vector is (-1, -1, +1, -1, +1).
Then, the Hamming distance between the first encoding vector and the second encoding vector of each of categories 1 to 4 is calculated in turn: the Hamming distance between the second encoding vector of category 1 and the first encoding vector is 4; that of category 2 is 4; that of category 3 is 1; and that of category 4 is 2.
The category with the smallest Hamming distance is therefore category 3, so it can be confirmed that the text belongs to category 3.
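A minimal sketch of this decoding step is shown below. The first row of the table is the category 1 codeword from fig. 4; the other three rows are not fully given in the text and are assumed here purely for illustration, chosen so that the distances match the example:

```python
import numpy as np

def decode(probs, table, threshold=0.5):
    """Pick the category whose codeword is nearest (in Hamming distance) to the thresholded outputs."""
    first = np.where(np.asarray(probs) > threshold, 1, -1)   # first encoding vector
    distances = np.sum(table != first, axis=1)               # Hamming distance per category
    return int(np.argmin(distances)), distances

table = np.array([
    [+1, +1, -1, +1, +1],   # category 1 (as in fig. 4)
    [+1, -1, -1, +1, -1],   # category 2 (assumed)
    [-1, -1, +1, -1, -1],   # category 3 (assumed)
    [-1, +1, +1, +1, +1],   # category 4 (assumed)
])
best, d = decode([0.4, 0.3, 0.8, 0.2, 0.6], table)
print(best, d)  # index 2, i.e. category 3, with distances [4 4 1 2]
```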
It is to be understood that if there is exactly one text category among the plurality of text categories whose second encoding vector has the smallest Hamming distance to the first encoding vector, that text category is determined as the target text category.
However, in practical applications more than one text category may attain the minimum Hamming distance, in which case a target text category must still be selected from among them. For ease of distinction, a text category whose second encoding vector has the smallest Hamming distance to the first encoding vector is referred to as a candidate text category. In the embodiment of the present application, the target text category may be selected from multiple candidate text categories in several ways (see the sketch after this list).
For example, in one possible way, one candidate text category may be randomly selected from the candidate text categories as the target text category.
For another example, in another possible way, for each candidate text category, a first set number of binary classifiers for which the candidate belongs to the positive-example text categories may be determined from the error correction coding table, and the probability values output by those binary classifiers in the text classification model are summed to obtain the positive-example probability sum of that candidate. The candidate text category with the largest positive-example probability sum is then determined as the target text category.
For example, assume the first set number is 3, the candidate text category is text category 3, and the binary classifiers are classifiers 1 to 5. In the error correction coding table, the coded values of this text category for the 5 classifiers are, in order, +1, -1, +1, +1, +1, so the first 3 binary classifiers with coded value +1, i.e. for which text category 3 belongs to the positive-example text categories, are classifier 1, classifier 3 and classifier 4. The probability values output by classifiers 1, 3 and 4 are added, and the sum is the positive-example probability sum of text category 3. When the set number equals the total number of binary classifiers, the positive-example probability sum adds the probability values of all binary classifiers for which the candidate is a positive example.
It can be understood that the probability of two candidates sharing the same preset number of positive-example binary classifiers is small, and correspondingly equal positive-example probability sums will almost never occur, so a unique target text category can be selected.
In yet another possible way, for each candidate text category, the probability values output by the plurality of binary classifiers are subtracted from the code bit values between the candidate and those binary classifiers to obtain a plurality of differences, and the sum of the absolute values of the differences is calculated. The candidate text category with the smallest sum of absolute values is then determined as the target text category.
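A sketch of the latter two tie-breaking rules is shown below, under the convention that code bit values are +1/-1; the helper names are assumptions, not terms from the application:

```python
import numpy as np

def by_positive_sum(probs, table, candidates, k=None):
    """Tie-break by summing outputs of the (first k) classifiers coded +1 for each candidate."""
    best, best_sum = None, -np.inf
    for c in candidates:
        pos = np.flatnonzero(table[c] == 1)      # classifiers where c is a positive example
        pos = pos[:k] if k is not None else pos  # optionally only the first k of them
        s = float(np.sum(np.asarray(probs)[pos]))
        if s > best_sum:
            best, best_sum = c, s
    return best

def by_l1_distance(probs, table, candidates):
    """Tie-break by the smallest sum of |code bit value - probability| over all classifiers."""
    probs = np.asarray(probs)
    return min(candidates, key=lambda c: float(np.sum(np.abs(table[c] - probs))))
```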
S307, the text is classified into the target text category.
For step S307, reference can be made to the related description above.
In order to improve the accuracy of text classification prediction in the scheme of the application, an error correction coding table can be constructed according to the construction rules of error-correcting output codes; a text classification model is then trained based on the error correction coding table, and the table is continuously adjusted during training, so that the coded values between each classifier and each category in the table fit the text classification model better and the prediction accuracy of each classifier in the model is higher.
For ease of understanding, one case will be described as an example. As shown in fig. 5, which shows a schematic flowchart of training a text classification model according to the present application, the flowchart may include:
s501, acquiring a text training set.
The text training set comprises a plurality of text training samples marked with categories.
Text classification presupposes a large amount of labeled data (labeled text samples), and the common practice is to manually screen the more meaningful data from a large pool for labeling, which is time-consuming and labor-intensive. To obtain a more valuable labeled training set, the application proposes selecting the data to be labeled by information entropy ranking. The specific method is as follows: an initially trained BERT model serving as the text classification model predicts a large number of unlabeled samples, and for each sample the probabilities of the different categories are obtained. Assuming the number of categories is 3 and the three normalized probability values are 0.7, 0.2 and 0.1, the information entropy of that sample is -0.7 log(0.7) - 0.2 log(0.2) - 0.1 log(0.1). (If there are more than 3 categories, e.g. 10, the entropy can be computed over all 10 probability values, or over the n largest probabilities, e.g. n = 10.) The entropy of every sample is obtained in the same way, and finally the samples with the largest entropy are screened from the pool and handed to business personnel for labeling. A sample with large information entropy is called a hard sample; adding hard samples to the training set yields higher accuracy on the texts to be classified.
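A small sketch of this entropy ranking, assuming the natural logarithm and NumPy (neither is specified by the application):

```python
import numpy as np

def entropy(p: np.ndarray) -> float:
    """Shannon entropy of one normalized probability vector (0 log 0 treated as 0)."""
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def pick_hard_samples(prob_matrix: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k unlabeled samples with the largest predictive entropy."""
    ents = np.apply_along_axis(entropy, 1, prob_matrix)
    return np.argsort(ents)[::-1][:k]

# The worked example from the text: probabilities 0.7, 0.2, 0.1.
print(entropy(np.array([0.7, 0.2, 0.1])))  # about 0.802 nats
```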
For ease of distinction, a text used for training is referred to as a text training sample. In the process of training the text classification model, the text category of each training sample is known, so each sample can be labeled with its category.
S502, an error correction coding table is constructed according to the construction rule of the error correction output codes and the categories marked by the text training samples in the text training set.
The construction rule of the error correction output code may include constituent elements of an error correction coding table, such as two classifiers and a text category, and may further include: the meaning of the code values and their specific values.
The error correction coding table constructed based on the construction rule meets the following requirements: for each two-classifier in the error correction coding table, the number proportion of the text training samples belonging to the positive text category of the two-classifier and the text training samples belonging to the negative text category of the two-classifier in the text training set meets a first set proportion range.
For example, when the error correction coding table is first constructed, for each two-classifier, the number ratio of the text training samples belonging to the positive example text category of the two-classifier to the text training samples belonging to the negative example text category of the two-classifier in the text training set satisfies 1: 1. That is, for each two-classifier, the number of text training samples in the text training set that belong to positive examples of classified text is substantially the same as the number of text training samples that belong to negative examples.
For example, assuming that there are 1000 text training samples in the text training set, wherein there are 500 text training samples belonging to the category a1, 300 text training samples belonging to the category a2, and 200 text training samples belonging to the category a3, then for a two-classifier in the ecc table, the category a1 corresponding to the two-classifier may be set as a positive example text category, the category a2 is a negative example text category, and the category a3 is a negative example text category.
Of course, in the subsequent process of training the text classification model, if the error correction coding table needs to be adjusted, the first set proportion range may also be changed accordingly.
It can be understood that the constructed error correction coding table also needs to satisfy the table construction conditions, which may be the aforementioned condition 1, condition 2 and condition 3. Therefore, after the error correction coding table is constructed, if it is detected that the error correction coding table does not conform to the table construction conditions, the error correction coding table needs to be adjusted before the subsequent steps can be executed.
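By way of illustration, the following sketch checks the class-balance requirement above for a candidate error correction coding table; representing the table as rows of 0/1 code bit values, the helper names, and the tolerance around the ideal 1:1 proportion are all assumptions made for the example:

    # rows: text categories; columns: binary classifiers
    # a code bit value of 1 means the category is the positive example
    # text category of that binary classifier
    ecc_table = {
        "A1": [1, 0, 1],
        "A2": [0, 1, 1],
        "A3": [0, 0, 0],
    }
    sample_counts = {"A1": 500, "A2": 300, "A3": 200}

    def balance(col):
        pos = sum(n for c, n in sample_counts.items() if ecc_table[c][col] == 1)
        neg = sum(n for c, n in sample_counts.items() if ecc_table[c][col] == 0)
        return pos / neg if neg else float("inf")

    # assume a first set proportion range of [0.8, 1.25] around 1:1
    for col in range(3):
        r = balance(col)
        print(f"binary classifier {col}: positive/negative = {r:.2f}",
              "within range" if 0.8 <= r <= 1.25 else "table needs adjustment")

Here the first binary classifier (500 positive example samples against 500 negative) meets the 1:1 proportion, while the other two columns would trigger an adjustment of the table.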
S503, inputting the text training sample into the text classification model to be trained, and obtaining the probability values corresponding to the text training sample output by the plurality of binary classifiers of the text classification model.

S504, determining a predicted text category corresponding to the text training sample from the plurality of text categories according to the probability values corresponding to the text training sample output by the plurality of binary classifiers and the code bit values corresponding to the text categories and the plurality of binary classifiers in the error correction coding table.
The process of determining the predicted text category corresponding to the text training sample is similar to the process of determining the target text category to which the text belongs based on the text probability value and the error correction coding table, and is not repeated here.
S505, detecting whether the prediction accuracy of the text classification model meets the requirement based on the predicted text categories and the actually labeled categories of a plurality of text training samples; if so, ending the training to obtain the trained text classification model and the constructed error correction coding table; if not, executing step S506.

S506, if the prediction accuracy of the text classification model does not meet the requirement, adjusting the internal parameters of the text classification model according to a set loss function, adjusting the error correction coding table so that the proportion between the number of text training samples in the text training set belonging to positive example categories and the number belonging to negative example categories satisfies a second proportion range, and returning to step S503.
The loss function may be set as needed, which is not limited in this application.
As an alternative, the present application may modify the weights of the different categories in the loss function either statically or dynamically. For example, each category normally carries a weight of 1.0 in the loss function, but when certain categories are particularly important, the training samples of those categories are given a larger weight. A specific implementation is to increase the penalty on minority categories by modifying the weights in the loss function, thereby emphasizing the importance of the minority categories and obtaining a model with better performance.
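As a sketch of such weighting (assuming PyTorch and a weight value chosen purely for illustration; the application does not prescribe a particular loss):

    import torch
    import torch.nn as nn

    # one binary classifier head; pos_weight > 1.0 increases the penalty
    # when a sample of the (minority) positive example category is misclassified
    pos_weight = torch.tensor([3.0])          # assumed value for illustration
    loss_fn = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

    logits = torch.tensor([0.2, -1.5, 0.8])   # raw outputs of one binary classifier
    targets = torch.tensor([1.0, 0.0, 1.0])   # 1 = positive example text category
    print(loss_fn(logits, targets))

Making pos_weight depend on, for example, the running per-category accuracy would give the dynamic variant mentioned above.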
In another aspect, the present application further provides a text classification device, as shown in fig. 6, which shows a schematic structural diagram of a text classification device of the present application, and the device of this embodiment may include:
a text obtaining unit 601, configured to obtain a text to be classified;
a text prediction unit 602, configured to input the text into a trained text classification model and obtain the probability values output by a plurality of binary classifiers in the text classification model, where the probability value output by a binary classifier represents the probability that the text belongs to the positive example text category corresponding to that binary classifier;

a table obtaining unit 603, configured to obtain a constructed error correction coding table, where the error correction coding table at least includes correspondences between a plurality of text categories and the code bit values of the plurality of binary classifiers, a code bit value being used for indicating whether a text category belongs to the positive example text category corresponding to a binary classifier;

a category matching unit 604, configured to determine, according to the probability values output by the plurality of binary classifiers and the code bit values corresponding to the text categories and the plurality of binary classifiers in the error correction coding table, a target text category that meets a matching condition from the plurality of text categories;
a text classification unit 605 configured to classify the text into the target text category.
In a possible implementation manner, the category matching unit includes:
a feature determination unit, configured to determine, according to the probability values output by the plurality of binary classifiers, a first distribution feature of the text belonging to the positive example text categories corresponding to the respective binary classifiers;

and a category determination unit, configured to determine, from the plurality of text categories according to the first distribution feature and the second distribution features corresponding to the text categories in the error correction coding table, a target text category whose second distribution feature has a similarity to the first distribution feature that satisfies a condition, where the second distribution feature corresponding to a text category is the distribution feature of the code bit values between the text category and the plurality of binary classifiers.

Optionally, the feature determination unit includes:

a coding conversion unit, configured to sequentially convert the probability values output by the plurality of binary classifiers into coded values according to a conversion relation between probability values and coded values, to obtain a first coding vector composed of the plurality of coded values converted from the probability values output by the plurality of binary classifiers;
the category determination unit includes:
a vector construction unit, configured to construct, according to the error correction coding table, the second coding vectors corresponding to the text categories, where the second coding vector corresponding to a text category is a vector composed of the code bit values between the text category and the plurality of binary classifiers;

and a category selection unit, configured to select, from the plurality of text categories, a target text category whose second coding vector has the smallest Hamming distance to the first coding vector, as sketched below.
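A minimal sketch of this decoding step follows; the 0.5 threshold used as the conversion relation and the table contents are assumptions made for the example:

    # probability values output by the binary classifiers for one text
    probs = [0.9, 0.2, 0.7]

    # convert each probability into a coded value (assumed rule: 1 if p >= 0.5, else 0)
    first_vector = [1 if p >= 0.5 else 0 for p in probs]

    # second coding vectors taken row by row from the error correction coding table
    ecc_table = {"A1": [1, 0, 1], "A2": [0, 1, 1], "A3": [0, 0, 0]}

    def hamming(u, v):
        # number of code bit positions in which the two vectors differ
        return sum(a != b for a, b in zip(u, v))

    target = min(ecc_table, key=lambda c: hamming(first_vector, ecc_table[c]))
    print(target)   # A1: its second coding vector [1, 0, 1] matches exactly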
Optionally, the category selection unit includes:

a first category selection unit, configured to, if there is a single candidate text category among the plurality of text categories whose second coding vector has the smallest Hamming distance to the first coding vector, determine that candidate text category as the target text category;

and a second category selection unit, configured to, if there are multiple such candidate text categories, select the target text category from the multiple candidate text categories in any one of the following manners (the two non-random manners are illustrated in the sketch after this list):

randomly selecting one candidate text category from the multiple candidate text categories as the target text category;

alternatively,

determining, from the error correction coding table, a preset number of binary classifiers for which the candidate text category belongs to the positive example text category, summing the probability values output by the preset number of binary classifiers in the text classification model to obtain a positive example probability sum corresponding to each candidate text category, and determining the candidate text category with the largest positive example probability sum as the target text category;

alternatively,

taking the differences between the code bit values corresponding to the candidate text category and the plurality of binary classifiers and the probability values output by the plurality of binary classifiers, respectively, to obtain a plurality of difference values, calculating the sum of the absolute values of the difference values, and determining the candidate text category with the smallest sum of absolute values as the target text category.
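For illustration, the two non-random tie-breaking manners might be implemented as follows (a sketch reusing the assumed table layout above, with A1 and A2 tied at the same Hamming distance):

    probs = [0.9, 0.6, 0.7]
    ecc_table = {"A1": [1, 0, 1], "A2": [0, 1, 1]}
    candidates = ["A1", "A2"]   # both at Hamming distance 1 from [1, 1, 1]

    # manner 1: largest positive example probability sum
    def positive_sum(category):
        return sum(p for p, bit in zip(probs, ecc_table[category]) if bit == 1)

    # manner 2: smallest sum of absolute differences between
    # code bit values and probability values
    def abs_diff_sum(category):
        return sum(abs(bit - p) for p, bit in zip(probs, ecc_table[category]))

    print(max(candidates, key=positive_sum))   # A1 (1.6 vs 1.3)
    print(min(candidates, key=abs_diff_sum))   # A1 (1.0 vs 1.6)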
Optionally, the apparatus of the present application further includes: a model and table constructing unit, configured to obtain the text classification model and the error correction coding table in the following manner:
acquiring a text training set, wherein the text training set comprises a plurality of text training samples marked with categories;
constructing an error correction coding table according to a construction rule of error correction output codes and the categories labeled for the text training samples in the text training set, wherein, for each binary classifier in the error correction coding table, the proportion between the number of text training samples in the text training set belonging to the positive example text category of the binary classifier and the number belonging to the negative example text category of the binary classifier falls within a first set proportion range;

inputting the text training samples into a text classification model to be trained to obtain the probability values corresponding to the text training samples output by the plurality of binary classifiers of the text classification model;

determining, from the plurality of text categories, predicted text categories corresponding to the text training samples according to the probability values corresponding to the text training samples and the code bit values corresponding to the text categories and the plurality of binary classifiers in the error correction coding table;
detecting whether the prediction accuracy of the text classification model meets the requirement or not based on the predicted text categories and the actually labeled categories of the plurality of text training samples;
if the prediction accuracy of the text classification model does not meet the requirement, adjusting the internal parameters of the text classification model according to a set loss function, and adjusting the error correction coding table so that the proportion between the number of text training samples in the text training set belonging to positive example categories and the number belonging to negative example categories satisfies a second proportion range;
and if the prediction accuracy of the text classification model meets the requirement, obtaining the trained text classification model and the constructed error correction coding table.
In another aspect, the present application further provides an electronic device. Fig. 7 shows a schematic diagram of the component structure of an electronic device according to the present application; the electronic device includes at least a processor 701 and a memory 702.
The processor 701 is configured to: obtain a text to be classified; input the text into a trained text classification model to obtain probability values output by a plurality of binary classifiers in the text classification model, where the probability value output by a binary classifier represents the probability that the text belongs to the positive example text category corresponding to that binary classifier; obtain a constructed error correction coding table, where the error correction coding table at least includes correspondences between a plurality of text categories and the code bit values of the plurality of binary classifiers, a code bit value being used for indicating whether a text category belongs to the positive example text category corresponding to a binary classifier; determine a target text category meeting a matching condition from the plurality of text categories according to the probability values output by the plurality of binary classifiers and the code bit values corresponding to the text categories and the plurality of binary classifiers in the error correction coding table; and classify the text into the target text category;
the memory 702 is used for storing programs needed by the processor to perform the above operations.
Of course, the electronic device may further include an input unit, a display unit, a communication module, and the like, which is not limited herein.
Optionally, when determining, according to the probability values output by the plurality of binary classifiers and the code bit values corresponding to the text categories and the plurality of binary classifiers in the error correction coding table, a target text category meeting the matching condition from the plurality of text categories, the processor is specifically configured to:

determine, according to the probability values output by the plurality of binary classifiers, a first distribution feature of the text belonging to the positive example text categories corresponding to the respective binary classifiers;

and determine, from the plurality of text categories according to the first distribution feature and the second distribution features corresponding to the text categories in the error correction coding table, a target text category whose second distribution feature has a similarity to the first distribution feature that satisfies a condition, where the second distribution feature corresponding to a text category is the distribution feature of the code bit values between the text category and the plurality of binary classifiers.

Optionally, when determining, according to the probability values output by the plurality of binary classifiers, the first distribution feature of the text belonging to the positive example text categories corresponding to the respective binary classifiers, the processor is specifically configured to: sequentially convert the probability values output by the plurality of binary classifiers into coded values according to the conversion relation between probability values and coded values, to obtain a first coding vector composed of the plurality of coded values converted from the probability values output by the plurality of binary classifiers;

and when determining, according to the first distribution feature and the second distribution features corresponding to the text categories in the error correction coding table, a target text category whose second distribution feature has a similarity to the first distribution feature that satisfies the condition from the plurality of text categories, the processor is specifically configured to:

construct, according to the error correction coding table, the second coding vectors corresponding to the text categories, where the second coding vector corresponding to a text category is a vector composed of the code bit values between the text category and the plurality of binary classifiers;

and select, from the plurality of text categories, a target text category whose second coding vector has the smallest Hamming distance to the first coding vector.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method of text classification, comprising:
obtaining a text to be classified;
inputting the text into a trained text classification model to obtain probability values output by a plurality of binary classifiers in the text classification model, wherein the probability value output by a binary classifier represents the probability that the text belongs to the positive example text category corresponding to that binary classifier;

obtaining a constructed error correction coding table, wherein the error correction coding table at least comprises correspondences between a plurality of text categories and the code bit values of the plurality of binary classifiers, a code bit value being used for indicating whether a text category belongs to the positive example text category corresponding to a binary classifier;

determining a target text category meeting a matching condition from the plurality of text categories according to the probability values output by the plurality of binary classifiers and the code bit values corresponding to the text categories and the plurality of binary classifiers in the error correction coding table;
classifying the text into the target text category.
2. The method of claim 1, wherein the determining a target text category meeting a matching condition from the plurality of text categories according to the probability values output by the plurality of binary classifiers and the code bit values corresponding to the text categories and the plurality of binary classifiers in the error correction coding table comprises:

determining, according to the probability values output by the plurality of binary classifiers, a first distribution feature of the text belonging to the positive example text categories corresponding to the respective binary classifiers;

and determining, from the plurality of text categories according to the first distribution feature and the second distribution features corresponding to the text categories in the error correction coding table, a target text category whose second distribution feature has a similarity to the first distribution feature that satisfies a condition, wherein the second distribution feature corresponding to a text category is the distribution feature of the code bit values between the text category and the plurality of binary classifiers.
3. The method of claim 2, wherein the determining, according to the probability values output by the plurality of binary classifiers, a first distribution feature of the text belonging to the positive example text categories corresponding to the respective binary classifiers comprises:

sequentially converting the probability values output by the plurality of binary classifiers into coded values according to a conversion relation between probability values and coded values, to obtain a first coding vector composed of the plurality of coded values converted from the probability values output by the plurality of binary classifiers;

and the determining, from the plurality of text categories according to the first distribution feature and the second distribution features corresponding to the text categories in the error correction coding table, a target text category whose second distribution feature has a similarity to the first distribution feature that satisfies the condition comprises:

constructing, according to the error correction coding table, the second coding vectors corresponding to the text categories, wherein the second coding vector corresponding to a text category is a vector composed of the code bit values between the text category and the plurality of binary classifiers;

and selecting, from the plurality of text categories, a target text category whose second coding vector has the smallest Hamming distance to the first coding vector.
4. The method of claim 3, wherein the selecting, from the plurality of text categories, a target text category whose second coding vector has the smallest Hamming distance to the first coding vector comprises:

if there is a single candidate text category among the plurality of text categories whose second coding vector has the smallest Hamming distance to the first coding vector, determining that candidate text category as the target text category;

if there are multiple such candidate text categories, selecting the target text category from the multiple candidate text categories in any one of the following manners:

randomly selecting one candidate text category from the multiple candidate text categories as the target text category;

alternatively,

determining, from the error correction coding table, a preset number of binary classifiers for which the candidate text category belongs to the positive example text category, summing the probability values output by the preset number of binary classifiers in the text classification model to obtain a positive example probability sum corresponding to each candidate text category, and determining the candidate text category with the largest positive example probability sum as the target text category;

alternatively,

taking the differences between the code bit values corresponding to the candidate text category and the plurality of binary classifiers and the probability values output by the plurality of binary classifiers, respectively, to obtain a plurality of difference values, calculating the sum of the absolute values of the difference values, and determining the candidate text category with the smallest sum of absolute values as the target text category.
5. The method of claim 1, wherein the text classification model is a BERT (Bidirectional Encoder Representations from Transformers) model.
6. The method according to claim 1 or 5, wherein the error correction coding table is constructed during the training of the text classification model, and the text classification model and the error correction coding table are obtained by:
acquiring a text training set, wherein the text training set comprises a plurality of text training samples marked with categories;
constructing an error correction coding table according to a construction rule of error correction output codes and the categories labeled for the text training samples in the text training set, wherein, for each binary classifier in the error correction coding table, the proportion between the number of text training samples in the text training set belonging to the positive example text category of the binary classifier and the number belonging to the negative example text category of the binary classifier falls within a first set proportion range;

inputting the text training samples into a text classification model to be trained to obtain probability values corresponding to the text training samples output by a plurality of binary classifiers of the text classification model;

determining, from the plurality of text categories, predicted text categories corresponding to the text training samples according to the probability values corresponding to the text training samples and the code bit values corresponding to the text categories and the plurality of binary classifiers in the error correction coding table;

detecting whether the prediction accuracy of the text classification model meets a requirement based on the predicted text categories and the actually labeled categories of the plurality of text training samples;

if the prediction accuracy of the text classification model does not meet the requirement, adjusting the internal parameters of the text classification model according to a set loss function, and adjusting the error correction coding table so that the proportion between the number of text training samples in the text training set belonging to positive example categories and the number belonging to negative example categories satisfies a second proportion range;
and if the prediction accuracy of the text classification model meets the requirement, obtaining the trained text classification model and the constructed error correction coding table.
7. A text classification apparatus comprising:
a text obtaining unit, used for obtaining a text to be classified;

a text prediction unit, used for inputting the text into a trained text classification model to obtain probability values output by a plurality of binary classifiers in the text classification model, wherein the probability value output by a binary classifier represents the probability that the text belongs to the positive example text category corresponding to that binary classifier;

a table obtaining unit, used for obtaining a constructed error correction coding table, wherein the error correction coding table at least comprises correspondences between a plurality of text categories and the code bit values of the plurality of binary classifiers, a code bit value being used for indicating whether a text category belongs to the positive example text category corresponding to a binary classifier;

a category matching unit, used for determining a target text category meeting a matching condition from the plurality of text categories according to the probability values output by the plurality of binary classifiers and the code bit values of the text categories and the plurality of binary classifiers in the error correction coding table;
a text classification unit for classifying the text into the target text category.
8. An electronic device, comprising:
a processor and a memory;
the processor is used for: obtaining a text to be classified; inputting the text into a trained text classification model to obtain probability values output by a plurality of binary classifiers in the text classification model, wherein the probability value output by a binary classifier represents the probability that the text belongs to the positive example text category corresponding to that binary classifier; obtaining a constructed error correction coding table, wherein the error correction coding table at least comprises correspondences between a plurality of text categories and the code bit values of the plurality of binary classifiers, a code bit value being used for indicating whether a text category belongs to the positive example text category corresponding to a binary classifier; determining a target text category meeting a matching condition from the plurality of text categories according to the probability values output by the plurality of binary classifiers and the code bit values corresponding to the text categories and the plurality of binary classifiers in the error correction coding table; and classifying the text into the target text category;
the memory is used for storing programs needed by the processor to execute the above operations.
9. The electronic device of claim 8, wherein, when determining, according to the probability values output by the plurality of binary classifiers and the code bit values corresponding to the text categories and the plurality of binary classifiers in the error correction coding table, a target text category meeting the matching condition from the plurality of text categories, the processor is specifically configured to:

determine, according to the probability values output by the plurality of binary classifiers, a first distribution feature of the text belonging to the positive example text categories corresponding to the respective binary classifiers;

and determine, from the plurality of text categories according to the first distribution feature and the second distribution features corresponding to the text categories in the error correction coding table, a target text category whose second distribution feature has a similarity to the first distribution feature that satisfies a condition, wherein the second distribution feature corresponding to a text category is the distribution feature of the code bit values between the text category and the plurality of binary classifiers.

10. The electronic device of claim 9, wherein, when determining, according to the probability values output by the plurality of binary classifiers, the first distribution feature of the text belonging to the positive example text categories corresponding to the respective binary classifiers, the processor is specifically configured to: sequentially convert the probability values output by the plurality of binary classifiers into coded values according to the conversion relation between probability values and coded values, to obtain a first coding vector composed of the plurality of coded values converted from the probability values output by the plurality of binary classifiers;

and when determining, according to the first distribution feature and the second distribution features corresponding to the text categories in the error correction coding table, a target text category whose second distribution feature has a similarity to the first distribution feature that satisfies the condition from the plurality of text categories, the processor is specifically configured to:

construct, according to the error correction coding table, the second coding vectors corresponding to the text categories, wherein the second coding vector corresponding to a text category is a vector composed of the code bit values between the text category and the plurality of binary classifiers;

and select, from the plurality of text categories, a target text category whose second coding vector has the smallest Hamming distance to the first coding vector.
CN202010001393.3A 2020-01-02 2020-01-02 Text classification method and device and electronic equipment Pending CN111209399A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010001393.3A CN111209399A (en) 2020-01-02 2020-01-02 Text classification method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010001393.3A CN111209399A (en) 2020-01-02 2020-01-02 Text classification method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN111209399A true CN111209399A (en) 2020-05-29

Family

ID=70789572

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010001393.3A Pending CN111209399A (en) 2020-01-02 2020-01-02 Text classification method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN111209399A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112860889A (en) * 2021-01-29 2021-05-28 太原理工大学 BERT-based multi-label classification method
CN113592649A (en) * 2021-07-28 2021-11-02 北京易华录信息技术股份有限公司 Data asset value determination method and device and electronic equipment

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103426004A (en) * 2013-07-04 2013-12-04 西安理工大学 Vehicle type recognition method based on error correction output code
CN105955955A (en) * 2016-05-05 2016-09-21 东南大学 Disambiguation-free unsupervised part-of-speech tagging method based on error-correcting output codes

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
刘良斌 et al.: "Research on text classifiers based on support vector machines and output coding", Computer Applications (计算机应用) *
李建武; 魏海周; 宋玉龙: "A minimum enclosing ball model for ECOC multi-classifier implementation", Journal of Computer Research and Development (计算机研究与发展)
饶倩; 喻文; 毛祺琦; 文红; 苏伟伟: "A survey of error-correcting output codes", Computer Knowledge and Technology (电脑知识与技术)


Similar Documents

Publication Publication Date Title
CN111897908B (en) Event extraction method and system integrating dependency information and pre-training language model
CN108959246B (en) Answer selection method and device based on improved attention mechanism and electronic equipment
CN110580292B (en) Text label generation method, device and computer readable storage medium
CN110309511B (en) Shared representation-based multitask language analysis system and method
CN112036168B (en) Event main body recognition model optimization method, device, equipment and readable storage medium
CN112101010B (en) Telecom industry OA office automation manuscript auditing method based on BERT
CN112000771B (en) Judicial public service-oriented sentence pair intelligent semantic matching method and device
CN112910690A (en) Network traffic prediction method, device and equipment based on neural network model
CN112925794B (en) Complex multi-table SQL generation method and device based on bridging filling
CN115455171B (en) Text video mutual inspection rope and model training method, device, equipment and medium
CN112395421B (en) Course label generation method and device, computer equipment and medium
CN116719520B (en) Code generation method and device
CN111209399A (en) Text classification method and device and electronic equipment
CN113435208A (en) Student model training method and device and electronic equipment
CN112883730A (en) Similar text matching method and device, electronic equipment and storage medium
WO2021217866A1 (en) Method and apparatus for ai interview recognition, computer device and storage medium
CN111985250A (en) Model training method, device and system and computer readable storage medium
CN114610871B (en) Information system modeling analysis method based on artificial intelligence algorithm
CN113222059B (en) Multi-label emotion classification method using cooperative neural network chain
CN113011162B (en) Reference digestion method, device, electronic equipment and medium
CN115269998A (en) Information recommendation method and device, electronic equipment and storage medium
KR20230116143A (en) Counseling Type Classification System
CN113835739A (en) Intelligent prediction method for software defect repair time
CN114547256B (en) Text semantic matching method and device for intelligent question and answer of fire safety knowledge
CN110472243A (en) A kind of Chinese spell checking methods

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination