CN114186065B - Classification result correction method, system, device and medium - Google Patents

Classification result correction method, system, device and medium

Info

Publication number
CN114186065B
Authority
CN
China
Prior art keywords
probability
data
category
class
sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210133548.8A
Other languages
Chinese (zh)
Other versions
CN114186065A (en)
Inventor
刘红丽
李峰
于彤
周镇镇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd filed Critical Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202210133548.8A
Publication of CN114186065A
Application granted
Publication of CN114186065B
Priority to PCT/CN2022/122302
Legal status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/237 - Lexical tools
    • G06F40/247 - Thesauruses; Synonyms

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a classification result correction method, which comprises the following steps: constructing a data set and labeling each data in the data set with a classification label of the corresponding category; inputting each data in the data set into a trained model to obtain the probability of the corresponding classification label, and calculating a correction matrix by using the classification label probability corresponding to each data; expanding the classification label of each category into a plurality of sub-labels; adjusting the output of the trained model to be the probabilities of the plurality of sub-labels corresponding to each category; inputting data to be classified into the trained model to obtain the probabilities of the plurality of sub-labels corresponding to each category; and determining the final category of the data to be classified by using the probabilities of the plurality of sub-labels corresponding to each category and the correction matrix. The invention also discloses a system, a computer device, and a readable storage medium. By expanding the labels, the scheme provided by the invention eliminates the bias caused by the labels' different occurrence frequencies.

Description

Classification result correction method, system, device and medium
Technical Field
The present invention relates to the field of classification, and in particular, to a method, system, device, and storage medium for correcting a classification result.
Background
The core capabilities of large-scale models are zero-shot and few-shot learning; that is, the model does not need to be retrained when facing different application tasks. However, large-scale models can acquire bias from the corpus during pre-training, resulting in low accuracy or unstable performance on downstream tasks. The existing solution is to compensate for the biased label words through content-free (empty-text) input, calibrating them to an unbiased state and reducing the differences between different prompt options. However, because the labels occur with different frequencies in the pre-training corpus, the model retains a preference in its prediction results, that is, the output accuracy of the model remains low. The existing correction method can therefore only correct the model's bias toward the labels and cannot correct the bias introduced by the input samples.
Disclosure of Invention
In view of the above, in order to overcome at least one aspect of the above problems, an embodiment of the present invention provides a classification result correction method, including the following steps:
constructing a data set and labeling each data in the data set with a classification label of a corresponding category;
inputting each data in the data set into a trained model to obtain the probability of the corresponding classification label and calculating a correction matrix by using the probability of the classification label corresponding to each data;
expanding the classification label of each category into a plurality of sub-labels;
adjusting the output of the trained model to be the probability of a plurality of sub-labels corresponding to each category;
inputting data to be classified into a trained model to obtain the probability of a plurality of sub-labels corresponding to each class;
and determining the final class of the data to be classified by utilizing the probability of the plurality of sub-labels corresponding to each class and the correction matrix.
In some embodiments, inputting each data in the data set into the trained model to obtain the probability of the corresponding classification label and calculating a correction matrix by using the classification label probability corresponding to each data further comprises:
summing and averaging the classification label probabilities by category to obtain the probability corresponding to each category;
normalizing the probabilities corresponding to the categories and constructing a diagonal matrix from them;
and inverting the diagonal matrix to obtain the correction matrix.
In some embodiments, expanding the classification label of each category into a plurality of sub-labels further comprises:
obtaining a plurality of near-synonyms corresponding to the classification label of each category by using a preset model;
and screening a preset number of words from the plurality of near-synonyms corresponding to each category as the plurality of sub-labels corresponding to that category.
In some embodiments, screening a preset number of words from the plurality of near-synonyms corresponding to each category as the plurality of sub-labels corresponding to that category further comprises:
deleting, from the plurality of near-synonyms, the words that are not present in the vocabulary of the trained model;
adjusting the output of the trained model to the probabilities of the remaining near-synonyms;
inputting each data in the data set into the trained model to obtain the probabilities of the remaining near-synonyms;
deleting, according to the probabilities of the remaining near-synonyms output by the trained model, the words whose probability is lower than a first threshold;
and, among remaining near-synonyms whose probabilities differ by less than a second threshold, keeping only the word with the highest probability, and then selecting the preset number of words with the highest probabilities as the plurality of sub-labels corresponding to each category.
In some embodiments, determining the final category of the data to be classified by using the probabilities of the plurality of sub-labels corresponding to each category and the correction matrix further includes:
averaging the probabilities of the plurality of sub-labels by category, multiplying the average corresponding to each category by the correction matrix to obtain a corrected first probability, and taking the category with the maximum first probability as the classification category of the data.
In some embodiments, determining the final category of the data to be classified by using the probabilities of the plurality of sub-labels corresponding to each category and the correction matrix further includes:
multiplying the maximum of the probabilities of the plurality of sub-labels corresponding to each category by the correction matrix to obtain a corrected second probability, and taking the category corresponding to the sub-label with the maximum second probability as a second classification category of the data.
In some embodiments, determining the final category of the data to be classified by using the probabilities of the plurality of sub-labels corresponding to each category and the correction matrix further includes:
multiplying the probability of each of the plurality of sub-labels corresponding to each category by the correction matrix, averaging the corrected probabilities by category as a corrected third probability, and taking the category with the maximum third probability as a third classification category of the data.
Based on the same inventive concept, according to another aspect of the present invention, an embodiment of the present invention further provides a classification result correction system, including:
the building module is configured to build a data set and label each data in the data set with a classification label of a corresponding category;
a calculation module configured to input each data in the data set into a trained model to obtain a probability of the corresponding classification label and calculate a correction matrix using the classification label probability corresponding to each data;
an expansion module configured to expand the classification label of each category into a plurality of sub-labels;
an adjustment module configured to adjust an output of the trained model to probabilities of a plurality of sub-labels corresponding to each category;
the input module is configured to input the data to be classified into the trained model to obtain the probability of a plurality of sub-labels corresponding to each class;
and the correction module is configured to determine the final category of the data to be classified by using the probability of the plurality of sub-labels corresponding to each category and the correction matrix.
Based on the same inventive concept, according to another aspect of the present invention, an embodiment of the present invention further provides a computer apparatus, including:
at least one processor; and
a memory storing a computer program operable on the processor, wherein the processor, when executing the program, performs the steps of any one of the classification result correction methods described above.
Based on the same inventive concept, according to another aspect of the present invention, an embodiment of the present invention further provides a computer-readable storage medium storing a computer program which, when executed by a processor, performs the steps of any one of the classification result correction methods described above.
The invention has at least the following beneficial technical effect: in the scheme provided by the invention, expanding the labels eliminates the bias caused by the labels' different occurrence frequencies in the pre-training corpus, and replacing the empty text with training set samples corrects the bias introduced by both the label words and the input samples.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present invention, and that those skilled in the art can obtain other drawings from these drawings without creative effort.
Fig. 1 is a schematic flow chart of a classification result correction method according to an embodiment of the present invention;
FIG. 2 is a flow diagram of tag expansion provided by an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a classification result correction system according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a computer device provided in an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a computer-readable storage medium according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention are described in further detail with reference to the accompanying drawings.
It should be noted that the expressions "first" and "second" in the embodiments of the present invention are used only to distinguish two entities or parameters that share the same name but are not identical; "first" and "second" are merely for convenience of description, should not be construed as limiting the embodiments of the present invention, and are not explained again in the following embodiments.
According to an aspect of the present invention, an embodiment of the present invention provides a classification result correction method, as shown in fig. 1, which may include the steps of:
s1, constructing a data set and labeling each data in the data set with a classification label of a corresponding category;
s2, inputting each data in the data set into a trained model to obtain the probability of the corresponding classification label and calculating a correction matrix by using the probability of the classification label corresponding to each data;
s3, expanding the classification label of each category into a plurality of sub-labels;
s4, adjusting the output of the trained model to be the probability of a plurality of sub-labels corresponding to each category;
s5, inputting the data to be classified into the trained model to obtain the probability of a plurality of sub-labels corresponding to each class;
s6, determining the final category of the data to be classified by using the probability of the plurality of sub-labels corresponding to each category and the correction matrix.
In the scheme provided by the invention, expanding the labels eliminates the bias caused by the labels' different occurrence frequencies in the pre-training corpus, and replacing the empty text with training set samples corrects the bias introduced by both the label words and the input samples.
In some embodiments, step S2, inputting each data in the data set into a trained model to obtain the probability of the corresponding classification label and calculating a correction matrix using the classification label probability corresponding to each data, further includes:
summing and averaging the classification label probabilities by category to obtain the probability corresponding to each category;
normalizing the probabilities corresponding to the categories and constructing a diagonal matrix from them;
and inverting the diagonal matrix to obtain the correction matrix.
Specifically, after each data in the training set is input into the model (a pre-trained language model, PLM), the label probability of its corresponding category can be obtained; all label probabilities of the same category are then summed and averaged, a diagonal matrix is constructed after normalization, and the inverse of the diagonal matrix is taken to obtain the final correction matrix.
For example, after data A is input into the model, the probability of data A's label can be obtained; if that label corresponds to the first category, the probability corresponding to the first category is obtained by summing and averaging the label probabilities of all data of the first category. The probabilities corresponding to all categories are then normalized, the diagonal matrix is constructed, and its inverse is taken to obtain the final correction matrix.
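To make this construction concrete, the following is a minimal Python/numpy sketch; the `model(text, label_words)` callable, the `(text, category_index)` layout of `dataset`, and all other names are assumptions introduced for this example, not part of the patent:

import numpy as np

def build_correction_matrix(dataset, label_words, model):
    # Sketch: estimate the per-category label bias on the labeled training
    # set and invert it. Assumes every category appears at least once and
    # that model(text, label_words) returns one probability per label word.
    k = len(label_words)
    sums, counts = np.zeros(k), np.zeros(k)
    for text, cat in dataset:
        probs = model(text, label_words)  # one probability per class label
        sums[cat] += probs[cat]           # probability of this item's own label
        counts[cat] += 1
    avg = sums / counts                   # mean label probability per category
    avg = avg / avg.sum()                 # normalization
    W = np.diag(avg)                      # diagonal matrix of biased probabilities
    return np.linalg.inv(W)               # correction matrix: inverse of the diagonal

At prediction time a probability vector p would then be corrected as W_inv @ p, scaling down the categories that the biased model favored on the training data.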
In some embodiments, S3, expanding the classification label of each category into a plurality of sub-labels, further comprises:
obtaining a plurality of near-synonyms corresponding to the classification label of each category by using a preset model;
and screening a preset number of words from the plurality of near-synonyms corresponding to each category as the plurality of sub-labels corresponding to that category.
In some embodiments, screening a preset number of words from the plurality of near-synonyms corresponding to each category as the plurality of sub-labels corresponding to that category further comprises:
deleting, from the plurality of near-synonyms, the words that are not present in the vocabulary of the trained model;
adjusting the output of the trained model to the probabilities of the remaining near-synonyms;
inputting each data in the data set into the trained model to obtain the probabilities of the remaining near-synonyms;
deleting, according to the probabilities of the remaining near-synonyms output by the trained model, the words whose probability is lower than a first threshold;
and, among remaining near-synonyms whose probabilities differ by less than a second threshold, keeping only the word with the highest probability, and then selecting the preset number of words with the highest probabilities as the plurality of sub-labels corresponding to each category.
Specifically, the expanded label-mapping vocabulary can be constructed from near-synonyms output by a word2vec model. For the selection of near-synonyms, a word-embedding corpus covering more than 8 million Chinese words and phrases can be used to train the word2vec model so that it captures the correlations among words, and the near-synonyms output by the word2vec model are then screened to obtain the final sub-labels.
As shown in fig. 2, each near-synonym can first be checked, by traversal, against the vocabulary space of the model, and the label-mapping words not in the vocabulary space are deleted.
It should be noted that every word in the vocabulary space can serve as a label. When a datum is input into the model, the model can output the probability of every word in the vocabulary space; however, because the vocabulary space contains many words, the labels output by the model can be restricted according to actual requirements.
Each data in the training set can then be input into the model to obtain the probabilities of the remaining near-synonyms. Near-synonyms whose probability is lower than the average value are treated as rare words; rare words make the predicted probabilities unreliable and therefore need to be deleted.
Next, when the model predicts nearly identical probabilities for several near-synonyms, label expansion loses its significance; among near-synonyms with similar probability values, only the one with the highest predicted probability is therefore retained and the rest are deleted.
Finally, the top N words of the remaining list are selected as the final expanded label-mapping vocabulary, for example N = min(5, the number of label-mapping words after filtering).
Through the above process, each label is expanded into N near-synonyms, eliminating the deviation caused by a single label.
For example, after expansion, the label "good rating" of the first category (e.g., positive) is expanded to ["good", "positive", "satisfactory", "excellent", "bar"], and the label "bad rating" of the second category (e.g., negative) is expanded to ["poor", "negative", "disappointed", "not good", "bad"].
Ideally, all labels should occur with approximately the same frequency in the pre-training corpus. In experiments, however, the labels' occurrence frequencies in the corpus are found to differ, which gives the model a preference in its prediction results. In practical applications, it is very difficult to manually select qualified label-mapping words from a vocabulary space of nearly 60,000 words, and subjective factors are easily introduced. Therefore, the screening procedure above is adopted to expand the label-mapping words.
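As a non-authoritative illustration of the screening flow of fig. 2, the following sketch uses a gensim word2vec model; `plm_vocab`, `model`, `dataset`, the default `second_threshold` value, and every other name are assumptions introduced only for this example:

import numpy as np
from gensim.models import KeyedVectors

def expand_label(label_word, wv, plm_vocab, dataset, model,
                 second_threshold=0.01, top_k=20):
    # wv: a gensim KeyedVectors object, e.g.
    # wv = KeyedVectors.load_word2vec_format("embeddings.bin", binary=True)
    # Steps 1-2: near-synonym candidates that the PLM vocabulary contains.
    candidates = [w for w, _ in wv.most_similar(label_word, topn=top_k)
                  if w in plm_vocab]
    # Step 3: mean predicted probability of each candidate over the training
    # data; the first threshold is taken to be the average value, as above.
    probs = np.mean([model(text, candidates) for text, _ in dataset], axis=0)
    keep = probs >= probs.mean()
    candidates = [w for w, k in zip(candidates, keep) if k]
    probs = probs[keep]
    # Step 4: among words with near-identical probabilities, keep only the
    # most probable one (visiting words in descending order guarantees that).
    order = np.argsort(-probs)
    chosen, chosen_probs = [], []
    for i in order:
        if all(abs(probs[i] - p) >= second_threshold for p in chosen_probs):
            chosen.append(candidates[i])
            chosen_probs.append(probs[i])
    # Step 5: N = min(5, number of label-mapping words after filtering).
    return chosen[:min(5, len(chosen))]

Run once per category label, this returns at most five sub-labels per category, matching the expansion example above.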
in some embodiments, determining a final class of the data to be classified by using the probability of the plurality of sub-labels corresponding to each class and the correction matrix further includes:
and calculating the average value of the probabilities of the plurality of sub-labels corresponding to each class according to the class, multiplying the average value corresponding to each class by the correction matrix to obtain a corrected first probability, and taking the maximum value in the first probability of each class as the classification class of the data.
Specifically, after the correction matrix is obtained, the data to be classified may be input into the model, where the output of the model is the probability of the plurality of labels corresponding to each category, then the average value of the probabilities of the plurality of sub-labels corresponding to each category is calculated according to the category, and then the average value corresponding to each category is multiplied by the correction matrix to be used as the corrected first probability, and the maximum value in the first probability of each category is used as the classification category of the data.
For example, after the data B to be classified is input into the model, the model may output probabilities of 10 sub-labels, that is, probabilities of [ "good", "positive", "satisfactory", "excellent", "bar" ] and probabilities of [ "poor", "negative", "disapproval", "not good", "bad" ], then averages the probabilities of [ "good", "positive", "satisfactory", "excellent", "bar" ], and the probabilities of [ "poor", "negative", "disapproval", "not good", "bad" ], then multiplies the average values by the correction matrix, and finally takes the class corresponding to the average value with the largest probability value as the final classification class of the data B to be classified.
In some embodiments, determining the final category of the data to be classified by using the probabilities of the plurality of sub-labels corresponding to each category and the correction matrix further includes:
multiplying the maximum of the probabilities of the plurality of sub-labels corresponding to each category by the correction matrix to obtain a corrected second probability, and taking the category corresponding to the sub-label with the maximum second probability as a second classification category of the data.
Specifically, after the correction matrix is obtained, the data to be classified may be input into the model, whose output is the probabilities of the plurality of sub-labels corresponding to each category; the maximum of the probabilities of the sub-labels of each category is then multiplied by the correction matrix as the corrected second probability, and the category corresponding to the sub-label with the highest corrected probability is taken as the second classification category of the data.
For example, after data B to be classified is input into the model, the model may output the probabilities of 10 sub-labels, that is, the probabilities of ["good", "positive", "satisfactory", "excellent", "bar"] and of ["poor", "negative", "disappointed", "not good", "bad"]; the maximum of each group of probabilities is multiplied by the correction matrix, and the category whose corrected maximum is larger is taken as the final classification category of data B.
In some embodiments, determining the final category of the data to be classified by using the probabilities of the plurality of sub-labels corresponding to each category and the correction matrix further includes:
multiplying the probability of each of the plurality of sub-labels corresponding to each category by the correction matrix, averaging the corrected probabilities by category as a corrected third probability, and taking the category with the maximum third probability as a third classification category of the data.
Specifically, after the correction matrix is obtained, the data to be classified may be input into the model, whose output is the probabilities of the plurality of sub-labels corresponding to each category; the probability of each sub-label is multiplied by the correction matrix, the corrected probabilities are averaged by category as the corrected third probability, and the category with the maximum third probability is taken as the third classification category of the data.
For example, after data B to be classified is input into the model, the model may output the probabilities of 10 sub-labels, that is, the probabilities of ["good", "positive", "satisfactory", "excellent", "bar"] and of ["poor", "negative", "disappointed", "not good", "bad"]; each probability is multiplied by the correction matrix, the corrected probabilities of each group are averaged, and the category with the larger average is taken as the final classification category of data B.
In some embodiments, the data in the training set may also be corrected with each of the three correction methods above; the result of each method is compared against the training set labels, the method with the highest accuracy corresponds to the best correction scheme, and the data to be classified are then corrected with that best scheme, as sketched below.
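The sketch below, under the same hypothetical names as the previous snippets, shows the three correction strategies and the selection of the best one on the training set; `sub_probs` is a numpy vector of all sub-label probabilities for one input, `groups` lists, per category, the indices of that category's sub-labels, and `W_inv` is the correction matrix:

import numpy as np

def correct_and_classify(sub_probs, groups, W_inv, strategy):
    if strategy == 1:  # average per category, then correct
        per_class = np.array([sub_probs[g].mean() for g in groups])
        return int(np.argmax(W_inv @ per_class))
    if strategy == 2:  # max per category, then correct
        per_class = np.array([sub_probs[g].max() for g in groups])
        return int(np.argmax(W_inv @ per_class))
    if strategy == 3:  # correct each sub-label first, then average per category
        # With a diagonal W_inv, correcting a sub-label probability is taken
        # here to mean scaling it by its category's diagonal entry (an
        # assumption of this sketch).
        per_class = np.array([(W_inv[c, c] * sub_probs[g]).mean()
                              for c, g in enumerate(groups)])
        return int(np.argmax(per_class))
    raise ValueError("strategy must be 1, 2 or 3")

def pick_best_strategy(train_sub_probs, train_labels, groups, W_inv):
    # Accuracy of each strategy against the training-set labels.
    accuracy = {
        s: np.mean([correct_and_classify(p, groups, W_inv, s) == y
                    for p, y in zip(train_sub_probs, train_labels)])
        for s in (1, 2, 3)
    }
    return max(accuracy, key=accuracy.get)

The strategy returned by pick_best_strategy would then be the one applied to the data to be classified.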
The invention adopts a correction optimization scheme combining label expansion and correction: it eliminates the bias caused by the labels' different occurrence frequencies in the pre-training corpus and, by replacing the empty text with training set samples, corrects the bias introduced by both the label words and the input samples, so that retraining the large model is avoided and the accuracy and stability of downstream tasks are greatly improved. Applied to the CLUE Chinese classification data sets with the pre-trained hundred-billion-parameter model "source 1.0" (Yuan 1.0) loaded, the optimization scheme was tested to improve news classification accuracy by about 5 percentage points (52.09% before correction, 57.47% after), scientific literature subject classification accuracy by about 7 percentage points (39.02% before, 46.57% after), app-description long-text classification accuracy by about 4 percentage points (34.89% before, 38.82% after), and e-commerce product sentiment classification accuracy by about 35 percentage points (51.25% before, 86.88% after).
The existing correction method can only correct the model's bias toward the labels through empty text and cannot correct the bias introduced by the input samples; that is, existing methods calibrate all classes to an unbiased state even though the class distribution in the data set is not uniform. The invention replaces the empty text with training set samples to optimize the correction algorithm, so that the model can be corrected according to the data distribution; at the same time, expanding the labels eliminates the bias caused by the labels' different occurrence frequencies in the pre-training corpus.
Based on the same inventive concept, according to another aspect of the present invention, an embodiment of the present invention further provides a classification result correction system 400, as shown in fig. 3, including:
a construction module 401 configured to construct a data set and label each data in the data set with a classification label of a corresponding category;
a calculation module 402 configured to input each data in the data set into a trained model to obtain a probability of the corresponding class label and calculate a correction matrix using the class label probability corresponding to each data;
an expansion module 403 configured to expand the classification label of each category into a plurality of sub-labels;
an adjusting module 404 configured to adjust the output of the trained model to probabilities of a plurality of sub-labels corresponding to each category;
an input module 405 configured to input data to be classified into a trained model to obtain probabilities of a plurality of sub-labels corresponding to each class;
a correction module 406 configured to determine a final class of the data to be classified by using the probability of the plurality of sub-labels corresponding to each class and the correction matrix.
In some embodiments, the calculation module 402 is further configured to:
sum and average the classification label probabilities by category to obtain the probability corresponding to each category;
normalize the probabilities corresponding to the categories and construct a diagonal matrix from them;
and invert the diagonal matrix to obtain the correction matrix.
In some embodiments, the expansion module 403 is further configured to:
obtain a plurality of near-synonyms corresponding to the classification label of each category by using a preset model;
and screen a preset number of words from the plurality of near-synonyms corresponding to each category as the plurality of sub-labels corresponding to that category.
In some embodiments, the expansion module 403 is further configured to:
delete, from the plurality of near-synonyms, the words that are not present in the vocabulary of the trained model;
adjust the output of the trained model to the probabilities of the remaining near-synonyms;
input each data in the data set into the trained model to obtain the probabilities of the remaining near-synonyms;
delete, according to the probabilities of the remaining near-synonyms output by the trained model, the words whose probability is lower than a first threshold;
and, among remaining near-synonyms whose probabilities differ by less than a second threshold, keep only the word with the highest probability, and then select the preset number of words with the highest probabilities as the plurality of sub-labels corresponding to each category.
In some embodiments, the correction module 406 is further configured to:
average the probabilities of the plurality of sub-labels by category, multiply the average corresponding to each category by the correction matrix to obtain a corrected first probability, and take the category with the maximum first probability as the classification category of the data.
In some embodiments, the correction module 406 is further configured to:
multiply the maximum of the probabilities of the plurality of sub-labels corresponding to each category by the correction matrix to obtain a corrected second probability, and take the category corresponding to the sub-label with the maximum second probability as a second classification category of the data.
In some embodiments, the correction module 406 is further configured to:
multiply the probability of each of the plurality of sub-labels corresponding to each category by the correction matrix, average the corrected probabilities by category as a corrected third probability, and take the category with the maximum third probability as a third classification category of the data.
Based on the same inventive concept, according to another aspect of the present invention, as shown in fig. 4, an embodiment of the present invention further provides a computer apparatus 501, comprising:
at least one processor 520; and
a memory 510, the memory 510 storing a computer program 511 executable on the processor, the processor 520 executing the program to perform the steps of:
s1, constructing a data set and labeling each data in the data set with a classification label of a corresponding category;
s2, inputting each data in the data set into a trained model to obtain the probability of the corresponding classification label and calculating a correction matrix by using the probability of the classification label corresponding to each data;
s3, expanding the classification label of each category into a plurality of sub-labels;
s4, adjusting the output of the trained model to be the probability of a plurality of sub-labels corresponding to each category;
s5, inputting the data to be classified into the trained model to obtain the probability of a plurality of sub-labels corresponding to each class;
s6, determining the final category of the data to be classified by using the probability of the plurality of sub-labels corresponding to each category and the correction matrix.
In some embodiments, inputting each data in the data set into a trained model to obtain the probability of the corresponding classification label and calculating a correction matrix using the classification label probability corresponding to each data, further comprises:
summing and averaging the classification label probabilities by category to obtain the probability corresponding to each category;
normalizing the probabilities corresponding to the categories and constructing a diagonal matrix from them;
and inverting the diagonal matrix to obtain the correction matrix.
In some embodiments, expanding the classification label of each category into a plurality of sub-labels further comprises:
obtaining a plurality of near-synonyms corresponding to the classification label of each category by using a preset model;
and screening a preset number of words from the plurality of near-synonyms corresponding to each category as the plurality of sub-labels corresponding to that category.
In some embodiments, screening a preset number of words from the plurality of near-synonyms corresponding to each category as the plurality of sub-labels corresponding to that category further comprises:
deleting, from the plurality of near-synonyms, the words that are not present in the vocabulary of the trained model;
adjusting the output of the trained model to the probabilities of the remaining near-synonyms;
inputting each data in the data set into the trained model to obtain the probabilities of the remaining near-synonyms;
deleting, according to the probabilities of the remaining near-synonyms output by the trained model, the words whose probability is lower than a first threshold;
and, among remaining near-synonyms whose probabilities differ by less than a second threshold, keeping only the word with the highest probability, and then selecting the preset number of words with the highest probabilities as the plurality of sub-labels corresponding to each category.
In some embodiments, determining the final category of the data to be classified by using the probabilities of the plurality of sub-labels corresponding to each category and the correction matrix further includes:
averaging the probabilities of the plurality of sub-labels by category, multiplying the average corresponding to each category by the correction matrix to obtain a corrected first probability, and taking the category with the maximum first probability as the classification category of the data.
In some embodiments, determining the final category of the data to be classified by using the probabilities of the plurality of sub-labels corresponding to each category and the correction matrix further includes:
multiplying the maximum of the probabilities of the plurality of sub-labels corresponding to each category by the correction matrix to obtain a corrected second probability, and taking the category corresponding to the sub-label with the maximum second probability as a second classification category of the data.
In some embodiments, determining the final category of the data to be classified by using the probabilities of the plurality of sub-labels corresponding to each category and the correction matrix further includes:
multiplying the probability of each of the plurality of sub-labels corresponding to each category by the correction matrix, averaging the corrected probabilities by category as a corrected third probability, and taking the category with the maximum third probability as a third classification category of the data.
Based on the same inventive concept, according to another aspect of the present invention, as shown in fig. 5, an embodiment of the present invention further provides a computer-readable storage medium 601, where the computer-readable storage medium 601 stores computer program instructions 610, and the computer program instructions 610, when executed by a processor, perform the following steps:
s1, constructing a data set and labeling each data in the data set with a classification label of a corresponding category;
s2, inputting each data in the data set into a trained model to obtain the probability of the corresponding classification label and calculating a correction matrix by using the probability of the classification label corresponding to each data;
s3, expanding the classification label of each category into a plurality of sub-labels;
s4, adjusting the output of the trained model to be the probability of a plurality of sub-labels corresponding to each category;
s5, inputting the data to be classified into the trained model to obtain the probability of a plurality of sub-labels corresponding to each class;
s6, determining the final category of the data to be classified by using the probability of the plurality of sub-labels corresponding to each category and the correction matrix.
In some embodiments, inputting each data in the data set into a trained model to obtain the probability of the corresponding classification label and calculating a correction matrix using the classification label probability corresponding to each data, further comprises:
summing and averaging the classification label probabilities by category to obtain the probability corresponding to each category;
normalizing the probabilities corresponding to the categories and constructing a diagonal matrix from them;
and inverting the diagonal matrix to obtain the correction matrix.
In some embodiments, expanding the classification label of each category into a plurality of sub-labels further comprises:
obtaining a plurality of near-synonyms corresponding to the classification label of each category by using a preset model;
and screening a preset number of words from the plurality of near-synonyms corresponding to each category as the plurality of sub-labels corresponding to that category.
In some embodiments, screening a preset number of words from the plurality of near-synonyms corresponding to each category as the plurality of sub-labels corresponding to that category further comprises:
deleting, from the plurality of near-synonyms, the words that are not present in the vocabulary of the trained model;
adjusting the output of the trained model to the probabilities of the remaining near-synonyms;
inputting each data in the data set into the trained model to obtain the probabilities of the remaining near-synonyms;
deleting, according to the probabilities of the remaining near-synonyms output by the trained model, the words whose probability is lower than a first threshold;
and, among remaining near-synonyms whose probabilities differ by less than a second threshold, keeping only the word with the highest probability, and then selecting the preset number of words with the highest probabilities as the plurality of sub-labels corresponding to each category.
In some embodiments, determining the final category of the data to be classified by using the probabilities of the plurality of sub-labels corresponding to each category and the correction matrix further includes:
averaging the probabilities of the plurality of sub-labels by category, multiplying the average corresponding to each category by the correction matrix to obtain a corrected first probability, and taking the category with the maximum first probability as the classification category of the data.
In some embodiments, determining the final category of the data to be classified by using the probabilities of the plurality of sub-labels corresponding to each category and the correction matrix further includes:
multiplying the maximum of the probabilities of the plurality of sub-labels corresponding to each category by the correction matrix to obtain a corrected second probability, and taking the category corresponding to the sub-label with the maximum second probability as a second classification category of the data.
In some embodiments, determining the final category of the data to be classified by using the probabilities of the plurality of sub-labels corresponding to each category and the correction matrix further includes:
multiplying the probability of each of the plurality of sub-labels corresponding to each category by the correction matrix, averaging the corrected probabilities by category as a corrected third probability, and taking the category with the maximum third probability as a third classification category of the data.
Finally, it should be noted that, as will be understood by those skilled in the art, all or part of the processes of the methods of the above embodiments may be implemented by a computer program, which may be stored in a computer-readable storage medium, and when executed, may include the processes of the embodiments of the methods described above.
Further, it should be appreciated that the computer-readable storage media (e.g., memory) herein can be either volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as software or hardware depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the disclosed embodiments of the present invention.
The foregoing is an exemplary embodiment of the present disclosure, but it should be noted that various changes and modifications could be made herein without departing from the scope of the present disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the disclosed embodiments described herein need not be performed in any particular order. Furthermore, although elements of the disclosed embodiments of the invention may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.
It should be understood that, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly supports the exception. It should also be understood that "and/or" as used herein is meant to include any and all possible combinations of one or more of the associated listed items.
The numbers of the embodiments disclosed in the embodiments of the present invention are merely for description, and do not represent the merits of the embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, and the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
Those of ordinary skill in the art will understand that the discussion of any embodiment above is only exemplary and is not intended to imply that the scope of the disclosure, including the claims, is limited to these examples; within the idea of the embodiments of the invention, technical features in the above embodiments or in different embodiments may also be combined, and many other variations of the different aspects of the embodiments exist as described above, which are not provided in detail for the sake of brevity. Therefore, any omissions, modifications, substitutions, improvements, and the like made within the spirit and principles of the embodiments of the present invention are intended to be included within the scope of the embodiments of the present invention.

Claims (9)

1. A classification result correction method is characterized by comprising the following steps:
constructing a data set and labeling each data in the data set with a classification label of a corresponding category;
inputting each data in the data set into a trained model to obtain the probability of the corresponding classification label and calculating a correction matrix by using the probability of the classification label corresponding to each data;
expanding the classification label of each category into a plurality of sub-labels;
adjusting the output of the trained model to be the probability of a plurality of sub-labels corresponding to each category;
inputting data to be classified into a trained model to obtain the probability of a plurality of sub-labels corresponding to each class;
determining the final class of the data to be classified by using the probability of the plurality of sub-labels corresponding to each class and the correction matrix;
wherein inputting each data in the data set into a trained model to obtain the probability of the corresponding classification label and calculating a correction matrix using the classification label probability corresponding to each data further comprises:
summing and averaging the classification label probabilities by category to obtain the probability corresponding to each category;
normalizing the probabilities corresponding to the categories and constructing a diagonal matrix from them;
and inverting the diagonal matrix to obtain the correction matrix.
2. The method of claim 1, wherein expanding the classification label of each category into a plurality of sub-labels further comprises:
obtaining a plurality of near-synonyms corresponding to the classification label of each category by using a preset model;
and screening a preset number of words from the plurality of near-synonyms corresponding to each category as the plurality of sub-labels corresponding to that category.
3. The method of claim 2, wherein screening a preset number of words from the plurality of near-synonyms corresponding to each category as the plurality of sub-labels corresponding to that category further comprises:
deleting, from the plurality of near-synonyms, the words that are not present in the vocabulary of the trained model;
adjusting the output of the trained model to the probabilities of the remaining near-synonyms;
inputting each data in the data set into the trained model to obtain the probabilities of the remaining near-synonyms;
deleting, according to the probabilities of the remaining near-synonyms output by the trained model, the words whose probability is lower than a first threshold;
and, among remaining near-synonyms whose probabilities differ by less than a second threshold, keeping only the word with the highest probability, and then selecting the preset number of words with the highest probabilities as the plurality of sub-labels corresponding to each category.
4. The method of claim 1, wherein determining the final category of the data to be classified using the probabilities of the plurality of sub-labels corresponding to each category and the correction matrix further comprises:
averaging the probabilities of the plurality of sub-labels by category, multiplying the average corresponding to each category by the correction matrix to obtain a corrected first probability, and taking the category with the maximum first probability as the classification category of the data.
5. The method of claim 1, wherein determining the final category of the data to be classified using the probabilities of the plurality of sub-labels corresponding to each category and the correction matrix further comprises:
multiplying the maximum of the probabilities of the plurality of sub-labels corresponding to each category by the correction matrix to obtain a corrected second probability, and taking the category corresponding to the sub-label with the maximum second probability as a second classification category of the data.
6. The method of claim 1, wherein determining the final category of the data to be classified using the probabilities of the plurality of sub-labels corresponding to each category and the correction matrix further comprises:
multiplying the probability of each of the plurality of sub-labels corresponding to each category by the correction matrix, averaging the corrected probabilities by category as a corrected third probability, and taking the category with the maximum third probability as a third classification category of the data.
7. A classification result correction system, comprising:
the building module is configured to build a data set and label each data in the data set with a classification label of a corresponding category;
a calculation module configured to input each data in the data set into a trained model to obtain a probability of the corresponding classification label and calculate a correction matrix using the probability of the classification label corresponding to each data;
an expansion module configured to expand the classification label of each category into a plurality of sub-labels;
an adjustment module configured to adjust an output of the trained model to probabilities of a plurality of sub-labels corresponding to each category;
the input module is configured to input the data to be classified into the trained model to obtain the probability of a plurality of sub-labels corresponding to each class;
the correction module is configured to determine a final class of the data to be classified by using the probability of the plurality of sub-labels corresponding to each class and the correction matrix;
the calculation module is further configured to sum and average the classification label probabilities by category to obtain the probability corresponding to each category;
normalize the probabilities corresponding to the categories and construct a diagonal matrix from them;
and invert the diagonal matrix to obtain the correction matrix.
8. A computer device, comprising:
at least one processor; and
memory storing a computer program operable on the processor, characterized in that the processor executes the program to perform the steps of the method according to any of claims 1-6.
9. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, is adapted to carry out the steps of the method according to any one of claims 1-6.
CN202210133548.8A 2022-02-14 2022-02-14 Classification result correction method, system, device and medium Active CN114186065B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210133548.8A CN114186065B (en) 2022-02-14 2022-02-14 Classification result correction method, system, device and medium
PCT/CN2022/122302 WO2023151284A1 (en) 2022-02-14 2022-09-28 Classification result correction method and system, device, and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210133548.8A CN114186065B (en) 2022-02-14 2022-02-14 Classification result correction method, system, device and medium

Publications (2)

Publication Number Publication Date
CN114186065A (en) 2022-03-15
CN114186065B (en) 2022-05-17

Family

ID=80545885

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210133548.8A Active CN114186065B (en) 2022-02-14 2022-02-14 Classification result correction method, system, device and medium

Country Status (2)

Country Link
CN (1) CN114186065B (en)
WO (1) WO2023151284A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114186065B (en) * 2022-02-14 2022-05-17 苏州浪潮智能科技有限公司 Classification result correction method, system, device and medium

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2019185551A (en) * 2018-04-13 2019-10-24 株式会社Preferred Networks Annotation added text data expanding method, annotation added text data expanding program, annotation added text data expanding apparatus, and training method of text classification model
CN111382248B (en) * 2018-12-29 2023-05-23 深圳市优必选科技有限公司 Question replying method and device, storage medium and terminal equipment
CN110232397A (en) * 2019-04-22 2019-09-13 广东工业大学 A kind of multi-tag classification method of combination supporting vector machine and projection matrix
CN110490849A (en) * 2019-08-06 2019-11-22 桂林电子科技大学 Surface Defects in Steel Plate classification method and device based on depth convolutional neural networks
CN111326148B (en) * 2020-01-19 2021-02-23 北京世纪好未来教育科技有限公司 Confidence correction and model training method, device, equipment and storage medium thereof
CN111460150B (en) * 2020-03-27 2023-11-10 北京小米松果电子有限公司 Classification model training method, classification method, device and storage medium
EP3913538A1 (en) * 2020-05-20 2021-11-24 Robert Bosch GmbH Classification model calibration
CN113987136A (en) * 2021-11-29 2022-01-28 沈阳东软智能医疗科技研究院有限公司 Method, device and equipment for correcting text classification label and storage medium
CN114186065B (en) * 2022-02-14 2022-05-17 苏州浪潮智能科技有限公司 Classification result correction method, system, device and medium

Also Published As

Publication number Publication date
WO2023151284A1 (en) 2023-08-17
CN114186065A (en) 2022-03-15

Similar Documents

Publication Publication Date Title
US11081105B2 (en) Model learning device, method and recording medium for learning neural network model
US11080492B2 (en) Method and device for correcting error in text
CN105988990B (en) Chinese zero-reference resolution device and method, model training method and storage medium
US20210193161A1 (en) Acoustic model training method, speech recognition method, acoustic model training apparatus, speech recognition apparatus, acoustic model training program, and speech recognition program
CN110930993B (en) Specific domain language model generation method and voice data labeling system
US20200210520A1 (en) Determination of field types in tabular data
JP2015075706A (en) Error correction model learning device and program
KR20190133624A (en) A method and system for context sensitive spelling error correction using realtime candidate generation
TWI567569B (en) Natural language processing systems, natural language processing methods, and natural language processing programs
CN109522550B (en) Text information error correction method and device, computer equipment and storage medium
CN114186065B (en) Classification result correction method, system, device and medium
US11074406B2 (en) Device for automatically detecting morpheme part of speech tagging corpus error by using rough sets, and method therefor
CN114492363A (en) Small sample fine adjustment method, system and related device
CN112861521B (en) Speech recognition result error correction method, electronic device and storage medium
WO2018062265A1 (en) Acoustic model learning device, method therefor, and program
US20210406464A1 (en) Skill word evaluation method and device, electronic device, and non-transitory computer readable storage medium
CN110390093B (en) Language model building method and device
CN113963682A (en) Voice recognition correction method and device, electronic equipment and storage medium
US20100296728A1 (en) Discrimination Apparatus, Method of Discrimination, and Computer Program
CN113177405A (en) Method, device and equipment for correcting data errors based on BERT and storage medium
CN112183072A (en) Text error correction method and device, electronic equipment and readable storage medium
CN111797614A (en) Text processing method and device
CN111554295B (en) Text error correction method, related device and readable storage medium
CN111832288B (en) Text correction method and device, electronic equipment and storage medium
CN112651230A (en) Fusion language model generation method and device, word error correction method and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant