CN114186065B - Classification result correction method, system, device and medium - Google Patents

Classification result correction method, system, device and medium

Info

Publication number
CN114186065B
Authority
CN
China
Prior art keywords
probability
data
category
class
sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210133548.8A
Other languages
Chinese (zh)
Other versions
CN114186065A (en)
Inventor
刘红丽
李峰
于彤
周镇镇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd filed Critical Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202210133548.8A
Publication of CN114186065A
Application granted
Publication of CN114186065B
Priority to PCT/CN2022/122302
Legal status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/237 - Lexical tools
    • G06F40/247 - Thesauruses; Synonyms

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a classification result correction method, which comprises the following steps: constructing a data set and labeling each data in the data set with a classification label of the corresponding category; inputting each data in the data set into a trained model to obtain the probability of the corresponding classification label, and calculating a correction matrix by using the classification label probability corresponding to each data; expanding the classification label of each category into a plurality of sub-labels; adjusting the output of the trained model to be the probabilities of the plurality of sub-labels corresponding to each category; inputting data to be classified into the trained model to obtain the probabilities of the plurality of sub-labels corresponding to each category; and determining the final category of the data to be classified by using the probabilities of the plurality of sub-labels corresponding to each category and the correction matrix. The invention also discloses a system, a computer device, and a readable storage medium. By expanding the labels, the scheme provided by the invention eliminates the bias caused by the labels' different occurrence frequencies.

Description

Classification result correction method, system, device and medium
Technical Field
The present invention relates to the field of classification, and in particular, to a method, system, device, and storage medium for correcting a classification result.
Background
The core capabilities of large-scale models are zero-shot and few-shot learning; that is, the model does not need to be retrained when facing different application tasks. However, large-scale models can acquire bias from the corpus during pre-training, resulting in low accuracy or unstable performance on downstream tasks. The existing solution is to compensate for the biased label words through content-free (empty-text) input, calibrating them to an unbiased state and reducing the differences between different prompt options. However, because the labels occur with different frequencies in the pre-training corpus, the model retains a preference in its prediction results, that is, the output accuracy of the model remains low. The existing correction method can therefore only correct the model's bias toward the labels and cannot correct the bias introduced by the input samples.
Disclosure of Invention
In view of the above, in order to overcome at least one aspect of the above problems, an embodiment of the present invention provides a classification result correction method, including the following steps:
constructing a data set and labeling each data in the data set with a classification label of a corresponding category;
inputting each data in the data set into a trained model to obtain the probability of the corresponding classification label and calculating a correction matrix by using the probability of the classification label corresponding to each data;
expanding the classification label of each category into a plurality of sub-labels;
adjusting the output of the trained model to be the probability of a plurality of sub-labels corresponding to each category;
inputting data to be classified into a trained model to obtain the probability of a plurality of sub-labels corresponding to each class;
and determining the final class of the data to be classified by utilizing the probability of the plurality of sub-labels corresponding to each class and the correction matrix.
In some embodiments, inputting each data in the data set into the trained model to obtain the probability of the corresponding classification label and calculating a correction matrix by using the classification label probability corresponding to each data further comprises:
summing and averaging the classification label probabilities by category to obtain the probability corresponding to each category;
normalizing the probabilities corresponding to the categories and constructing a diagonal matrix from them;
and inverting the diagonal matrix to obtain the correction matrix.
In some embodiments, expanding the classification label of each category into a plurality of sub-labels further comprises:
obtaining a plurality of near-synonyms corresponding to the classification label of each category by using a preset model;
and screening a preset number of words from the plurality of near-synonyms corresponding to each category as the plurality of sub-labels corresponding to that category.
In some embodiments, screening a preset number of words from the plurality of near-synonyms corresponding to each category as the plurality of sub-labels corresponding to that category further comprises:
deleting, from the plurality of near-synonyms, the words that are not present in the vocabulary of the trained model;
adjusting the output of the trained model to the probabilities of the remaining near-synonyms;
inputting each data in the data set into the trained model to obtain the probabilities of the remaining near-synonyms;
deleting, according to the probabilities of the remaining near-synonyms output by the trained model, the words whose probability is lower than a first threshold;
and, among remaining near-synonyms whose probabilities differ by less than a second threshold, keeping only the word with the highest probability, and then selecting the preset number of words with the highest probabilities as the plurality of sub-labels corresponding to each category.
In some embodiments, determining the final category of the data to be classified by using the probabilities of the plurality of sub-labels corresponding to each category and the correction matrix further includes:
averaging the probabilities of the plurality of sub-labels by category, multiplying the average corresponding to each category by the correction matrix to obtain a corrected first probability, and taking the category with the maximum first probability as the classification category of the data.
In some embodiments, determining the final category of the data to be classified by using the probabilities of the plurality of sub-labels corresponding to each category and the correction matrix further includes:
multiplying the maximum of the probabilities of the plurality of sub-labels corresponding to each category by the correction matrix to obtain a corrected second probability, and taking the category corresponding to the sub-label with the maximum second probability as a second classification category of the data.
In some embodiments, determining the final category of the data to be classified by using the probabilities of the plurality of sub-labels corresponding to each category and the correction matrix further includes:
multiplying the probability of each of the plurality of sub-labels corresponding to each category by the correction matrix, averaging the corrected probabilities by category as a corrected third probability, and taking the category with the maximum third probability as a third classification category of the data.
Based on the same inventive concept, according to another aspect of the present invention, an embodiment of the present invention further provides a classification result correction system, including:
the building module is configured to build a data set and label each data in the data set with a classification label of a corresponding category;
a calculation module configured to input each data in the data set into a trained model to obtain a probability of the corresponding classification label and calculate a correction matrix using the classification label probability corresponding to each data;
an expansion module configured to expand the classification label of each category into a plurality of sub-labels;
an adjustment module configured to adjust an output of the trained model to probabilities of a plurality of sub-labels corresponding to each category;
the input module is configured to input the data to be classified into the trained model to obtain the probability of a plurality of sub-labels corresponding to each class;
and the correction module is configured to determine the final category of the data to be classified by using the probability of the plurality of sub-labels corresponding to each category and the correction matrix.
Based on the same inventive concept, according to another aspect of the present invention, an embodiment of the present invention further provides a computer apparatus, including:
at least one processor; and
a memory storing a computer program operable on the processor, wherein the processor, when executing the program, performs the steps of any one of the classification result correction methods described above.
Based on the same inventive concept, according to another aspect of the present invention, an embodiment of the present invention further provides a computer-readable storage medium storing a computer program which, when executed by a processor, performs the steps of any one of the classification result correction methods described above.
The invention has at least the following beneficial technical effect: in the scheme provided by the invention, expanding the labels eliminates the bias caused by the labels' different occurrence frequencies in the pre-training corpus, and replacing the empty text with training set samples corrects the bias introduced by both the label words and the input samples.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present invention, and that those skilled in the art can obtain other drawings from these drawings without creative effort.
Fig. 1 is a schematic flow chart of a classification result correction method according to an embodiment of the present invention;
FIG. 2 is a flow diagram of tag expansion provided by an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a classification result correction system according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a computer device provided in an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a computer-readable storage medium according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention are described in further detail with reference to the accompanying drawings.
It should be noted that the expressions "first" and "second" in the embodiments of the present invention are used only to distinguish two entities or parameters that share the same name but are not identical; "first" and "second" are merely for convenience of description, should not be construed as limiting the embodiments of the present invention, and are not explained again in the following embodiments.
According to an aspect of the present invention, an embodiment of the present invention provides a classification result correction method, as shown in fig. 1, which may include the steps of:
s1, constructing a data set and labeling each data in the data set with a classification label of a corresponding category;
s2, inputting each data in the data set into a trained model to obtain the probability of the corresponding classification label and calculating a correction matrix by using the probability of the classification label corresponding to each data;
s3, expanding the classification label of each category into a plurality of sub-labels;
s4, adjusting the output of the trained model to be the probability of a plurality of sub-labels corresponding to each category;
s5, inputting the data to be classified into the trained model to obtain the probability of a plurality of sub-labels corresponding to each class;
s6, determining the final category of the data to be classified by using the probability of the plurality of sub-labels corresponding to each category and the correction matrix.
In the scheme provided by the invention, expanding the labels eliminates the bias caused by the labels' different occurrence frequencies in the pre-training corpus, and replacing the empty text with training set samples corrects the bias introduced by both the label words and the input samples.
In some embodiments, step S2, inputting each data in the data set into a trained model to obtain the probability of the corresponding classification label and calculating a correction matrix using the classification label probability corresponding to each data, further includes:
summing and averaging the classification label probabilities by category to obtain the probability corresponding to each category;
normalizing the probabilities corresponding to the categories and constructing a diagonal matrix from them;
and inverting the diagonal matrix to obtain the correction matrix.
Specifically, after each data in the training set is input into the model (a pre-trained language model, PLM), the label probability of its corresponding category can be obtained; all label probabilities of the same category are then summed and averaged, a diagonal matrix is constructed after normalization, and the inverse of the diagonal matrix is taken to obtain the final correction matrix.
For example, after data A is input into the model, the probability of data A's label can be obtained; if that label corresponds to the first category, the probability corresponding to the first category is obtained by summing and averaging the label probabilities of all data of the first category. The probabilities corresponding to all categories are then normalized, the diagonal matrix is constructed, and its inverse is taken to obtain the final correction matrix.
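To make this construction concrete, the following is a minimal Python/numpy sketch; the `model(text, label_words)` callable, the `(text, category_index)` layout of `dataset`, and all other names are assumptions introduced for this example, not part of the patent:

import numpy as np

def build_correction_matrix(dataset, label_words, model):
    # Sketch: estimate the per-category label bias on the labeled training
    # set and invert it. Assumes every category appears at least once and
    # that model(text, label_words) returns one probability per label word.
    k = len(label_words)
    sums, counts = np.zeros(k), np.zeros(k)
    for text, cat in dataset:
        probs = model(text, label_words)  # one probability per class label
        sums[cat] += probs[cat]           # probability of this item's own label
        counts[cat] += 1
    avg = sums / counts                   # mean label probability per category
    avg = avg / avg.sum()                 # normalization
    W = np.diag(avg)                      # diagonal matrix of biased probabilities
    return np.linalg.inv(W)               # correction matrix: inverse of the diagonal

At prediction time a probability vector p would then be corrected as W_inv @ p, scaling down the categories that the biased model favored on the training data.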
In some embodiments, S3, expanding the classification label of each category into a plurality of sub-labels, further comprises:
obtaining a plurality of near-synonyms corresponding to the classification label of each category by using a preset model;
and screening a preset number of words from the plurality of near-synonyms corresponding to each category as the plurality of sub-labels corresponding to that category.
In some embodiments, screening a preset number of words from the plurality of near-synonyms corresponding to each category as the plurality of sub-labels corresponding to that category further comprises:
deleting, from the plurality of near-synonyms, the words that are not present in the vocabulary of the trained model;
adjusting the output of the trained model to the probabilities of the remaining near-synonyms;
inputting each data in the data set into the trained model to obtain the probabilities of the remaining near-synonyms;
deleting, according to the probabilities of the remaining near-synonyms output by the trained model, the words whose probability is lower than a first threshold;
and, among remaining near-synonyms whose probabilities differ by less than a second threshold, keeping only the word with the highest probability, and then selecting the preset number of words with the highest probabilities as the plurality of sub-labels corresponding to each category.
Specifically, the expanded label-mapping vocabulary can be constructed from near-synonyms output by a word2vec model. For the selection of near-synonyms, a word-embedding corpus covering more than 8 million Chinese words and phrases can be used to train the word2vec model so that it captures the correlations among words, and the near-synonyms output by the word2vec model are then screened to obtain the final sub-labels.
As shown in fig. 2, each near-synonym can first be checked, by traversal, against the vocabulary space of the model, and the label-mapping words not in the vocabulary space are deleted.
It should be noted that every word in the vocabulary space can serve as a label. When a datum is input into the model, the model can output the probability of every word in the vocabulary space; however, because the vocabulary space contains many words, the labels output by the model can be restricted according to actual requirements.
Each data in the training set can then be input into the model to obtain the probabilities of the remaining near-synonyms. Near-synonyms whose probability is lower than the average value are treated as rare words; rare words make the predicted probabilities unreliable and therefore need to be deleted.
Next, when the model predicts nearly identical probabilities for several near-synonyms, label expansion loses its significance; among near-synonyms with similar probability values, only the one with the highest predicted probability is therefore retained and the rest are deleted.
Finally, the top N words of the remaining list are selected as the final expanded label-mapping vocabulary, for example N = min(5, the number of label-mapping words after filtering).
Through the above process, each label is expanded into N near-synonyms, eliminating the deviation caused by a single label.
For example, after expansion, the label "good rating" of the first category (e.g., positive) is expanded to ["good", "positive", "satisfactory", "excellent", "bar"], and the label "bad rating" of the second category (e.g., negative) is expanded to ["poor", "negative", "disappointed", "not good", "bad"].
Ideally, all labels should occur with approximately the same frequency in the pre-training corpus. In experiments, however, the labels' occurrence frequencies in the corpus are found to differ, which gives the model a preference in its prediction results. In practical applications, it is very difficult to manually select qualified label-mapping words from a vocabulary space of nearly 60,000 words, and subjective factors are easily introduced. Therefore, the screening procedure above is adopted to expand the label-mapping words.
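As a non-authoritative illustration of the screening flow of fig. 2, the following sketch uses a gensim word2vec model; `plm_vocab`, `model`, `dataset`, the default `second_threshold` value, and every other name are assumptions introduced only for this example:

import numpy as np
from gensim.models import KeyedVectors

def expand_label(label_word, wv, plm_vocab, dataset, model,
                 second_threshold=0.01, top_k=20):
    # wv: a gensim KeyedVectors object, e.g.
    # wv = KeyedVectors.load_word2vec_format("embeddings.bin", binary=True)
    # Steps 1-2: near-synonym candidates that the PLM vocabulary contains.
    candidates = [w for w, _ in wv.most_similar(label_word, topn=top_k)
                  if w in plm_vocab]
    # Step 3: mean predicted probability of each candidate over the training
    # data; the first threshold is taken to be the average value, as above.
    probs = np.mean([model(text, candidates) for text, _ in dataset], axis=0)
    keep = probs >= probs.mean()
    candidates = [w for w, k in zip(candidates, keep) if k]
    probs = probs[keep]
    # Step 4: among words with near-identical probabilities, keep only the
    # most probable one (visiting words in descending order guarantees that).
    order = np.argsort(-probs)
    chosen, chosen_probs = [], []
    for i in order:
        if all(abs(probs[i] - p) >= second_threshold for p in chosen_probs):
            chosen.append(candidates[i])
            chosen_probs.append(probs[i])
    # Step 5: N = min(5, number of label-mapping words after filtering).
    return chosen[:min(5, len(chosen))]

Run once per category label, this returns at most five sub-labels per category, matching the expansion example above.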
in some embodiments, determining a final class of the data to be classified by using the probability of the plurality of sub-labels corresponding to each class and the correction matrix further includes:
and calculating the average value of the probabilities of the plurality of sub-labels corresponding to each class according to the class, multiplying the average value corresponding to each class by the correction matrix to obtain a corrected first probability, and taking the maximum value in the first probability of each class as the classification class of the data.
Specifically, after the correction matrix is obtained, the data to be classified may be input into the model, where the output of the model is the probability of the plurality of labels corresponding to each category, then the average value of the probabilities of the plurality of sub-labels corresponding to each category is calculated according to the category, and then the average value corresponding to each category is multiplied by the correction matrix to be used as the corrected first probability, and the maximum value in the first probability of each category is used as the classification category of the data.
For example, after the data B to be classified is input into the model, the model may output probabilities of 10 sub-labels, that is, probabilities of [ "good", "positive", "satisfactory", "excellent", "bar" ] and probabilities of [ "poor", "negative", "disapproval", "not good", "bad" ], then averages the probabilities of [ "good", "positive", "satisfactory", "excellent", "bar" ], and the probabilities of [ "poor", "negative", "disapproval", "not good", "bad" ], then multiplies the average values by the correction matrix, and finally takes the class corresponding to the average value with the largest probability value as the final classification class of the data B to be classified.
In some embodiments, determining the final category of the data to be classified by using the probabilities of the plurality of sub-labels corresponding to each category and the correction matrix further includes:
multiplying the maximum of the probabilities of the plurality of sub-labels corresponding to each category by the correction matrix to obtain a corrected second probability, and taking the category corresponding to the sub-label with the maximum second probability as a second classification category of the data.
Specifically, after the correction matrix is obtained, the data to be classified may be input into the model, whose output is the probabilities of the plurality of sub-labels corresponding to each category; the maximum of the probabilities of the sub-labels of each category is then multiplied by the correction matrix as the corrected second probability, and the category corresponding to the sub-label with the highest corrected probability is taken as the second classification category of the data.
For example, after data B to be classified is input into the model, the model may output the probabilities of 10 sub-labels, that is, the probabilities of ["good", "positive", "satisfactory", "excellent", "bar"] and of ["poor", "negative", "disappointed", "not good", "bad"]; the maximum of each group of probabilities is multiplied by the correction matrix, and the category whose corrected maximum is larger is taken as the final classification category of data B.
In some embodiments, determining the final category of the data to be classified by using the probabilities of the plurality of sub-labels corresponding to each category and the correction matrix further includes:
multiplying the probability of each of the plurality of sub-labels corresponding to each category by the correction matrix, averaging the corrected probabilities by category as a corrected third probability, and taking the category with the maximum third probability as a third classification category of the data.
Specifically, after the correction matrix is obtained, the data to be classified may be input into the model, whose output is the probabilities of the plurality of sub-labels corresponding to each category; the probability of each sub-label is multiplied by the correction matrix, the corrected probabilities are averaged by category as the corrected third probability, and the category with the maximum third probability is taken as the third classification category of the data.
For example, after data B to be classified is input into the model, the model may output the probabilities of 10 sub-labels, that is, the probabilities of ["good", "positive", "satisfactory", "excellent", "bar"] and of ["poor", "negative", "disappointed", "not good", "bad"]; each probability is multiplied by the correction matrix, the corrected probabilities of each group are averaged, and the category with the larger average is taken as the final classification category of data B.
In some embodiments, the data in the training set may also be corrected with each of the three correction methods above; the result of each method is compared against the training set labels, the method with the highest accuracy corresponds to the best correction scheme, and the data to be classified are then corrected with that best scheme, as sketched below.
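The sketch below, under the same hypothetical names as the previous snippets, shows the three correction strategies and the selection of the best one on the training set; `sub_probs` is a numpy vector of all sub-label probabilities for one input, `groups` lists, per category, the indices of that category's sub-labels, and `W_inv` is the correction matrix:

import numpy as np

def correct_and_classify(sub_probs, groups, W_inv, strategy):
    if strategy == 1:  # average per category, then correct
        per_class = np.array([sub_probs[g].mean() for g in groups])
        return int(np.argmax(W_inv @ per_class))
    if strategy == 2:  # max per category, then correct
        per_class = np.array([sub_probs[g].max() for g in groups])
        return int(np.argmax(W_inv @ per_class))
    if strategy == 3:  # correct each sub-label first, then average per category
        # With a diagonal W_inv, correcting a sub-label probability is taken
        # here to mean scaling it by its category's diagonal entry (an
        # assumption of this sketch).
        per_class = np.array([(W_inv[c, c] * sub_probs[g]).mean()
                              for c, g in enumerate(groups)])
        return int(np.argmax(per_class))
    raise ValueError("strategy must be 1, 2 or 3")

def pick_best_strategy(train_sub_probs, train_labels, groups, W_inv):
    # Accuracy of each strategy against the training-set labels.
    accuracy = {
        s: np.mean([correct_and_classify(p, groups, W_inv, s) == y
                    for p, y in zip(train_sub_probs, train_labels)])
        for s in (1, 2, 3)
    }
    return max(accuracy, key=accuracy.get)

The strategy returned by pick_best_strategy would then be the one applied to the data to be classified.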
The invention adopts a correction optimization scheme combining label expansion and correction: it eliminates the bias caused by the labels' different occurrence frequencies in the pre-training corpus and, by replacing the empty text with training set samples, corrects the bias introduced by both the label words and the input samples, so that retraining the large model is avoided and the accuracy and stability of downstream tasks are greatly improved. Applied to the CLUE Chinese classification data sets with the pre-trained hundred-billion-parameter model "source 1.0" (Yuan 1.0) loaded, the optimization scheme was tested to improve news classification accuracy by about 5 percentage points (52.09% before correction, 57.47% after), scientific literature subject classification accuracy by about 7 percentage points (39.02% before, 46.57% after), app-description long-text classification accuracy by about 4 percentage points (34.89% before, 38.82% after), and e-commerce product sentiment classification accuracy by about 35 percentage points (51.25% before, 86.88% after).
The existing correction method can only correct the model's bias toward the labels through empty text and cannot correct the bias introduced by the input samples; that is, existing methods calibrate all classes to an unbiased state even though the class distribution in the data set is not uniform. The invention replaces the empty text with training set samples to optimize the correction algorithm, so that the model can be corrected according to the data distribution; at the same time, expanding the labels eliminates the bias caused by the labels' different occurrence frequencies in the pre-training corpus.
Based on the same inventive concept, according to another aspect of the present invention, an embodiment of the present invention further provides a classification result correction system 400, as shown in fig. 3, including:
a construction module 401 configured to construct a data set and label each data in the data set with a classification label of a corresponding category;
a calculation module 402 configured to input each data in the data set into a trained model to obtain a probability of the corresponding class label and calculate a correction matrix using the class label probability corresponding to each data;
an expansion module 403 configured to expand the classification label of each category into a plurality of sub-labels;
an adjusting module 404 configured to adjust the output of the trained model to probabilities of a plurality of sub-labels corresponding to each category;
an input module 405 configured to input data to be classified into a trained model to obtain probabilities of a plurality of sub-labels corresponding to each class;
a correction module 406 configured to determine a final class of the data to be classified by using the probability of the plurality of sub-labels corresponding to each class and the correction matrix.
In some embodiments, the calculation module 402 is further configured to:
sum and average the classification label probabilities by category to obtain the probability corresponding to each category;
normalize the probabilities corresponding to the categories and construct a diagonal matrix from them;
and invert the diagonal matrix to obtain the correction matrix.
In some embodiments, the expansion module 403 is further configured to:
obtain a plurality of near-synonyms corresponding to the classification label of each category by using a preset model;
and screen a preset number of words from the plurality of near-synonyms corresponding to each category as the plurality of sub-labels corresponding to that category.
In some embodiments, the expansion module 403 is further configured to:
delete, from the plurality of near-synonyms, the words that are not present in the vocabulary of the trained model;
adjust the output of the trained model to the probabilities of the remaining near-synonyms;
input each data in the data set into the trained model to obtain the probabilities of the remaining near-synonyms;
delete, according to the probabilities of the remaining near-synonyms output by the trained model, the words whose probability is lower than a first threshold;
and, among remaining near-synonyms whose probabilities differ by less than a second threshold, keep only the word with the highest probability, and then select the preset number of words with the highest probabilities as the plurality of sub-labels corresponding to each category.
In some embodiments, the correction module 406 is further configured to:
average the probabilities of the plurality of sub-labels by category, multiply the average corresponding to each category by the correction matrix to obtain a corrected first probability, and take the category with the maximum first probability as the classification category of the data.
In some embodiments, the correction module 406 is further configured to:
multiply the maximum of the probabilities of the plurality of sub-labels corresponding to each category by the correction matrix to obtain a corrected second probability, and take the category corresponding to the sub-label with the maximum second probability as a second classification category of the data.
In some embodiments, the correction module 406 is further configured to:
multiply the probability of each of the plurality of sub-labels corresponding to each category by the correction matrix, average the corrected probabilities by category as a corrected third probability, and take the category with the maximum third probability as a third classification category of the data.
Based on the same inventive concept, according to another aspect of the present invention, as shown in fig. 4, an embodiment of the present invention further provides a computer apparatus 501, comprising:
at least one processor 520; and
a memory 510, the memory 510 storing a computer program 511 executable on the processor, the processor 520 executing the program to perform the steps of:
s1, constructing a data set and labeling each data in the data set with a classification label of a corresponding category;
s2, inputting each data in the data set into a trained model to obtain the probability of the corresponding classification label and calculating a correction matrix by using the probability of the classification label corresponding to each data;
s3, expanding the classification label of each category into a plurality of sub-labels;
s4, adjusting the output of the trained model to be the probability of a plurality of sub-labels corresponding to each category;
s5, inputting the data to be classified into the trained model to obtain the probability of a plurality of sub-labels corresponding to each class;
s6, determining the final category of the data to be classified by using the probability of the plurality of sub-labels corresponding to each category and the correction matrix.
In some embodiments, inputting each data in the data set into a trained model to obtain the probability of the corresponding classification label and calculating a correction matrix using the classification label probability corresponding to each data, further comprises:
summing and averaging the classification label probabilities by category to obtain the probability corresponding to each category;
normalizing the probabilities corresponding to the categories and constructing a diagonal matrix from them;
and inverting the diagonal matrix to obtain the correction matrix.
In some embodiments, expanding the classification label of each category into a plurality of sub-labels further comprises:
obtaining a plurality of near-synonyms corresponding to the classification label of each category by using a preset model;
and screening a preset number of words from the plurality of near-synonyms corresponding to each category as the plurality of sub-labels corresponding to that category.
In some embodiments, screening a preset number of words from the plurality of near-synonyms corresponding to each category as the plurality of sub-labels corresponding to that category further comprises:
deleting, from the plurality of near-synonyms, the words that are not present in the vocabulary of the trained model;
adjusting the output of the trained model to the probabilities of the remaining near-synonyms;
inputting each data in the data set into the trained model to obtain the probabilities of the remaining near-synonyms;
deleting, according to the probabilities of the remaining near-synonyms output by the trained model, the words whose probability is lower than a first threshold;
and, among remaining near-synonyms whose probabilities differ by less than a second threshold, keeping only the word with the highest probability, and then selecting the preset number of words with the highest probabilities as the plurality of sub-labels corresponding to each category.
In some embodiments, determining the final category of the data to be classified by using the probabilities of the plurality of sub-labels corresponding to each category and the correction matrix further includes:
averaging the probabilities of the plurality of sub-labels by category, multiplying the average corresponding to each category by the correction matrix to obtain a corrected first probability, and taking the category with the maximum first probability as the classification category of the data.
In some embodiments, determining the final category of the data to be classified by using the probabilities of the plurality of sub-labels corresponding to each category and the correction matrix further includes:
multiplying the maximum of the probabilities of the plurality of sub-labels corresponding to each category by the correction matrix to obtain a corrected second probability, and taking the category corresponding to the sub-label with the maximum second probability as a second classification category of the data.
In some embodiments, determining the final category of the data to be classified by using the probabilities of the plurality of sub-labels corresponding to each category and the correction matrix further includes:
multiplying the probability of each of the plurality of sub-labels corresponding to each category by the correction matrix, averaging the corrected probabilities by category as a corrected third probability, and taking the category with the maximum third probability as a third classification category of the data.
Based on the same inventive concept, according to another aspect of the present invention, as shown in fig. 5, an embodiment of the present invention further provides a computer-readable storage medium 601, where the computer-readable storage medium 601 stores computer program instructions 610, and the computer program instructions 610, when executed by a processor, perform the following steps:
s1, constructing a data set and labeling each data in the data set with a classification label of a corresponding category;
s2, inputting each data in the data set into a trained model to obtain the probability of the corresponding classification label and calculating a correction matrix by using the probability of the classification label corresponding to each data;
s3, expanding the classification label of each category into a plurality of sub-labels;
s4, adjusting the output of the trained model to be the probability of a plurality of sub-labels corresponding to each category;
s5, inputting the data to be classified into the trained model to obtain the probability of a plurality of sub-labels corresponding to each class;
s6, determining the final category of the data to be classified by using the probability of the plurality of sub-labels corresponding to each category and the correction matrix.
In some embodiments, inputting each data in the data set into a trained model to obtain the probability of the corresponding classification label and calculating a correction matrix using the classification label probability corresponding to each data, further comprises:
summing and averaging the classification label probabilities by category to obtain the probability corresponding to each category;
normalizing the probabilities corresponding to the categories and constructing a diagonal matrix from them;
and inverting the diagonal matrix to obtain the correction matrix.
In some embodiments, expanding the classification label of each category into a plurality of sub-labels further comprises:
obtaining a plurality of near-synonyms corresponding to the classification label of each category by using a preset model;
and screening a preset number of words from the plurality of near-synonyms corresponding to each category as the plurality of sub-labels corresponding to that category.
In some embodiments, screening a preset number of words from the plurality of near-synonyms corresponding to each category as the plurality of sub-labels corresponding to that category further comprises:
deleting, from the plurality of near-synonyms, the words that are not present in the vocabulary of the trained model;
adjusting the output of the trained model to the probabilities of the remaining near-synonyms;
inputting each data in the data set into the trained model to obtain the probabilities of the remaining near-synonyms;
deleting, according to the probabilities of the remaining near-synonyms output by the trained model, the words whose probability is lower than a first threshold;
and, among remaining near-synonyms whose probabilities differ by less than a second threshold, keeping only the word with the highest probability, and then selecting the preset number of words with the highest probabilities as the plurality of sub-labels corresponding to each category.
In some embodiments, determining the final category of the data to be classified by using the probabilities of the plurality of sub-labels corresponding to each category and the correction matrix further includes:
averaging the probabilities of the plurality of sub-labels by category, multiplying the average corresponding to each category by the correction matrix to obtain a corrected first probability, and taking the category with the maximum first probability as the classification category of the data.
In some embodiments, determining the final category of the data to be classified by using the probabilities of the plurality of sub-labels corresponding to each category and the correction matrix further includes:
multiplying the maximum of the probabilities of the plurality of sub-labels corresponding to each category by the correction matrix to obtain a corrected second probability, and taking the category corresponding to the sub-label with the maximum second probability as a second classification category of the data.
In some embodiments, determining the final category of the data to be classified by using the probabilities of the plurality of sub-labels corresponding to each category and the correction matrix further includes:
multiplying the probability of each of the plurality of sub-labels corresponding to each category by the correction matrix, averaging the corrected probabilities by category as a corrected third probability, and taking the category with the maximum third probability as a third classification category of the data.
Finally, it should be noted that, as will be understood by those skilled in the art, all or part of the processes of the methods of the above embodiments may be implemented by a computer program, which may be stored in a computer-readable storage medium, and when executed, may include the processes of the embodiments of the methods described above.
Further, it should be appreciated that the computer-readable storage media (e.g., memory) herein can be either volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as software or hardware depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the disclosed embodiments of the present invention.
The foregoing is an exemplary embodiment of the present disclosure, but it should be noted that various changes and modifications could be made herein without departing from the scope of the present disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the disclosed embodiments described herein need not be performed in any particular order. Furthermore, although elements of the disclosed embodiments of the invention may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.
It should be understood that, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly supports the exception. It should also be understood that "and/or" as used herein is meant to include any and all possible combinations of one or more of the associated listed items.
The numbers of the embodiments disclosed in the embodiments of the present invention are merely for description, and do not represent the merits of the embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, and the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
Those of ordinary skill in the art will understand that the discussion of any embodiment above is only exemplary and is not intended to imply that the scope of the disclosure, including the claims, is limited to these examples; within the idea of the embodiments of the invention, technical features in the above embodiments or in different embodiments may also be combined, and many other variations of the different aspects of the embodiments exist as described above, which are not provided in detail for the sake of brevity. Therefore, any omissions, modifications, substitutions, improvements, and the like made within the spirit and principles of the embodiments of the present invention are intended to be included within the scope of the embodiments of the present invention.

Claims (9)

1. A classification result correction method is characterized by comprising the following steps:
constructing a data set and labeling each data in the data set with a classification label of a corresponding category;
inputting each data in the data set into a trained model to obtain the probability of the corresponding classification label and calculating a correction matrix by using the probability of the classification label corresponding to each data;
expanding the classification label of each category into a plurality of sub-labels;
adjusting the output of the trained model to be the probability of a plurality of sub-labels corresponding to each category;
inputting data to be classified into a trained model to obtain the probability of a plurality of sub-labels corresponding to each class;
determining the final class of the data to be classified by using the probability of the plurality of sub-labels corresponding to each class and the correction matrix;
wherein inputting each data in the data set into a trained model to obtain the probability of the corresponding classification label and calculating a correction matrix using the classification label probability corresponding to each data further comprises:
summing and averaging the classification label probabilities by category to obtain the probability corresponding to each category;
normalizing the probabilities corresponding to the categories and constructing a diagonal matrix from them;
and inverting the diagonal matrix to obtain the correction matrix.
2. The method of claim 1, wherein expanding the classification label of each category into a plurality of sub-labels further comprises:
obtaining a plurality of near-synonyms corresponding to the classification label of each category by using a preset model;
and screening a preset number of words from the plurality of near-synonyms corresponding to each category as the plurality of sub-labels corresponding to that category.
3. The method of claim 2, wherein screening a preset number of words from the plurality of near-synonyms corresponding to each category as the plurality of sub-labels corresponding to that category further comprises:
deleting, from the plurality of near-synonyms, the words that are not present in the vocabulary of the trained model;
adjusting the output of the trained model to the probabilities of the remaining near-synonyms;
inputting each data in the data set into the trained model to obtain the probabilities of the remaining near-synonyms;
deleting, according to the probabilities of the remaining near-synonyms output by the trained model, the words whose probability is lower than a first threshold;
and, among remaining near-synonyms whose probabilities differ by less than a second threshold, keeping only the word with the highest probability, and then selecting the preset number of words with the highest probabilities as the plurality of sub-labels corresponding to each category.
4. The method of claim 1, wherein determining the final category of the data to be classified using the probabilities of the plurality of sub-labels corresponding to each category and the correction matrix further comprises:
averaging the probabilities of the plurality of sub-labels by category, multiplying the average corresponding to each category by the correction matrix to obtain a corrected first probability, and taking the category with the maximum first probability as the classification category of the data.
5. The method of claim 1, wherein determining the final category of the data to be classified using the probabilities of the plurality of sub-labels corresponding to each category and the correction matrix further comprises:
multiplying the maximum of the probabilities of the plurality of sub-labels corresponding to each category by the correction matrix to obtain a corrected second probability, and taking the category corresponding to the sub-label with the maximum second probability as a second classification category of the data.
6. The method of claim 1, wherein determining the final category of the data to be classified using the probabilities of the plurality of sub-labels corresponding to each category and the correction matrix further comprises:
multiplying the probability of each of the plurality of sub-labels corresponding to each category by the correction matrix, averaging the corrected probabilities by category as a corrected third probability, and taking the category with the maximum third probability as a third classification category of the data.
7. A classification result correction system, comprising:
the building module is configured to build a data set and label each data in the data set with a classification label of a corresponding category;
a calculation module configured to input each data in the data set into a trained model to obtain a probability of the corresponding classification label and calculate a correction matrix using the probability of the classification label corresponding to each data;
an expansion module configured to expand the classification label of each category into a plurality of sub-labels;
an adjustment module configured to adjust an output of the trained model to probabilities of a plurality of sub-labels corresponding to each category;
the input module is configured to input the data to be classified into the trained model to obtain the probability of a plurality of sub-labels corresponding to each class;
the correction module is configured to determine a final class of the data to be classified by using the probability of the plurality of sub-labels corresponding to each class and the correction matrix;
the calculation module is further configured to sum and average the classification label probabilities by category to obtain the probability corresponding to each category;
normalize the probabilities corresponding to the categories and construct a diagonal matrix from them;
and invert the diagonal matrix to obtain the correction matrix.
8. A computer device, comprising:
at least one processor; and
memory storing a computer program operable on the processor, characterized in that the processor executes the program to perform the steps of the method according to any of claims 1-6.
9. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, is adapted to carry out the steps of the method according to any one of claims 1-6.
CN202210133548.8A 2022-02-14 2022-02-14 Classification result correction method, system, device and medium Active CN114186065B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210133548.8A CN114186065B (en) 2022-02-14 2022-02-14 Classification result correction method, system, device and medium
PCT/CN2022/122302 WO2023151284A1 (en) 2022-02-14 2022-09-28 Classification result correction method and system, device, and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210133548.8A CN114186065B (en) 2022-02-14 2022-02-14 Classification result correction method, system, device and medium

Publications (2)

Publication Number Publication Date
CN114186065A (en) 2022-03-15
CN114186065B (en) 2022-05-17

Family

ID=80545885

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210133548.8A Active CN114186065B (en) 2022-02-14 2022-02-14 Classification result correction method, system, device and medium

Country Status (2)

Country Link
CN (1) CN114186065B (en)
WO (1) WO2023151284A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114186065B (en) * 2022-02-14 2022-05-17 苏州浪潮智能科技有限公司 Classification result correction method, system, device and medium

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2019185551A (en) * 2018-04-13 2019-10-24 株式会社Preferred Networks Annotation added text data expanding method, annotation added text data expanding program, annotation added text data expanding apparatus, and training method of text classification model
CN111382248B (en) * 2018-12-29 2023-05-23 深圳市优必选科技有限公司 Question replying method and device, storage medium and terminal equipment
CN110232397A (en) * 2019-04-22 2019-09-13 广东工业大学 A kind of multi-tag classification method of combination supporting vector machine and projection matrix
CN110490849A (en) * 2019-08-06 2019-11-22 桂林电子科技大学 Surface Defects in Steel Plate classification method and device based on depth convolutional neural networks
CN111326148B (en) * 2020-01-19 2021-02-23 北京世纪好未来教育科技有限公司 Confidence correction and model training method, device, equipment and storage medium thereof
CN111460150B (en) * 2020-03-27 2023-11-10 北京小米松果电子有限公司 Classification model training method, classification method, device and storage medium
EP3913538A1 (en) * 2020-05-20 2021-11-24 Robert Bosch GmbH Classification model calibration
CN113987136A (en) * 2021-11-29 2022-01-28 沈阳东软智能医疗科技研究院有限公司 Method, device and equipment for correcting text classification label and storage medium
CN114186065B (en) * 2022-02-14 2022-05-17 苏州浪潮智能科技有限公司 Classification result correction method, system, device and medium

Also Published As

Publication number Publication date
WO2023151284A1 (en) 2023-08-17
CN114186065A (en) 2022-03-15

Similar Documents

Publication Publication Date Title
US11081105B2 (en) Model learning device, method and recording medium for learning neural network model
US11080492B2 (en) Method and device for correcting error in text
CN105988990B (en) Chinese zero-reference resolution device and method, model training method and storage medium
US20210193161A1 (en) Acoustic model training method, speech recognition method, acoustic model training apparatus, speech recognition apparatus, acoustic model training program, and speech recognition program
CN110930993B (en) Specific domain language model generation method and voice data labeling system
US20200210520A1 (en) Determination of field types in tabular data
JP2015075706A (en) Error correction model learning device and program
KR20190133624A (en) A method and system for context sensitive spelling error correction using realtime candidate generation
TWI567569B (en) Natural language processing systems, natural language processing methods, and natural language processing programs
CN109522550B (en) Text information error correction method and device, computer equipment and storage medium
CN114186065B (en) Classification result correction method, system, device and medium
US11074406B2 (en) Device for automatically detecting morpheme part of speech tagging corpus error by using rough sets, and method therefor
CN114492363A (en) Small sample fine adjustment method, system and related device
CN112861521B (en) Speech recognition result error correction method, electronic device and storage medium
WO2018062265A1 (en) Acoustic model learning device, method therefor, and program
US20210406464A1 (en) Skill word evaluation method and device, electronic device, and non-transitory computer readable storage medium
CN110390093B (en) Language model building method and device
CN113963682A (en) Voice recognition correction method and device, electronic equipment and storage medium
US20100296728A1 (en) Discrimination Apparatus, Method of Discrimination, and Computer Program
CN113177405A (en) Method, device and equipment for correcting data errors based on BERT and storage medium
CN112183072A (en) Text error correction method and device, electronic equipment and readable storage medium
CN111797614A (en) Text processing method and device
CN111554295B (en) Text error correction method, related device and readable storage medium
CN111832288B (en) Text correction method and device, electronic equipment and storage medium
CN112651230A (en) Fusion language model generation method and device, word error correction method and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant