WO2023151284A1 - Classification result correction method and system, device, and medium - Google Patents


Info

Publication number
WO2023151284A1
Authority
WO
WIPO (PCT)
Prior art keywords
category
probability
data
synonyms
labels
Prior art date
Application number
PCT/CN2022/122302
Other languages
French (fr)
Chinese (zh)
Inventor
刘红丽
李峰
于彤
周镇镇
Original Assignee
苏州浪潮智能科技有限公司
Priority date
Filing date
Publication date
Application filed by 苏州浪潮智能科技有限公司
Publication of WO2023151284A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms

Abstract

The present application discloses a classification result correction method and system, a device, and a medium. The method comprises the steps: constructing a data set, and labeling each piece of data in the data set with a classification label of a corresponding category; inputting each piece of data in the data set into a trained model to obtain a probability of the corresponding classification label, and calculating a correction matrix by means of the classification label probability corresponding to each piece of data; expanding the classification label of each category into a plurality of sub-labels; adjusting the output of the trained model to be probabilities of the plurality of sub-labels corresponding to each category; inputting data to be classified into the trained model to obtain probabilities of the plurality of sub-labels corresponding to each category; and determining, by means of the probabilities of the plurality of sub-labels corresponding to each category and the correction matrix, the final category of the data to be classified. According to the solution provided in the present application, labels are expanded, and thus, bias caused by different frequencies of occurrence of the labels is eliminated.

Description

Classification result correction method, system, device, and medium
This application claims priority to Chinese patent application No. 202210133548.8, entitled "Classification Result Correction Method, System, Device, and Medium", filed with the China Patent Office on February 14, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of classification, and in particular to a classification result correction method, system, device, and storage medium.
Background
The core capabilities of very large-scale models are zero-shot and few-shot learning; that is, the model does not need to be retrained when facing different application tasks. However, such models pick up bias from the corpus during pre-training, resulting in low accuracy or unstable performance on downstream tasks. An existing solution compensates the biased label words with a content-free input, calibrating them to an unbiased state and reducing the variance across different prompt choices. However, because labels appear with different frequencies in the pre-training corpus, the model develops a preference among prediction results, i.e., the output accuracy is low. Existing correction methods can therefore only correct the model's bias toward the labels and cannot correct the bias introduced by the input samples.
Summary
In view of this, to overcome at least one aspect of the above problems, an embodiment of the present application proposes a classification result correction method, comprising the following steps:
constructing a data set and labeling each piece of data in the data set with a classification label of a corresponding category;
inputting each piece of data in the data set into a trained model to obtain the probability of the corresponding classification label, and calculating a correction matrix from the classification label probability corresponding to each piece of data;
expanding the classification label of each category into a plurality of sub-labels;
adjusting the output of the trained model to be the probabilities of the plurality of sub-labels corresponding to each category;
inputting the data to be classified into the trained model to obtain the probabilities of the plurality of sub-labels corresponding to each category;
determining the final category of the data to be classified from the probabilities of the plurality of sub-labels corresponding to each category and the correction matrix.
In some embodiments, inputting each piece of data in the data set into the trained model to obtain the probability of the corresponding classification label and calculating the correction matrix from the classification label probability corresponding to each piece of data further comprises:
summing and averaging, per category, the probabilities of the classification labels corresponding to each piece of data to obtain the probability corresponding to each category;
normalizing the probability corresponding to each category and constructing a diagonal matrix;
inverting the diagonal matrix to obtain the correction matrix.
In some embodiments, expanding the classification label of each category into a plurality of sub-labels further comprises:
obtaining, with a preset model, a plurality of synonyms corresponding to the classification label of each category;
selecting a preset number of words from the plurality of synonyms corresponding to each category as the plurality of sub-labels corresponding to that category.
In some embodiments, before obtaining, with the preset model, the plurality of synonyms corresponding to the classification label of each category, the method further comprises:
training the preset model on an embedding corpus of a preset number of Chinese words and phrases.
In some embodiments, selecting a preset number of words from the plurality of synonyms corresponding to each category as the plurality of sub-labels corresponding to that category further comprises:
deleting, from the plurality of synonyms, the words that do not exist in the vocabulary of the trained model;
adjusting the output of the trained model to be the probabilities of the remaining synonyms;
inputting each piece of data in the data set into the trained model to obtain the probabilities of the remaining synonyms;
deleting, from the remaining synonyms and according to the probabilities of the remaining synonyms output by the trained model, the words whose probability is lower than a first threshold;
deleting, from the synonyms that then remain, the words whose probability difference is smaller than a second threshold, and selecting the preset number of words with the highest probabilities as the plurality of sub-labels corresponding to each category.
In some embodiments, deleting, from the remaining synonyms and according to the probabilities of the remaining synonyms output by the trained model, the words whose probability is lower than the first threshold further comprises:
classifying, according to the probabilities of the remaining synonyms output by the trained model, the remaining synonyms whose probability is lower than the average as rare words, and deleting the rare words.
In some embodiments, deleting, from the plurality of synonyms, the words that do not exist in the vocabulary of the trained model further comprises:
checking, by traversal, whether each of the plurality of synonyms is in the vocabulary space of the trained model, and deleting the synonyms that are not in the vocabulary space.
In some embodiments, deleting, from the synonyms that then remain, the words whose probability difference is smaller than the second threshold further comprises:
identifying the mutually synonymous words among the synonyms that then remain, and deleting all of them except the one with the highest probability.
In some embodiments, selecting the preset number of words with the highest probabilities as the plurality of sub-labels corresponding to each category further comprises:
sorting, in descending order of probability, the words among the remaining synonyms whose probability difference is smaller than the second threshold, and selecting the top preset number of words as the plurality of sub-labels corresponding to each category.
In some embodiments, the preset model is a word2vec model.
In some embodiments, determining the final category of the data to be classified from the probabilities of the plurality of sub-labels corresponding to each category and the correction matrix further comprises:
calculating, per category, the average of the probabilities of the plurality of sub-labels corresponding to that category, multiplying the average corresponding to each category by the correction matrix to obtain a corrected first probability, and taking the category with the largest first probability as the classification category of the data.
In some embodiments, determining the final category of the data to be classified from the probabilities of the plurality of sub-labels corresponding to each category and the correction matrix further comprises:
multiplying the maximum of the probabilities of the plurality of sub-labels corresponding to each category by the correction matrix to obtain a corrected second probability, and taking the category corresponding to the sub-label with the highest probability as the second classification category of the data.
In some embodiments, determining the final category of the data to be classified from the probabilities of the plurality of sub-labels corresponding to each category and the correction matrix further comprises:
multiplying the probabilities of the plurality of sub-labels corresponding to each category by the correction matrix respectively, then averaging per category to obtain a corrected third probability, and taking the category with the largest third probability as the third classification category of the data.
In some embodiments, the pre-trained model is a PLM (pre-trained language model).
Based on the same inventive concept, according to another aspect of the present application, an embodiment of the present application further provides a classification result correction system, comprising:
a construction module configured to construct a data set and label each piece of data in the data set with a classification label of a corresponding category;
a calculation module configured to input each piece of data in the data set into a trained model to obtain the probability of the corresponding classification label, and to calculate a correction matrix from the classification label probability corresponding to each piece of data;
an expansion module configured to expand the classification label of each category into a plurality of sub-labels;
an adjustment module configured to adjust the output of the trained model to be the probabilities of the plurality of sub-labels corresponding to each category;
an input module configured to input the data to be classified into the trained model to obtain the probabilities of the plurality of sub-labels corresponding to each category;
a correction module configured to determine the final category of the data to be classified from the probabilities of the plurality of sub-labels corresponding to each category and the correction matrix.
Based on the same inventive concept, according to another aspect of the present application, an embodiment of the present application further provides a computer device, comprising:
at least one processor; and
a memory storing a computer program runnable on the processor, wherein the processor, when executing the program, performs the steps of any of the classification result correction methods described above.
Based on the same inventive concept, according to another aspect of the present application, an embodiment of the present application further provides a non-volatile readable storage medium storing a computer program which, when executed by a processor, performs the steps of any of the classification result correction methods described above.
Based on the same inventive concept, according to another aspect of the present application, an embodiment of the present application further provides a computing processing device, comprising:
a memory having computer-readable code stored therein; and
one or more processors, wherein when the computer-readable code is executed by the one or more processors, the computing processing device performs the steps of any of the classification result correction methods described above.
Based on the same inventive concept, according to another aspect of the present application, an embodiment of the present application further provides a computer program product, comprising computer-readable code which, when run on a computing processing device, causes the computing processing device to perform the steps of the classification result correction method according to any of the above.
The present application has at least one of the following beneficial technical effects: the solution proposed in the present application expands the labels, thereby eliminating the bias caused by the different frequencies with which labels appear in the pre-training corpus, and replaces the empty text with training set samples, correcting the bias introduced by the label words and by the input samples at the same time.
Brief Description of the Drawings
To describe the technical solutions in some embodiments of the present application or in the prior art more clearly, the accompanying drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are merely some embodiments of the present application, and a person of ordinary skill in the art may derive other embodiments from these drawings without creative effort.
Fig. 1 is a schematic flowchart of a classification result correction method provided by an embodiment of the present application;
Fig. 2 is a flowchart of label expansion provided by an embodiment of the present application;
Fig. 3 is a schematic structural diagram of a classification result correction system provided by an embodiment of the present application;
Fig. 4 is a schematic structural diagram of a computer device provided by an embodiment of the present application;
Fig. 5 is a schematic structural diagram of a non-volatile readable storage medium provided by an embodiment of the present application;
Fig. 6 schematically shows a block diagram of a computing processing device for performing the method according to the present application; and
Fig. 7 schematically shows a storage unit for holding or carrying program code implementing the method according to the present application.
Detailed Description
To make the objectives, technical solutions, and advantages of the present application clearer, the embodiments of the present application are described in further detail below with reference to specific embodiments and the accompanying drawings.
It should be noted that all uses of "first" and "second" in the embodiments of the present application serve to distinguish two non-identical entities or parameters that share the same name; "first" and "second" are used merely for convenience of expression and should not be understood as limiting the embodiments of the present application, and this will not be explained again for each subsequent embodiment.
According to one aspect of the present application, an embodiment of the present application proposes a classification result correction method which, as shown in Fig. 1, may include the following steps:
S1: constructing a data set and labeling each piece of data in the data set with a classification label of a corresponding category;
S2: inputting each piece of data in the data set into a trained model to obtain the probability of the corresponding classification label, and calculating a correction matrix from the classification label probability corresponding to each piece of data;
S3: expanding the classification label of each category into a plurality of sub-labels;
S4: adjusting the output of the trained model to be the probabilities of the plurality of sub-labels corresponding to each category;
S5: inputting the data to be classified into the trained model to obtain the probabilities of the plurality of sub-labels corresponding to each category;
S6: determining the final category of the data to be classified from the probabilities of the plurality of sub-labels corresponding to each category and the correction matrix.
By expanding the labels, the solution proposed in the present application eliminates the bias caused by the different frequencies with which labels appear in the pre-training corpus, and by replacing the empty text with training set samples it corrects the bias introduced by the label words and by the input samples at the same time.
In some embodiments, step S2, inputting each piece of data in the data set into the trained model to obtain the probability of the corresponding classification label and calculating the correction matrix from the classification label probability corresponding to each piece of data, further comprises:
summing and averaging, per category, the probabilities of the classification labels corresponding to each piece of data to obtain the probability corresponding to each category;
normalizing the probability corresponding to each category and constructing a diagonal matrix;
inverting the diagonal matrix to obtain the correction matrix.
Specifically, after each piece of data in the training set is input into the model (a PLM, i.e., a pre-trained language model), the label probability of the corresponding category is obtained; the label probabilities of each category are then summed and averaged, the results are normalized, a diagonal matrix is constructed from them, and the inverse of that diagonal matrix is the final correction matrix.
For example, after data A is input into the model, the label probability of data A is obtained; this label corresponds to the first category, so summing and averaging the label probabilities of all data of the first category yields the probability corresponding to the first category. The probabilities corresponding to all categories are then normalized, a diagonal matrix is constructed, and the inverse of the diagonal matrix is computed to obtain the final correction matrix.
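The correction-matrix construction described above can be sketched in a few lines of Python. The function name, the pure-Python list-of-lists matrix representation, and the toy probabilities in the usage below are illustrative assumptions, not part of the application; the inverse of a diagonal matrix is simply the diagonal of reciprocals, so no linear-algebra library is needed:

```python
def correction_matrix(label_probs, labels, n_classes):
    """Per-class mean gold-label probability, normalized, placed on a
    diagonal, then inverted -- the correction matrix of step S2."""
    sums = [0.0] * n_classes
    counts = [0] * n_classes
    for p, c in zip(label_probs, labels):
        sums[c] += p
        counts[c] += 1
    # Sum the gold-label probabilities per category and take the mean.
    means = [s / n for s, n in zip(sums, counts)]
    # Normalize so the per-class means sum to 1.
    total = sum(means)
    means = [m / total for m in means]
    # Inverting a diagonal matrix just takes reciprocals of the diagonal.
    return [[1.0 / means[i] if i == j else 0.0 for j in range(n_classes)]
            for i in range(n_classes)]
```

For instance, with gold-label probabilities [0.8, 0.6] for class 0 and [0.2, 0.4] for class 1, the normalized per-class means are 0.7 and 0.3, and the correction matrix is diag(1/0.7, 1/0.3).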
In some embodiments, step S3, expanding the classification label of each category into a plurality of sub-labels, further comprises:
obtaining, with a preset model, a plurality of synonyms corresponding to the classification label of each category;
selecting a preset number of words from the plurality of synonyms corresponding to each category as the plurality of sub-labels corresponding to that category.
In some embodiments, selecting a preset number of words from the plurality of synonyms corresponding to each category as the plurality of sub-labels corresponding to that category further comprises:
deleting, from the plurality of synonyms, the words that do not exist in the vocabulary of the trained model;
adjusting the output of the trained model to be the probabilities of the remaining synonyms;
inputting each piece of data in the data set into the trained model to obtain the probabilities of the remaining synonyms;
deleting, from the remaining synonyms and according to the probabilities of the remaining synonyms output by the trained model, the words whose probability is lower than the first threshold;
deleting, from the synonyms that then remain, the words whose probability difference is smaller than the second threshold, and selecting the preset number of words with the highest probabilities as the plurality of sub-labels corresponding to each category.
Specifically, an extended label-mapping vocabulary can be built from the synonyms output by a word2vec model. For synonym selection, an embedding corpus covering more than 8 million Chinese words and phrases can be used to train the word2vec model so that it captures the correlations between words; the plurality of synonyms output by the word2vec model are then screened to obtain the final plurality of sub-labels.
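Synonym retrieval from an embedding model amounts to a nearest-neighbor search by cosine similarity. The sketch below uses a tiny hand-written embedding table standing in for a trained word2vec model, so it stays self-contained; the words, vectors, and function names are illustrative only:

```python
import math

# Toy embedding table standing in for a word2vec model trained on a large
# word/phrase corpus; the words and vectors are illustrative only.
EMBEDDINGS = {
    "good":      [0.90, 0.10, 0.00],
    "excellent": [0.85, 0.15, 0.05],
    "positive":  [0.80, 0.20, 0.10],
    "bad":       [-0.90, 0.10, 0.00],
    "negative":  [-0.85, 0.20, 0.05],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def nearest_synonyms(label, k):
    """Return the k words closest to the label word in embedding space."""
    target = EMBEDDINGS[label]
    scored = sorted(((w, cosine(target, v)) for w, v in EMBEDDINGS.items()
                     if w != label), key=lambda x: -x[1])
    return [w for w, _ in scored[:k]]
```

With a real word2vec model the same lookup is typically done with the model's built-in most-similar query rather than a hand-rolled scan.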
As shown in Fig. 2, each synonym can first be checked, by traversal, against the model's vocabulary space, and the label-mapping words not in that space are deleted.
It should be noted that every word in the vocabulary space can serve as a label. When a piece of data is input into the model, the model can output the probability of every word in the vocabulary space; since the vocabulary space contains many words, the labels output by the model can be adjusted according to actual needs.
Each piece of data in the training set can then be input into the model to obtain the probabilities of the remaining synonyms, and the synonyms whose probability is below the average are classified as rare words; these rare words make the predicted probabilities inaccurate and therefore need to be deleted.
Next, because synonymous words whose predicted probabilities are very close to one another make the label expansion meaningless, synonyms with nearly equal probability values can be deleted, retaining only the one with the highest predicted probability.
Finally, the top N words of the remaining vocabulary are selected as the final extended label-mapping vocabulary, for example N = min(5, number of label-mapping words after screening).
Through the above process, each label is expanded into N synonyms, eliminating the bias introduced by a single label.
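The screening pipeline above can be sketched as a single function. The function name, the precomputed per-word probabilities, and the tie threshold value are illustrative assumptions; in the application the probabilities come from running the trained model over the labeled data set:

```python
def select_sublabels(candidates, vocab, probs, n_max=5, tie_eps=1e-3):
    """Screen synonym candidates into sub-labels for one category."""
    # 1. Delete candidates absent from the trained model's vocabulary.
    kept = [w for w in candidates if w in vocab]
    # 2. Delete "rare" words: those whose (precomputed) average probability
    #    over the data set falls below the mean of the surviving words.
    avg = sum(probs[w] for w in kept) / len(kept)
    kept = [w for w in kept if probs[w] >= avg]
    # 3. Among words whose probabilities differ by less than tie_eps, keep
    #    only the highest-probability one.
    kept.sort(key=lambda w: -probs[w])
    dedup = []
    for w in kept:
        if not dedup or probs[dedup[-1]] - probs[w] >= tie_eps:
            dedup.append(w)
    # 4. Keep the top N = min(n_max, number of surviving words).
    return dedup[:n_max]
```

A candidate that is out of vocabulary is dropped in step 1; a candidate whose probability nearly ties a stronger one is dropped in step 3, so only distinguishable, frequent words survive as sub-labels.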
For example, after expansion, the label "好评" (positive review) of the first category (e.g., positive) expands to ["好", "正面", "满意", "出色", "棒"] ("good", "positive", "satisfied", "excellent", "great"), and the label "差评" (negative review) of the second category (e.g., negative) expands to ["差", "负面", "失望", "不佳", "糟糕"] ("bad", "negative", "disappointed", "poor", "terrible").
Ideally, all labels would appear with roughly the same frequency in the pre-training corpus. In experiments, however, the frequencies with which labels appear in the corpus were found to differ, giving the model a preference among prediction results. In practice, manually selecting suitable label-mapping words from a vocabulary space of nearly 60,000 entries is very difficult and usually introduces subjective factors; the screening method above is therefore used to expand the label-mapping words.
In some embodiments, determining the final category of the data to be classified from the probabilities of the plurality of sub-labels corresponding to each category and the correction matrix further comprises:
calculating, per category, the average of the probabilities of the plurality of sub-labels corresponding to that category, multiplying the average corresponding to each category by the correction matrix to obtain a corrected first probability, and taking the category with the largest first probability as the classification category of the data.
Specifically, once the correction matrix has been obtained, the data to be classified can be input into the model; the model then outputs the probabilities of the plurality of sub-labels corresponding to each category. The average of these probabilities is computed per category, each category's average is multiplied by the correction matrix to obtain the corrected first probability, and the category with the largest first probability is taken as the classification category of the data.
For example, after data B to be classified is input into the model, the model can output the probabilities of 10 sub-labels, i.e., the probabilities of ["好", "正面", "满意", "出色", "棒"] and of ["差", "负面", "失望", "不佳", "糟糕"]. The probabilities within each group are averaged, each average is multiplied by the correction matrix, and the category whose corrected average is largest is taken as the final classification category of data B.
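This first decision rule can be sketched as follows; the function name is illustrative, `sub_probs` holds one list of sub-label probabilities per category, and `W` is the correction matrix (assumed here as a list of lists):

```python
def classify_mean(sub_probs, W):
    """Average sub-label probabilities per category, multiply the resulting
    vector by the correction matrix W, and return the argmax category."""
    means = [sum(p) / len(p) for p in sub_probs]
    corrected = [sum(W[i][j] * means[j] for j in range(len(means)))
                 for i in range(len(means))]
    return max(range(len(corrected)), key=corrected.__getitem__)
```

With sub-label probabilities [0.6, 0.4] for the first category and [0.3, 0.5] for the second, and a correction matrix diag(1.0, 2.0), the corrected averages are 0.5 and 0.8, so the second category is selected.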
In some embodiments, determining the final category of the data to be classified from the probabilities of the plurality of sub-labels corresponding to each category and the correction matrix further comprises:
multiplying the maximum of the probabilities of the plurality of sub-labels corresponding to each category by the correction matrix to obtain a corrected second probability, and taking the category corresponding to the sub-label with the highest probability as the second classification category of the data.
Specifically, once the correction matrix has been obtained, the data to be classified can be input into the model; the model then outputs the probabilities of the plurality of sub-labels corresponding to each category. The maximum of the sub-label probabilities of each category is multiplied by the correction matrix to obtain the corrected second probability, and the category corresponding to the sub-label with the highest corrected probability is taken as the second classification category of the data.
For example, after data B to be classified is input into the model, the model can output the probabilities of 10 sub-labels, i.e., the probabilities of ["好", "正面", "满意", "出色", "棒"] and of ["差", "负面", "失望", "不佳", "糟糕"]. The maximum probability within each group is multiplied by the correction matrix, and the category with the larger corrected value is taken as the final classification category of data B.
In some embodiments, determining the final category of the data to be classified by using the probabilities of the multiple sub-labels corresponding to each category and the correction matrix further includes:
multiplying the probabilities of the multiple sub-labels corresponding to each category by the correction matrix, averaging the results by category to obtain a corrected third probability, and taking the category with the maximum third probability as the third classification category of the data.
Specifically, after the correction matrix is obtained, the data to be classified can be input into the model; at this time, the output of the model is the probabilities of the multiple sub-labels corresponding to each category. The probabilities of the multiple sub-labels corresponding to each category are then each multiplied by the correction matrix and averaged by category to obtain the corrected third probability, and the category with the maximum third probability is taken as the third classification category of the data.
For example, after the data B to be classified is input into the model, the model can output the probabilities of 10 sub-labels, namely the probabilities of ["good", "positive", "satisfied", "excellent", "great"] and the probabilities of ["poor", "negative", "disappointed", "unsatisfactory", "terrible"]. The probabilities of ["good", "positive", "satisfied", "excellent", "great"] are each multiplied by the correction matrix and then averaged, the probabilities of ["poor", "negative", "disappointed", "unsatisfactory", "terrible"] are each multiplied by the correction matrix and then averaged, and finally the category corresponding to the larger average is taken as the final classification category of the data B.
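The three correction variants described above can be sketched as follows. This is a minimal illustration, not the application's actual implementation: it assumes a NumPy array `sub_probs` of shape (num_categories, num_sublabels) holding the model's sub-label probabilities for one sample, and a square per-category correction matrix `W`; all function names are illustrative.

```python
import numpy as np

def correct_mean_then_scale(sub_probs, W):
    # Variant 1: average the sub-label probabilities per category,
    # then multiply the per-category averages by the correction matrix.
    avg = sub_probs.mean(axis=1)          # shape: (num_categories,)
    first_prob = W @ avg
    return int(np.argmax(first_prob))

def correct_max_then_scale(sub_probs, W):
    # Variant 2: take the maximum sub-label probability per category,
    # then multiply the per-category maxima by the correction matrix.
    mx = sub_probs.max(axis=1)
    second_prob = W @ mx
    return int(np.argmax(second_prob))

def correct_scale_then_mean(sub_probs, W):
    # Variant 3: apply the correction to every sub-label probability
    # first, then average the corrected probabilities per category.
    scaled = W @ sub_probs                # correct each sub-label column
    third_prob = scaled.mean(axis=1)
    return int(np.argmax(third_prob))
```

Note that with the per-category correction written this way, variants 1 and 3 mainly illustrate the order of operations (correct-then-average versus average-then-correct); variant 2 genuinely differs because taking the maximum is not a linear operation.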
In some embodiments, the data in the training set can also be corrected by each of the above three correction methods; the result obtained by each correction method is compared with the training-set labels, the method with the highest accuracy is taken as the optimal correction scheme, and correction is then performed using the optimal scheme.
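The scheme-selection step above can be sketched as follows; this is an illustrative outline rather than the application's actual code. `schemes` maps a variant name to a function that returns a predicted category for one sample's sub-label probabilities, and the variant that best matches the training labels is kept.

```python
def pick_best_scheme(schemes, train_sub_probs, train_labels, W):
    # `schemes`: dict mapping name -> function (sub_probs, W) -> category.
    # Evaluate each correction variant on the training set and return the
    # name of the variant with the highest accuracy.
    best_name, best_acc = None, -1.0
    for name, fn in schemes.items():
        preds = [fn(p, W) for p in train_sub_probs]
        acc = sum(p == y for p, y in zip(preds, train_labels)) / len(train_labels)
        if acc > best_acc:
            best_name, best_acc = name, acc
    return best_name
```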
The present application adopts a correction optimization scheme that combines label expansion with correction, which not only eliminates the bias caused by differing label frequencies in the pre-training corpus, but also corrects the bias introduced by the label words and the input samples by replacing the empty text with training-set samples. This approach allows a very large model to avoid retraining, and greatly improves the accuracy and stability of downstream tasks. The proposed optimization scheme was applied to the CLUE Chinese classification datasets, with the pre-trained 100-billion-parameter model "源1.0" (Yuan 1.0) loaded. In experiments, news classification accuracy improved by about 5 percentage points (52.09% before correction, 57.47% after correction), scientific-literature subject classification accuracy by about 7 percentage points (39.02% before, 46.57% after), long-text application-description classification accuracy by about 4 percentage points (34.89% before, 38.82% after), and e-commerce product sentiment classification accuracy by about 35 percentage points (51.25% before, 86.88% after).
Existing correction methods can only correct the model's bias toward the labels by feeding it empty text; they cannot correct the bias introduced by the input samples. That is, existing methods correct all categories to an unbiased state, even though the category distribution in the data set is not uniform. In the present application, training-set samples are used instead of empty text in the correction algorithm, so that the model can be corrected according to the actual data distribution. Meanwhile, by expanding the labels, the bias caused by differing label frequencies in the pre-training corpus is eliminated.
Based on the same inventive concept, according to another aspect of the present application, an embodiment of the present application further provides a classification result correction system 400, as shown in FIG. 3, including:
a construction module 401, configured to construct a data set and annotate each piece of data in the data set with a classification label of a corresponding category;
a calculation module 402, configured to input each piece of data in the data set into a trained model to obtain the probability of the corresponding classification label, and to calculate a correction matrix by using the classification label probability corresponding to each piece of data;
an expansion module 403, configured to expand the classification label of each category into multiple sub-labels;
an adjustment module 404, configured to adjust the output of the trained model to the probabilities of the multiple sub-labels corresponding to each category;
an input module 405, configured to input the data to be classified into the trained model to obtain the probabilities of the multiple sub-labels corresponding to each category; and
a correction module 406, configured to determine the final category of the data to be classified by using the probabilities of the multiple sub-labels corresponding to each category and the correction matrix.
In some embodiments, the calculation module 402 is further configured to:
sum and average the probabilities of the classification labels corresponding to each piece of data by category to obtain the probability corresponding to each category;
normalize the probability corresponding to each category and construct a diagonal matrix; and
invert the diagonal matrix to obtain the correction matrix.
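The correction-matrix computation just described can be sketched as follows; this is a minimal illustration assuming `label_probs` is a NumPy array in which each row holds, for one training sample, the model's probability of the classification label of each category (columns).

```python
import numpy as np

def build_correction_matrix(label_probs):
    # Sum and average the label probabilities per category over all samples.
    p_mean = label_probs.mean(axis=0)      # shape: (num_categories,)
    # Normalize so the per-category probabilities sum to 1.
    p_norm = p_mean / p_mean.sum()
    # Build the diagonal matrix and invert it to obtain the correction
    # matrix: categories the model favors a priori are scaled down.
    return np.linalg.inv(np.diag(p_norm))
```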
In some embodiments, the expansion module 403 is further configured to:
obtain, by using a preset model, multiple synonyms corresponding to the classification label of each category; and
select a preset number of words from the multiple synonyms corresponding to each category as the multiple sub-labels corresponding to that category.
In some embodiments, the expansion module 403 is further configured to:
delete, from the multiple synonyms, the words that do not exist in the vocabulary of the trained model;
adjust the output of the trained model to the probabilities of the remaining synonyms;
input each piece of data in the data set into the trained model to obtain the probabilities of the remaining synonyms;
delete, according to the probabilities of the remaining synonyms output by the trained model, the words whose probability is lower than a first threshold from the remaining synonyms; and
delete, from the synonyms remaining after that, the words whose probability difference is smaller than a second threshold, and select the preset number of words with the largest probabilities as the multiple sub-labels corresponding to each category.
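The sub-label screening steps above can be sketched as follows. This is an interpretive illustration: the source does not specify what the "probability difference" in the final step is measured against, so here it is read as the gap to an already-selected synonym (words too close in probability to a kept word are assumed redundant); `synonyms` maps each candidate word to its average probability on the training set, and all names are hypothetical.

```python
def screen_sublabels(synonyms, vocab, first_threshold, second_threshold, k):
    # Step 1: delete synonyms not present in the trained model's vocabulary.
    kept = {w: p for w, p in synonyms.items() if w in vocab}
    # Step 2: delete synonyms whose probability is below the first threshold.
    kept = {w: p for w, p in kept.items() if p >= first_threshold}
    # Step 3 (interpretation): walk the remainder in descending probability,
    # dropping words whose probability differs from every already-selected
    # word by less than the second threshold, then keep the top k.
    ordered = sorted(kept.items(), key=lambda kv: kv[1], reverse=True)
    selected = []
    for word, prob in ordered:
        if all(abs(prob - p) >= second_threshold for _, p in selected):
            selected.append((word, prob))
    return [w for w, _ in selected[:k]]
```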
In some embodiments, the correction module 406 is further configured to:
calculate, by category, the average of the probabilities of the multiple sub-labels corresponding to each category, multiply the average corresponding to each category by the correction matrix to obtain a corrected first probability, and take the category with the maximum first probability as the classification category of the data.
In some embodiments, the correction module 406 is further configured to:
multiply the maximum of the probabilities of the multiple sub-labels corresponding to each category by the correction matrix to obtain a corrected second probability, and take the category corresponding to the sub-label with the largest probability as the second classification category of the data.
In some embodiments, the correction module 406 is further configured to:
multiply the probabilities of the multiple sub-labels corresponding to each category by the correction matrix, average the results by category to obtain a corrected third probability, and take the category with the maximum third probability as the third classification category of the data.
Based on the same inventive concept, according to another aspect of the present application, as shown in FIG. 4, an embodiment of the present application further provides a computer device 501, including:
at least one processor 520; and
a memory 510, where the memory 510 stores a computer program 511 executable on the processor, and the processor 520 performs the following steps when executing the program:
S1, constructing a data set and annotating each piece of data in the data set with a classification label of a corresponding category;
S2, inputting each piece of data in the data set into a trained model to obtain the probability of the corresponding classification label, and calculating a correction matrix by using the classification label probability corresponding to each piece of data;
S3, expanding the classification label of each category into multiple sub-labels;
S4, adjusting the output of the trained model to the probabilities of the multiple sub-labels corresponding to each category;
S5, inputting the data to be classified into the trained model to obtain the probabilities of the multiple sub-labels corresponding to each category;
S6, determining the final category of the data to be classified by using the probabilities of the multiple sub-labels corresponding to each category and the correction matrix.
In some embodiments, inputting each piece of data in the data set into the trained model to obtain the probability of the corresponding classification label and calculating the correction matrix by using the classification label probability corresponding to each piece of data further includes:
summing and averaging the probabilities of the classification labels corresponding to each piece of data by category to obtain the probability corresponding to each category;
normalizing the probability corresponding to each category and constructing a diagonal matrix; and
inverting the diagonal matrix to obtain the correction matrix.
In some embodiments, expanding the classification label of each category into multiple sub-labels further includes:
obtaining, by using a preset model, multiple synonyms corresponding to the classification label of each category; and
selecting a preset number of words from the multiple synonyms corresponding to each category as the multiple sub-labels corresponding to that category.
In some embodiments, selecting a preset number of words from the multiple synonyms corresponding to each category as the multiple sub-labels corresponding to that category further includes:
deleting, from the multiple synonyms, the words that do not exist in the vocabulary of the trained model;
adjusting the output of the trained model to the probabilities of the remaining synonyms;
inputting each piece of data in the data set into the trained model to obtain the probabilities of the remaining synonyms;
deleting, according to the probabilities of the remaining synonyms output by the trained model, the words whose probability is lower than a first threshold from the remaining synonyms; and
deleting, from the synonyms remaining after that, the words whose probability difference is smaller than a second threshold, and selecting the preset number of words with the largest probabilities as the multiple sub-labels corresponding to each category.
In some embodiments, determining the final category of the data to be classified by using the probabilities of the multiple sub-labels corresponding to each category and the correction matrix further includes:
calculating, by category, the average of the probabilities of the multiple sub-labels corresponding to each category, multiplying the average corresponding to each category by the correction matrix to obtain a corrected first probability, and taking the category with the maximum first probability as the classification category of the data.
In some embodiments, determining the final category of the data to be classified by using the probabilities of the multiple sub-labels corresponding to each category and the correction matrix further includes:
multiplying the maximum of the probabilities of the multiple sub-labels corresponding to each category by the correction matrix to obtain a corrected second probability, and taking the category corresponding to the sub-label with the largest probability as the second classification category of the data.
In some embodiments, determining the final category of the data to be classified by using the probabilities of the multiple sub-labels corresponding to each category and the correction matrix further includes:
multiplying the probabilities of the multiple sub-labels corresponding to each category by the correction matrix, averaging the results by category to obtain a corrected third probability, and taking the category with the maximum third probability as the third classification category of the data.
Based on the same inventive concept, according to another aspect of the present application, as shown in FIG. 5, an embodiment of the present application further provides a non-volatile readable storage medium 601, where the non-volatile readable storage medium 601 stores computer program instructions 610, and the computer program instructions 610, when executed by a processor, perform the following steps:
S1, constructing a data set and annotating each piece of data in the data set with a classification label of a corresponding category;
S2, inputting each piece of data in the data set into a trained model to obtain the probability of the corresponding classification label, and calculating a correction matrix by using the classification label probability corresponding to each piece of data;
S3, expanding the classification label of each category into multiple sub-labels;
S4, adjusting the output of the trained model to the probabilities of the multiple sub-labels corresponding to each category;
S5, inputting the data to be classified into the trained model to obtain the probabilities of the multiple sub-labels corresponding to each category;
S6, determining the final category of the data to be classified by using the probabilities of the multiple sub-labels corresponding to each category and the correction matrix.
In some embodiments, inputting each piece of data in the data set into the trained model to obtain the probability of the corresponding classification label and calculating the correction matrix by using the classification label probability corresponding to each piece of data further includes:
summing and averaging the probabilities of the classification labels corresponding to each piece of data by category to obtain the probability corresponding to each category;
normalizing the probability corresponding to each category and constructing a diagonal matrix; and
inverting the diagonal matrix to obtain the correction matrix.
In some embodiments, expanding the classification label of each category into multiple sub-labels further includes:
obtaining, by using a preset model, multiple synonyms corresponding to the classification label of each category; and
selecting a preset number of words from the multiple synonyms corresponding to each category as the multiple sub-labels corresponding to that category.
In some embodiments, selecting a preset number of words from the multiple synonyms corresponding to each category as the multiple sub-labels corresponding to that category further includes:
deleting, from the multiple synonyms, the words that do not exist in the vocabulary of the trained model;
adjusting the output of the trained model to the probabilities of the remaining synonyms;
inputting each piece of data in the data set into the trained model to obtain the probabilities of the remaining synonyms;
deleting, according to the probabilities of the remaining synonyms output by the trained model, the words whose probability is lower than a first threshold from the remaining synonyms; and
deleting, from the synonyms remaining after that, the words whose probability difference is smaller than a second threshold, and selecting the preset number of words with the largest probabilities as the multiple sub-labels corresponding to each category.
In some embodiments, determining the final category of the data to be classified by using the probabilities of the multiple sub-labels corresponding to each category and the correction matrix further includes:
calculating, by category, the average of the probabilities of the multiple sub-labels corresponding to each category, multiplying the average corresponding to each category by the correction matrix to obtain a corrected first probability, and taking the category with the maximum first probability as the classification category of the data.
In some embodiments, determining the final category of the data to be classified by using the probabilities of the multiple sub-labels corresponding to each category and the correction matrix further includes:
multiplying the maximum of the probabilities of the multiple sub-labels corresponding to each category by the correction matrix to obtain a corrected second probability, and taking the category corresponding to the sub-label with the largest probability as the second classification category of the data.
In some embodiments, determining the final category of the data to be classified by using the probabilities of the multiple sub-labels corresponding to each category and the correction matrix further includes:
multiplying the probabilities of the multiple sub-labels corresponding to each category by the correction matrix, averaging the results by category to obtain a corrected third probability, and taking the category with the maximum third probability as the third classification category of the data.
Finally, it should be noted that those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by instructing relevant hardware through a computer program; the program can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the above methods.
In addition, it should be understood that the non-volatile readable storage medium (for example, a memory) herein can be a volatile memory or a non-volatile memory, or can include both volatile memory and non-volatile memory.
The various component embodiments of the present application may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art should understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components in the computing processing device according to the embodiments of the present application. The present application may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the methods described herein. Such a program implementing the present application may be stored on a computer-readable medium, or may take the form of one or more signals. Such a signal may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
For example, FIG. 6 shows a computing processing device that can implement the method according to the present application. The computing processing device includes a processor 710 and a computer program product or non-volatile readable storage medium in the form of a memory 720. The memory 720 may be an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read-only memory), an EPROM, a hard disk, or a ROM. The memory 720 has a storage space 730 for program code 731 for performing any of the method steps in the methods described above. For example, the storage space 730 for program code may include respective program codes 731 for implementing the various steps in the above methods. These program codes can be read from or written into one or more computer program products. These computer program products include program code carriers such as hard disks, compact discs (CDs), memory cards, or floppy disks. Such a computer program product is typically a portable or fixed storage unit as described with reference to FIG. 7. The storage unit may have storage segments, storage spaces, and the like arranged similarly to the memory 720 in the computing processing device of FIG. 6. The program code may, for example, be compressed in a suitable form. Typically, the storage unit includes computer-readable code 731', that is, code that can be read by a processor such as the processor 710, which, when run by a computing processing device, causes the computing processing device to perform the steps of the methods described above.
Those skilled in the art will also understand that the various exemplary logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or a combination of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as software or as hardware depends on the specific application and the design constraints imposed on the overall system. Those skilled in the art may implement the described functionality in various ways for each specific application, but such implementation decisions should not be interpreted as causing a departure from the scope disclosed in the embodiments of the present application.
The above are exemplary embodiments disclosed in the present application, but it should be noted that various changes and modifications can be made without departing from the scope of the embodiments disclosed in the present application as defined by the claims. The functions, steps, and/or actions of the method claims according to the disclosed embodiments described herein need not be performed in any particular order. In addition, although the elements disclosed in the embodiments of the present application may be described or claimed in the singular, they may also be understood as plural unless explicitly limited to the singular.
It should be understood that, as used herein, the singular forms "a" and "an" are intended to include the plural forms as well, unless the context clearly supports an exception. It should also be understood that "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
The serial numbers of the embodiments disclosed above in the present application are merely for description and do not represent the relative merits of the embodiments.
Those of ordinary skill in the art can understand that all or part of the steps for implementing the above embodiments can be completed by hardware, or by instructing relevant hardware through a program; the program can be stored in a non-volatile readable storage medium, and the storage medium mentioned above can be a read-only memory, a magnetic disk, an optical disc, or the like.
Those of ordinary skill in the art should understand that the discussion of any of the above embodiments is merely exemplary and is not intended to imply that the scope (including the claims) disclosed by the embodiments of the present application is limited to these examples. Under the idea of the embodiments of the present application, the technical features in the above embodiments or in different embodiments may also be combined, and many other variations of the different aspects of the embodiments of the present application exist as described above, which are not provided in detail for the sake of brevity. Therefore, any omissions, modifications, equivalent replacements, improvements, and the like made within the spirit and principles of the embodiments of the present application shall be included within the protection scope of the embodiments of the present application.

Claims (20)

  1. 一种分类结果校正方法,其中,包括以下步骤:A classification result correction method, which includes the following steps:
    构建数据集并对所述数据集中的每一个数据标注一个对应类别的分类标签;Constructing a data set and marking each data in the data set with a classification label of a corresponding category;
    将所述数据集中的每一个数据输入到已训练模型中以得到对应的所述分类标签的概率并利用每一个数据对应的所述分类标签概率计算校正矩阵;Input each data in the data set into the trained model to obtain the probability of the corresponding classification label and calculate a correction matrix using the classification label probability corresponding to each data;
    将每一个类别的所述分类标签扩展为多个子标签;expanding said classification label for each category into a plurality of sub-labels;
    将所述已训练模型的输出调整为每一个类别对应的多个子标签的概率;adjusting the output of the trained model to the probability of a plurality of sub-labels corresponding to each category;
    将待分类的数据输入到已训练模型中以得到每一个类别对应的多个子标签的概率;Input the data to be classified into the trained model to obtain the probability of multiple sub-labels corresponding to each category;
    利用每一个类别对应的所述多个子标签的概率和所述校正矩阵确定所述待分类数据最终的类别。The final category of the data to be classified is determined by using the probabilities of the plurality of sub-labels corresponding to each category and the correction matrix.
  2. 如权利要求1所述的方法,将所述数据集中的每一个数据输入到已训练模型中以得到对应的所述分类标签的概率并利用每一个数据对应的所述分类标签概率计算校正矩阵,进一步包括:The method according to claim 1, inputting each data in the data set into the trained model to obtain the probability of the corresponding classification label and using the classification label probability corresponding to each data to calculate a correction matrix, Further includes:
    将每一个数据对应的分类标签的概率按类别求和取均值以得到每一个类别对应的概率;The probability of the classification label corresponding to each data is summed and averaged by category to obtain the probability corresponding to each category;
    对每一个类别对应的概率进行归一化处理后构建对角矩阵;After normalizing the probability corresponding to each category, a diagonal matrix is constructed;
    将所述对角矩阵求逆后得到所述校正矩阵。The correction matrix is obtained after inverting the diagonal matrix.
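The correction-matrix computation recited in claims 1 and 2 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: NumPy arrays hold the per-sample classification-label probabilities and integer category labels, and all function and variable names are illustrative, not taken from the application.

```python
import numpy as np

def build_correction_matrix(label_probs, labels, num_classes):
    # Claim-2 steps: per-category mean of the labelled data's
    # classification-label probabilities ...
    avg = np.array([label_probs[labels == c].mean() for c in range(num_classes)])
    avg = avg / avg.sum()               # ... normalize the per-category means ...
    return np.linalg.inv(np.diag(avg))  # ... and invert the diagonal matrix

# toy run: two categories, two samples each (values are made up)
probs = np.array([0.8, 0.6, 0.2, 0.4])
cats  = np.array([0, 0, 1, 1])
M = build_correction_matrix(probs, cats, 2)
```

Because the matrix is diagonal, multiplying a probability vector by it simply rescales each category's score by the inverse of that category's average confidence, counteracting a model's bias toward categories it tends to over-predict.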
  3. 如权利要求1所述的方法,将每一个类别的所述分类标签扩展为多个子标签,进一步包括:The method according to claim 1, expanding the classification label of each category into a plurality of sub-labels, further comprising:
    利用预设模型获取每一个类别的所述分类标签对应的多个近义词;Using a preset model to obtain a plurality of synonyms corresponding to the classification labels of each category;
    分别从每一个类别对应的所述多个近义词中筛选预设数量的词作为每一个类别对应的所述多个子标签。Respectively select a preset number of words from the plurality of synonyms corresponding to each category as the plurality of subtags corresponding to each category.
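Claim 3's label expansion might look like the following sketch. It is hedged: `most_similar` mirrors the gensim `KeyedVectors` interface, `ToyVectors` is a self-contained stand-in for a real word2vec model (such as one trained on a word/phrase corpus per claim 4), and all names and similarity scores are illustrative.

```python
def expand_label(wv, label, top_n=3):
    # wv is assumed to follow the gensim KeyedVectors interface:
    # most_similar returns [(word, similarity), ...] sorted by similarity
    return [word for word, _ in wv.most_similar(label, topn=top_n)]

# tiny stand-in model so the sketch runs without external vectors
class ToyVectors:
    def most_similar(self, word, topn):
        table = {"sport": [("athletics", 0.91), ("game", 0.85), ("fitness", 0.72)]}
        return table[word][:topn]

candidates = expand_label(ToyVectors(), "sport", top_n=2)  # candidate sub-labels
```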
  4. 根据权利要求3所述的方法,在所述利用预设模型获取每一个类别的所述分类标签对应的多个近义词之前,还包括:The method according to claim 3, before said using a preset model to obtain a plurality of synonyms corresponding to the classification labels of each category, further comprising:
    利用预设数量的中文单词短语的嵌入语料库,训练得到所述预设模型。The preset model is obtained through training with a preset number of embedded corpora of Chinese word phrases.
  5. 如权利要求3所述的方法,分别从每一个类别对应的所述多个近义词中筛选预设数量的词作为每一个类别对应的所述多个子标签,进一步包括:The method according to claim 3, respectively screening a preset number of words from the plurality of synonyms corresponding to each category as the plurality of subtags corresponding to each category, further comprising:
    将所述多个近义词中不存在于所述已训练模型的词表中的词删除；Deleting, from the plurality of synonyms, the words that do not exist in the vocabulary of the trained model;
    将所述已训练模型的输出调整为剩余的近义词的概率;adjusting the output of the trained model to the probabilities of the remaining synonyms;
    将所述数据集中的每一个数据输入到已训练模型中以得到所述剩余的近义词的概率;Input each data in the data set into the trained model to obtain the probability of the remaining synonyms;
    根据所述已训练模型输出的所述剩余的近义词的概率将所述剩余的近义词中概率低于第一阈值的词删除;deleting the words whose probability is lower than the first threshold among the remaining synonyms according to the probability of the remaining synonyms output by the trained model;
    将再次剩余的近义词中概率差值小于第二阈值的词删除并选择概率最大的预设数量的词作为每一个类别对应的所述多个子标签。Among the remaining synonyms, the words whose probability difference is smaller than the second threshold are deleted, and a preset number of words with the highest probability are selected as the plurality of sub-labels corresponding to each category.
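The four-stage filtering recited in claim 5 can be sketched as below. The data structures and toy numbers are assumptions for illustration; the claim itself does not fix them.

```python
def filter_sublabels(synonyms, vocab, probs, min_prob, eps, k):
    kept = [w for w in synonyms if w in vocab]        # step 1: drop words not in the
                                                      #   trained model's vocabulary
    kept = [w for w in kept if probs[w] >= min_prob]  # step 2: drop words below the
                                                      #   first threshold ("rare" words)
    kept.sort(key=lambda w: probs[w], reverse=True)
    out = []
    for w in kept:                                    # step 3: if two words' probabilities
        if all(abs(probs[w] - probs[v]) >= eps        #   differ by less than the second
               for v in out):                         #   threshold, keep only the more
            out.append(w)                             #   probable one
    return out[:k]                                    # step 4: top-k become the sub-labels

subs = filter_sublabels(
    ["good", "great", "nice", "bogus", "rare"],
    vocab={"good", "great", "nice", "rare"},
    probs={"good": 0.50, "great": 0.49, "nice": 0.30, "rare": 0.05},
    min_prob=0.10, eps=0.05, k=2)
```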
  6. 根据权利要求5所述的方法，所述根据所述已训练模型输出的所述剩余的近义词的概率将所述剩余的近义词中概率低于第一阈值的词删除，进一步包括：The method according to claim 5, wherein deleting, according to the probabilities of the remaining synonyms output by the trained model, the words whose probability is lower than the first threshold among the remaining synonyms further comprises:
    根据所述已训练模型输出的所述剩余的近义词的概率，将所述剩余的近义词中概率低于平均值的近义词划分为稀有词，并删除所述稀有词。According to the probabilities of the remaining synonyms output by the trained model, classifying the synonyms whose probability is lower than the average value among the remaining synonyms as rare words, and deleting the rare words.
  7. 根据权利要求5所述的方法，所述将所述多个近义词中不存在于所述已训练模型的词表中的词删除，进一步包括：The method according to claim 5, wherein deleting, from the plurality of synonyms, the words that do not exist in the vocabulary of the trained model further comprises:
    采用遍历方式查看所述多个近义词中每个近义词是否在所述已训练模型的词表空间,并删除不在所述词表空间内的近义词。Checking whether each of the plurality of synonyms is in the vocabulary space of the trained model in a traversal manner, and deleting synonyms that are not in the vocabulary space.
  8. 根据权利要求5所述的方法，所述将再次剩余的近义词中概率差值小于第二阈值的词删除，进一步包括：The method according to claim 5, wherein deleting the words whose probability difference is smaller than the second threshold among the again-remaining synonyms further comprises:
    获取再次剩余的近义词中的同义词,删除同义词中除概率最大的同义词外的其它同义词。Obtain the synonyms among the remaining synonyms, and delete the synonyms except the one with the highest probability.
  9. 根据权利要求5所述的方法,所述选择概率最大的预设数量的词作为每一个类别对应的所述多个子标签,进一步包括:The method according to claim 5, said selecting a preset number of words with the highest probability as the plurality of sub-tags corresponding to each category, further comprising:
    按照概率从大到小的顺序对再次剩余的近义词中概率差值小于第二阈值的词进行排序,并选择排序在前预设数量的词作为每一个类别对应的所述多个子标签。Rank the words whose probability difference is smaller than the second threshold among the remaining synonyms in descending order of probability, and select a preset number of words ranked first as the plurality of sub-labels corresponding to each category.
  10. 根据权利要求3至9任一项所述的方法，所述预设模型为word2vec模型。The method according to any one of claims 3 to 9, wherein the preset model is a word2vec model.
  11. 如权利要求1所述的方法,利用每一个类别对应的所述多个子标签的概率和所述校正矩阵确定所述待分类数据最终的类别,进一步包括:The method according to claim 1, using the probability of the plurality of sub-labels corresponding to each category and the correction matrix to determine the final category of the data to be classified, further comprising:
    按类别计算每一个类别对应的多个子标签的概率的平均值并将每一个类别对应的平均值乘以所述校正矩阵后作为校正后的第一概率，并将每一个类别的第一概率中的最大值作为数据的分类类别。For each category, calculating the average of the probabilities of the multiple sub-labels corresponding to that category, multiplying the average corresponding to each category by the correction matrix to obtain the corrected first probability, and taking the category whose first probability is the largest as the classification category of the data.
  12. 如权利要求1所述的方法,利用每一个类别对应的所述多个子标签的概率和所述校正矩阵确定所述待分类数据最终的类别,进一步包括:The method according to claim 1, using the probability of the plurality of sub-labels corresponding to each category and the correction matrix to determine the final category of the data to be classified, further comprising:
    将每一个类别对应的多个子标签的概率中的最大值乘以所述校正矩阵后作为校正后的第二概率,并将概率最大的子标签对应的类别作为数据的第二分类类别。The maximum value among the probabilities of multiple sub-labels corresponding to each category is multiplied by the correction matrix as the corrected second probability, and the category corresponding to the sub-label with the highest probability is used as the second classification category of the data.
  13. 如权利要求1所述的方法,利用每一个类别对应的所述多个子标签的概率和所述校正矩阵确定所述待分类数据最终的类别,进一步包括:The method according to claim 1, using the probability of the plurality of sub-labels corresponding to each category and the correction matrix to determine the final category of the data to be classified, further comprising:
    将每一个类别对应的多个子标签的概率分别乘以所述校正矩阵后按类别取平均值以作为校正后的第三概率，并将每一个类别的第三概率中的最大值作为数据的第三分类类别。Multiplying the probabilities of the multiple sub-labels corresponding to each category by the correction matrix respectively, then averaging them by category to obtain the corrected third probability, and taking the category whose third probability is the largest as the third classification category of the data.
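Claims 11 to 13 describe three alternative ways of combining the per-category sub-label probabilities with the correction matrix; the sketch below (illustrative names, toy values) makes the differences concrete.

```python
import numpy as np

def corrected_category(sub_probs, M, strategy):
    # sub_probs: rows = categories, columns = that category's sub-labels
    if strategy == "avg_then_correct":      # claim 11: average, then correct
        scores = M @ sub_probs.mean(axis=1)
    elif strategy == "max_then_correct":    # claim 12: take the max, then correct
        scores = M @ sub_probs.max(axis=1)
    else:                                   # claim 13: correct each sub-label
        scores = (M @ sub_probs).mean(axis=1)  # probability, then average
    return int(np.argmax(scores))

M = np.diag([2.0, 1.0])    # toy correction matrix
P = np.array([[0.2, 0.2],  # category 0's two sub-label probabilities
              [0.5, 0.1]]) # category 1's two sub-label probabilities
```

By linearity, averaging over sub-labels commutes with the matrix multiplication, so the claim-11 and claim-13 strategies always yield the same scores, while the claim-12 max-based strategy can select a different category, as the toy values show.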
  14. 根据权利要求1至13任一项所述的方法，所述预训练模型为PLM模型。The method according to any one of claims 1 to 13, wherein the pre-trained model is a PLM model.
  15. 一种分类结果校正系统,其中,包括:A classification result correction system, including:
    构建模块,配置为构建数据集并对所述数据集中的每一个数据标注一个对应类别的分类标签;A building module configured to construct a data set and mark each data in the data set with a classification label of a corresponding category;
    计算模块,配置为将所述数据集中的每一个数据输入到已训练模型中以得到对应的所述分类标签的概率并利用每一个数据对应的所述分类标签概率计算校正矩阵;A calculation module configured to input each data in the data set into the trained model to obtain the probability of the corresponding classification label and calculate a correction matrix using the classification label probability corresponding to each data;
    扩展模块,配置为将每一个类别的所述分类标签扩展为多个子标签;An expansion module configured to expand the classification label of each category into a plurality of sub-labels;
    调整模块,配置为将所述已训练模型的输出调整为每一个类别对应的多个子标签的概率;An adjustment module configured to adjust the output of the trained model to the probability of multiple sub-labels corresponding to each category;
    输入模块,配置为将待分类的数据输入到已训练模型中以得到每一个类别对应的多个子标签的概率;The input module is configured to input the data to be classified into the trained model to obtain the probability of a plurality of sub-labels corresponding to each category;
    校正模块,配置为利用每一个类别对应的所述多个子标签的概率和所述校正矩阵确定所述待分类数据最终的类别。The correction module is configured to determine the final category of the data to be classified by using the probabilities of the plurality of sub-labels corresponding to each category and the correction matrix.
  16. 根据权利要求15所述的系统，所述计算模块还配置为：The system according to claim 15, wherein the calculation module is further configured to:
    将每一个数据对应的分类标签的概率按类别求和取均值以得到每一个类别对应的概率;The probability of the classification label corresponding to each data is summed and averaged by category to obtain the probability corresponding to each category;
    对每一个类别对应的概率进行归一化处理后构建对角矩阵;After normalizing the probability corresponding to each category, a diagonal matrix is constructed;
    将所述对角矩阵求逆后得到所述校正矩阵。The correction matrix is obtained after inverting the diagonal matrix.
  17. 一种计算机设备,包括:A computer device comprising:
    至少一个处理器;以及at least one processor; and
    存储器,所述存储器存储有可在所述处理器上运行的计算机程序,其中,所述处理器执行所述程序时执行如权利要求1-14任意一项所述的分类结果校正方法的步骤。A memory, the memory stores a computer program that can run on the processor, wherein, when the processor executes the program, the steps of the classification result correction method according to any one of claims 1-14 are executed.
  18. 一种非易失性可读存储介质，所述非易失性可读存储介质存储有计算机程序，其中，所述计算机程序被处理器执行时执行如权利要求1-14任意一项所述的分类结果校正方法的步骤。A non-volatile readable storage medium storing a computer program, wherein, when the computer program is executed by a processor, the steps of the classification result correction method according to any one of claims 1-14 are executed.
  19. 一种计算处理设备,包括:A computing processing device comprising:
    存储器,其中存储有计算机可读代码;a memory having computer readable code stored therein;
    一个或多个处理器,当所述计算机可读代码被所述一个或多个处理器执行时,所述计算处理设备执行权利要求1-14任意一项所述的分类结果校正方法的步骤。One or more processors, when the computer readable code is executed by the one or more processors, the computing processing device executes the steps of the method for correcting classification results according to any one of claims 1-14.
  20. 一种计算机程序产品，包括计算机可读代码，当所述计算机可读代码在计算处理设备上运行时，导致所述计算处理设备执行根据权利要求1-14任意一项所述的分类结果校正方法的步骤。A computer program product comprising computer readable code which, when run on a computing processing device, causes the computing processing device to execute the steps of the classification result correction method according to any one of claims 1-14.
PCT/CN2022/122302 2022-02-14 2022-09-28 Classification result correction method and system, device, and medium WO2023151284A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210133548.8 2022-02-14
CN202210133548.8A CN114186065B (en) 2022-02-14 2022-02-14 Classification result correction method, system, device and medium

Publications (1)

Publication Number Publication Date
WO2023151284A1 true WO2023151284A1 (en) 2023-08-17

Family

ID=80545885

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/122302 WO2023151284A1 (en) 2022-02-14 2022-09-28 Classification result correction method and system, device, and medium

Country Status (2)

Country Link
CN (1) CN114186065B (en)
WO (1) WO2023151284A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114186065B (en) * 2022-02-14 2022-05-17 苏州浪潮智能科技有限公司 Classification result correction method, system, device and medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190317986A1 (en) * 2018-04-13 2019-10-17 Preferred Networks, Inc. Annotated text data expanding method, annotated text data expanding computer-readable storage medium, annotated text data expanding device, and text classification model training method
CN111326148A (en) * 2020-01-19 2020-06-23 北京世纪好未来教育科技有限公司 Confidence correction and model training method, device, equipment and storage medium thereof
CN111460150A (en) * 2020-03-27 2020-07-28 北京松果电子有限公司 Training method, classification method and device of classification model and storage medium
CN113723438A (en) * 2020-05-20 2021-11-30 罗伯特·博世有限公司 Classification model calibration
CN113987136A (en) * 2021-11-29 2022-01-28 沈阳东软智能医疗科技研究院有限公司 Method, device and equipment for correcting text classification label and storage medium
CN114186065A (en) * 2022-02-14 2022-03-15 苏州浪潮智能科技有限公司 Classification result correction method, system, device and medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111382248B (en) * 2018-12-29 2023-05-23 深圳市优必选科技有限公司 Question replying method and device, storage medium and terminal equipment
CN110232397A (en) * 2019-04-22 2019-09-13 广东工业大学 A kind of multi-tag classification method of combination supporting vector machine and projection matrix
CN110490849A (en) * 2019-08-06 2019-11-22 桂林电子科技大学 Surface Defects in Steel Plate classification method and device based on depth convolutional neural networks

Also Published As

Publication number Publication date
CN114186065A (en) 2022-03-15
CN114186065B (en) 2022-05-17

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22925642

Country of ref document: EP

Kind code of ref document: A1