CN114330618A - Pseudo label-based two-class label data optimization method, device and medium - Google Patents


Info

Publication number
CN114330618A
Authority
CN
China
Prior art keywords
label
data
optimization
tag
sets
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111663474.0A
Other languages
Chinese (zh)
Other versions
CN114330618B (en)
Inventor
陈英鹏
许野平
刘辰飞
张朝瑞
席道亮
高朋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Synthesis Electronic Technology Co Ltd
Original Assignee
Synthesis Electronic Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Synthesis Electronic Technology Co Ltd filed Critical Synthesis Electronic Technology Co Ltd
Priority to CN202111663474.0A priority Critical patent/CN114330618B/en
Publication of CN114330618A publication Critical patent/CN114330618A/en
Application granted granted Critical
Publication of CN114330618B publication Critical patent/CN114330618B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00: Road transport of goods or passengers
    • Y02T10/10: Internal combustion engine [ICE] based vehicles
    • Y02T10/40: Engine management systems

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application discloses a pseudo label-based two-class label data optimization method, device and medium, which address the following technical problem: how to effectively identify error label data in two-class label data using pseudo-label technology. The method comprises: dividing a two-class label data set to be optimized into a preset number of optimization sets; determining a preset number of training sets based on the preset number of optimization sets, and training the model to be trained on each training set to obtain a preset number of initial models; verifying, with each initial model, its corresponding optimization set to determine a type prediction score for each piece of two-class label data in that set, wherein the corresponding optimization set is the optimization set that was not used when that initial model was trained; and determining, through a preset evaluation rule based on the type prediction score, whether the corresponding two-class label data is error label data. In this way, error label data in the two-class label data can be effectively identified.

Description

Pseudo label-based two-class label data optimization method, device and medium
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a method, equipment and medium for optimizing data of two-class labels based on pseudo labels.
Background
In recent years, convolutional neural networks have made major breakthroughs in various computer vision tasks, but their designs have become increasingly complex, and because of this complexity a large amount of data is required to train them into converged recognition models. Collecting a large amount of unlabeled data and labeling it for training is laborious, and labeling errors occur easily.
However, in real application scenarios the validity of the labels is often not considered, and a large body of manually labeled data usually contains a small number of labeling errors, which seriously affect the training accuracy of the recognition model. As the technology develops, semi-supervised and unsupervised learning are attracting more and more attention. Pseudo-label technology is a semi-supervised learning technique that uses a model trained on labeled data to predict unlabeled data, so that samples can be screened according to the prediction results to optimize the sample structure. This also provides an opportunity for identifying mislabeled items in labeled data.
Among label data, two-class label data, that is, label data whose label type is only "0" or "1", is very widely used. How to effectively identify error label data in two-class label data based on pseudo-label technology has become a problem to be solved urgently.
Disclosure of Invention
The embodiment of the application provides a method, equipment and a medium for optimizing data of a two-class label based on a pseudo label, which are used for solving the following technical problems: how to effectively identify the error label data in the two-classification label data based on the pseudo label technology.
In a first aspect, an embodiment of the present application provides a pseudo label-based two-class label data optimization method, the method including: dividing a two-class label data set to be optimized into a preset number of optimization sets; determining a preset number of training sets based on the preset number of optimization sets, and training the model to be trained to obtain a preset number of initial models, wherein each training set comprises the preset number minus one of the optimization sets; verifying the corresponding optimization sets respectively through the preset number of initial models to determine type prediction scores of the two-class label data in the corresponding optimization sets, wherein the corresponding optimization set is the optimization set that was not used when the initial model was trained; and determining, through a preset evaluation rule based on the type prediction score, whether the corresponding two-class label data is error label data.
With the pseudo label-based two-class label data optimization method provided by the embodiments of the present application, the label data is cyclically verified using pseudo-label technology, so error label data in the two-class label data can be effectively screened out and the label accuracy of the two-class label data is improved, which in turn can greatly improve the accuracy of a recognition model that is subsequently trained on the data.
In an implementation manner of the present application, determining whether the corresponding two-class label data is error label data through a preset evaluation rule based on the type prediction score specifically includes: generating a pseudo label for the two-class label data based on the type prediction score and a preset score classification threshold, wherein the pseudo label is a binary label; judging whether the pseudo label is the same as the label of the two-class label data; and, when the pseudo label differs from the label of the two-class label data, determining whether the two-class label data is error label data based on a preset first label score deviation threshold and a preset second label score deviation threshold.
In an implementation manner of the present application, generating a pseudo tag for two-class tag data based on a type prediction score and a preset score classification threshold specifically includes: determining whether the type prediction score is less than a score classification threshold; generating a pseudo label of a first label type for the two-class label data under the condition that the type prediction score is smaller than the score classification threshold value; wherein the first tag type is a "0" tag; under the condition that the type prediction score is not smaller than the score classification threshold value, generating a pseudo label of a second label type for the two-class label data; wherein the second tag type is a "1" tag.
In an implementation manner of the present application, determining whether the two-class label data is error label data based on a preset first label score deviation threshold and a preset second label score deviation threshold specifically includes: when the label of the two-class label data is of the first label type and the pseudo label is of the second label type, determining the two-class label data to be error label data if the type prediction score is greater than the first label score deviation threshold; and when the label of the two-class label data is of the second label type and the pseudo label is of the first label type, determining the two-class label data to be error label data if the type prediction score is smaller than the second label score deviation threshold.
In an implementation manner of the present application, dividing the two-class label data set to be optimized into a preset number of optimization sets specifically includes: determining a first quantity corresponding to first label type data and a second quantity corresponding to second label type data in the two-class label data set to be optimized, and determining whether the absolute value of the difference between the first quantity and the second quantity is smaller than a first preset threshold; and, when the absolute value of the difference between the first quantity and the second quantity is smaller than the first preset threshold, dividing the two-class label data set to be optimized into a preset number of optimization sets each containing the same amount of label data, wherein, within each optimization set, the absolute value of the difference between a third quantity of first label type data and a fourth quantity of second label type data is smaller than a second preset threshold.
In one implementation of the present application, the method further comprises: and under the condition that the absolute value of the difference value between the first quantity and the second quantity is not smaller than a first preset threshold value, adjusting the quantity of the first label type data in the to-be-optimized two-class label data set, or adjusting the quantity of the second label type data in the to-be-optimized two-class label data set, so that the absolute value of the difference value between the first quantity corresponding to the first label type data and the second quantity corresponding to the second label type data is smaller than the first preset threshold value.
In an implementation manner of the present application, based on a preset number of optimization sets, a preset number of training sets is determined, which specifically includes: determining any one of a preset number of optimization sets as a corresponding optimization set, and forming other optimization sets except the corresponding optimization set into a training set corresponding to the corresponding optimization set; and traversing a preset number of optimization sets to obtain a preset number of training sets corresponding to each optimization set.
In one implementation manner of the present application, before training the model to be trained, the method further includes: and adding a verification module in the model to be trained, so that the initial model obtained through training can determine the type prediction score of each piece of classified label data when the corresponding optimization set is verified.
In a second aspect, an embodiment of the present application further provides a pseudo tag-based binary tag data optimization device, where the device includes: a processor; and a memory having executable code stored thereon, which when executed, causes the processor to perform a method according to any one of claims 1-8.
In a third aspect, an embodiment of the present application further provides a non-volatile computer storage medium for pseudo label-based two-class label data optimization, where the non-volatile computer storage medium stores computer-executable instructions, and the computer-executable instructions are configured to: divide a two-class label data set to be optimized into a preset number of optimization sets; determine a preset number of training sets based on the preset number of optimization sets, and train the model to be trained to obtain a preset number of initial models, wherein each training set comprises the preset number minus one of the optimization sets; verify the corresponding optimization sets respectively through the preset number of initial models to determine type prediction scores of the two-class label data in the corresponding optimization sets, wherein the corresponding optimization set is the optimization set that was not used when the initial model was trained; and determine, through a preset evaluation rule based on the type prediction score, whether the corresponding two-class label data is error label data.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
Fig. 1 is a flowchart of a pseudo label-based two-class label data optimization method according to an embodiment of the present application;
Fig. 2 is a schematic diagram of the internal structure of a pseudo label-based two-class label data optimization device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The embodiment of the application provides a method, equipment and a medium for optimizing data of a two-class label based on a pseudo label, which are used for solving the following technical problems: how to effectively identify the error label data in the two-classification label data based on the pseudo label technology.
The technical solutions proposed in the embodiments of the present application are described in detail below with reference to the accompanying drawings.
Fig. 1 is a flowchart of a method for optimizing data of a binary label based on a pseudo label according to an embodiment of the present application. As shown in fig. 1, a method for optimizing data of a two-class label based on a pseudo label provided in an embodiment of the present application mainly includes the following steps:
Step 101, dividing the two-class label data set to be optimized into a preset number of optimization sets.
First, it should be noted that, in the embodiment of the present application, a "0" tag of the two-class tag data is determined as the first tag type, and a "1" tag of the two-class tag data is determined as the second tag type.
In one embodiment of the present application, after the two-class label data set to be optimized is determined, it first needs to be divided into a preset number of optimization sets.
Specifically, a first quantity corresponding to first label type data and a second quantity corresponding to second label type data in the two-class label data set to be optimized are determined; that is, the number of "0" label data items and the number of "1" label data items are counted separately. After the first quantity and the second quantity are determined, it is determined whether the absolute value of the difference between them is smaller than a first preset threshold. It should be noted that, so that the trained recognition model recognizes well and behaves stably, the data volumes of the two label types should not differ too much.
Further, if it is determined that the absolute value of the difference between the first number and the second number is not less than the first preset threshold, the number of the first tag type data in the to-be-optimized two-class tag data set is adjusted, or the number of the second tag type data in the to-be-optimized two-class tag data set is adjusted, so that the absolute value of the difference between the first number corresponding to the first tag type data and the second number corresponding to the second tag type data is less than the first preset threshold.
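As a minimal illustration of the adjustment described above, the following Python sketch randomly discards samples of the more numerous label type until the difference in counts falls below the first preset threshold. Random down-sampling is only one possible way of adjusting the quantity and is an assumption here, as are the function name `rebalance`, the `(item, label)` tuple layout and the `seed` parameter; none of them comes from the application itself.

```python
import random


def rebalance(samples, first_threshold, seed=0):
    """Down-sample the larger label type until |n0 - n1| < first_threshold.

    samples: list of (item, label) tuples with label in {0, 1} (assumed layout).
    first_threshold: the "first preset threshold"; must be at least 1.
    """
    if first_threshold < 1:
        raise ValueError("first_threshold must be at least 1")
    rng = random.Random(seed)
    zeros = [s for s in samples if s[1] == 0]  # first label type ("0")
    ones = [s for s in samples if s[1] == 1]   # second label type ("1")
    while abs(len(zeros) - len(ones)) >= first_threshold:
        larger = zeros if len(zeros) > len(ones) else ones
        # Drop one randomly chosen sample from the larger class.
        larger.pop(rng.randrange(len(larger)))
    return zeros + ones
```

For example, with ten "0" samples, four "1" samples and a threshold of 3, the sketch keeps six "0" samples and all four "1" samples. Adding data of the less numerous type would satisfy the same condition and may be preferable when data is scarce.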
Further, if it is determined that the absolute value of the difference between the first quantity and the second quantity is smaller than the first preset threshold, the two-class label data set to be optimized is divided into a preset number of optimization sets containing equal amounts of label data. It should be noted that the optimization sets are determined so that models can be trained on some of the sets and cyclically verified on the others. Therefore, in each optimization set, the absolute value of the difference between a third quantity corresponding to the first label type data and a fourth quantity corresponding to the second label type data should be smaller than a second preset threshold; that is, the data volumes of the two label types within each optimization set should not differ too much.
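The division into optimization sets can be sketched as follows. This is only an illustrative assumption about how the per-set balance might be achieved: samples of each label type are shuffled and then dealt out in turn, so that every optimization set receives nearly the same number of "0" and "1" samples. The function name `split_into_optimization_sets`, the `(item, label)` layout and the `seed` parameter are hypothetical.

```python
import random


def split_into_optimization_sets(samples, k, seed=0):
    """Split (item, label) samples into k optimization sets of near-equal size.

    Samples are dealt round-robin, class by class, so the per-set counts of
    each label type differ by at most one between sets.
    """
    rng = random.Random(seed)
    zeros = [s for s in samples if s[1] == 0]
    ones = [s for s in samples if s[1] == 1]
    rng.shuffle(zeros)
    rng.shuffle(ones)
    optimization_sets = [[] for _ in range(k)]
    for index, sample in enumerate(zeros + ones):
        optimization_sets[index % k].append(sample)
    return optimization_sets
```

When the total number of samples is not divisible by `k`, the set sizes differ by one, so a caller that needs exactly equal sets would have to trim the data first.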
Step 102, determining a preset number of training sets based on a preset number of optimization sets, and training the model to be trained to obtain a preset number of initial models.
In an embodiment of the present application, after dividing the to-be-optimized two-class label data set into a preset number of optimization sets, a preset number of training sets needs to be determined based on the preset number of optimization sets.
Specifically, each training set consists of the preset number minus one of the optimization sets. Therefore, each time one optimization set is set aside, the remaining optimization sets can be combined into one training set, and after all the optimization sets have been traversed, a preset number of training sets is obtained. The optimization set that is set aside for a given training set is the corresponding optimization set of that training set. For example, if the two-class label data set L to be optimized is divided into five optimization sets numbered F1 to F5, five training sets can be determined from them; assuming that (F2, F3, F4, F5) form training set S1, F1 is the corresponding optimization set of training set S1.
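The traversal described above can be sketched in a few lines of Python. The function name `build_training_sets` and the representation of each optimization set as a plain list are assumptions made for illustration.

```python
def build_training_sets(optimization_sets):
    """Return one (training_set, corresponding_optimization_set) pair per set.

    Each training set is the concatenation of all optimization sets except
    the one that is held out as its corresponding optimization set.
    """
    pairs = []
    for i, held_out in enumerate(optimization_sets):
        training_set = [
            sample
            for j, other in enumerate(optimization_sets)
            if j != i
            for sample in other
        ]
        pairs.append((training_set, held_out))
    return pairs


# Five optimization sets F1..F5, each holding a single placeholder sample.
f1, f2, f3, f4, f5 = ["a"], ["b"], ["c"], ["d"], ["e"]
pairs = build_training_sets([f1, f2, f3, f4, f5])
s1, corresponding = pairs[0]  # S1 is built from F2..F5, and F1 is held out
```

Here `s1` contains the samples of F2 to F5 and `corresponding` is F1, matching the example in the text.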
Further, by training the models to be trained respectively through the preset number of training sets, the preset number of initial models can be obtained through training. It will be appreciated that since each training set corresponds to a respective optimization set, each initial model also corresponds to a respective optimization set.
Step 103, verifying the corresponding optimization sets respectively through the preset number of initial models to determine the type prediction scores of the two-class label data in the corresponding optimization sets.
In an embodiment of the application, after the model to be trained has been trained on the preset number of training sets to obtain the preset number of initial models, each initial model is used to verify its corresponding optimization set. It can be understood that the training of each initial model did not use its corresponding optimization set, so the two-class label data in that set is equivalent to unseen data for the model, and the model can therefore be used to identify whether each item is data of the first label type or data of the second label type.
In one embodiment of the application, the model's identification of a specific label type is not definite; it is only a choice made on the basis of probability. In order to determine the label type of the data more accurately, after the initial model has identified the corresponding optimization set, the probability for each piece of two-class label data in the corresponding optimization set is output directly, and this probability is taken as the type prediction score of that piece of two-class label data. It is understood that the type prediction score ranges from 0 to 1.
In an embodiment of the present application, since the type prediction score is output by the initial model, a verification module needs to be added to the model to be trained before training, so that the initial model obtained through training can determine the type prediction score of each piece of two-class label data when verifying the corresponding optimization set.
It can be understood that once every initial model has verified its corresponding optimization set, every piece of two-class label data to be optimized has been verified exactly once; that is, each piece of two-class label data in the two-class label data set to be optimized has obtained one type prediction score.
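The cyclic verification of steps 102 and 103 can be sketched as below. The application does not specify the model, so `train_toy_model` is a deliberately simple, hypothetical stand-in (a one-dimensional logistic scorer centered between the class means) used only to make the loop runnable; the `(sample_id, feature, label)` tuple layout is also an assumption. The sketch requires that every training set contains samples of both label types.

```python
import math


def train_toy_model(training_set):
    """Hypothetical stand-in for training the model to be trained.

    Returns a function mapping a feature value to a score in (0, 1), read as
    the predicted probability that the sample is of the "1" label type.
    """
    xs0 = [x for _, x, y in training_set if y == 0]
    xs1 = [x for _, x, y in training_set if y == 1]
    mean0 = sum(xs0) / len(xs0)
    mean1 = sum(xs1) / len(xs1)
    midpoint = (mean0 + mean1) / 2
    scale = mean1 - mean0  # sign encodes which side belongs to label "1"

    def score(x):
        return 1.0 / (1.0 + math.exp(-scale * (x - midpoint)))

    return score


def out_of_fold_scores(optimization_sets):
    """Give every sample one type prediction score from a model that never saw it."""
    scores = {}
    for i, held_out in enumerate(optimization_sets):
        training_set = [
            s for j, other in enumerate(optimization_sets) if j != i for s in other
        ]
        initial_model = train_toy_model(training_set)
        for sample_id, x, _label in held_out:
            scores[sample_id] = initial_model(x)
    return scores


optimization_sets = [
    [("a", 0.0, 0), ("b", 4.0, 1)],
    [("c", 1.0, 0), ("d", 5.0, 1)],
    [("e", 0.5, 0), ("f", 4.5, 1)],
]
scores = out_of_fold_scores(optimization_sets)
```

Because each sample belongs to exactly one optimization set, `scores` ends up with exactly one entry per sample, which is the property the paragraph above relies on.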
Step 104, determining whether the corresponding two-class label data is error label data through a preset evaluation rule based on the type prediction score.
In one embodiment of the present application, after each piece of two-class label data has been verified by the corresponding initial model and its type prediction score determined, it is determined whether the type prediction score is smaller than a preset score classification threshold. It can be understood that the score classification threshold has the same value range as the type prediction score, namely between 0 and 1.
Further, if the type prediction score of the two-class label data is smaller than the score classification threshold value, generating a pseudo label of the first label type for the two-class label data; that is, if the type prediction score of the binary label data is smaller than the score classification threshold, the binary label data is further labeled with a pseudo label, and the pseudo label is a "0" label. If the type prediction score of the two-class label data is not smaller than the score classification threshold value, generating a pseudo label of a second label type for the two-class label data; that is, if the type prediction score of the binary label data is greater than or equal to the score classification threshold, the binary label data is further labeled with a pseudo label, and the pseudo label is a "1" label.
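The pseudo-label rule above reduces to a single comparison, sketched here in Python. The function name `pseudo_label` and the default threshold of 0.5 are assumptions; the application only requires the threshold to lie between 0 and 1.

```python
def pseudo_label(type_prediction_score, score_classification_threshold=0.5):
    """Return the pseudo label: 0 if the score is below the threshold, else 1."""
    if type_prediction_score < score_classification_threshold:
        return 0  # first label type
    return 1      # second label type


labels = [pseudo_label(s) for s in (0.12, 0.49, 0.50, 0.93)]
```

Note that a score exactly equal to the threshold yields a "1" pseudo label, because the rule tests "not smaller than".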
Further, since the two-class tag data is the tagged data, there is a tag in the two-class tag data itself, and a pseudo tag is generated through the above process, so that each two-class tag data has two tags, one tag is the tag tagged by the two-class tag data itself, and the other tag is the generated pseudo tag. After the pseudo label is generated, whether the label of the pseudo label is the same as that of the binary label data is judged. And if the pseudo label is the same as the label of the binary label data, determining that the original label is not marked with errors. If the pseudo tag is not the same as the tag of the binary tag data, further determination is required.
Further, if the label of the two-class label data is of the first label type and the pseudo label is of the second label type, it is judged whether the type prediction score of the two-class label data is greater than the first label score deviation threshold. It is to be understood that the first label score deviation threshold is a threshold set for checking whether two-class label data whose own label is of the first label type and whose pseudo label is of the second label type is error label data. If the type prediction score of the two-class label data is greater than the first label score deviation threshold, the two-class label data is determined to be error label data.
If the label of the two-class label data is of the second label type and the pseudo label is of the first label type, it is judged whether the type prediction score of the two-class label data is smaller than the second label score deviation threshold. It is to be understood that the second label score deviation threshold is a threshold set for checking whether two-class label data whose own label is of the second label type and whose pseudo label is of the first label type is error label data. If the type prediction score of the two-class label data is smaller than the second label score deviation threshold, the two-class label data is determined to be error label data.
It is understood that the first tag score deviation threshold and the second tag score deviation threshold both range from 0 to 1.
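Putting the pseudo label and the two deviation thresholds together, the evaluation rule can be sketched as follows. The function and parameter names are hypothetical, and the threshold values used in the example (0.5, 0.9 and 0.1) are illustrative assumptions rather than values taken from the application.

```python
def is_error_label(label, score, classification_threshold,
                   first_deviation_threshold, second_deviation_threshold):
    """Apply the evaluation rule to one piece of two-class label data.

    label: the original label (0 or 1).
    score: the type prediction score in [0, 1] from the initial model.
    """
    pseudo = 0 if score < classification_threshold else 1
    if pseudo == label:
        # Pseudo label agrees with the original label: not flagged.
        return False
    if label == 0:
        # Labeled "0" but predicted "1": flag only if the score is high enough.
        return score > first_deviation_threshold
    # Labeled "1" but predicted "0": flag only if the score is low enough.
    return score < second_deviation_threshold


cases = [(0, 0.95), (0, 0.60), (1, 0.05), (1, 0.30), (1, 0.80)]
flags = [is_error_label(y, s, 0.5, 0.9, 0.1) for y, s in cases]
```

With these example thresholds, a sample is flagged only when the model disagrees with the original label and does so with a score beyond the relevant deviation threshold; mild disagreements such as `(0, 0.60)` and `(1, 0.30)` are left alone.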
In an embodiment of the application, after a piece of two-class label data is determined to be error label data, it can be removed from the data set, or its label can be modified, so that the initial model can continue to be trained on the optimized data to obtain an accurate and stable recognition model.
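The two ways of handling error label data can be sketched as follows. The function name `apply_optimization`, the `mode` parameter and the `(sample_id, item, label)` tuple layout are assumptions; for binary labels, modifying a label is taken here to mean flipping it between "0" and "1".

```python
def apply_optimization(samples, error_ids, mode="remove"):
    """Remove flagged samples, or flip their labels, and return the new list.

    samples: list of (sample_id, item, label) tuples with label in {0, 1}.
    error_ids: set of sample ids that were determined to be error label data.
    mode: "remove" drops flagged samples; "relabel" flips their labels.
    """
    if mode not in ("remove", "relabel"):
        raise ValueError("mode must be 'remove' or 'relabel'")
    cleaned = []
    for sample_id, item, label in samples:
        if sample_id in error_ids:
            if mode == "remove":
                continue
            label = 1 - label  # flip 0 <-> 1
        cleaned.append((sample_id, item, label))
    return cleaned


samples = [("a", "img_a", 0), ("b", "img_b", 1), ("c", "img_c", 1)]
removed = apply_optimization(samples, {"b"}, mode="remove")
relabeled = apply_optimization(samples, {"b"}, mode="relabel")
```

Removal is the more conservative choice when the thresholds are loose, while relabeling keeps the data volume unchanged.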
Based on the same inventive concept, the embodiment of the present application further provides a two-class tag data optimization device based on a pseudo tag, and an internal structure of the two-class tag data optimization device is shown in fig. 2.
Fig. 2 is a schematic diagram of an internal structure of a pseudo tag-based two-class tag data optimization device according to an embodiment of the present application. As shown in fig. 2, the apparatus includes: a processor 201; a memory 202 having stored thereon executable instructions that, when executed, cause the processor 201 to perform a pseudo tag based two-class tag data optimization method as described above.
In an embodiment of the present application, the processor 201 is configured to: divide the two-class label data set to be optimized into a preset number of optimization sets; determine a preset number of training sets based on the preset number of optimization sets, and train the model to be trained to obtain a preset number of initial models, wherein each training set comprises the preset number minus one of the optimization sets; verify the corresponding optimization sets respectively through the preset number of initial models to determine type prediction scores of the two-class label data in the corresponding optimization sets, wherein the corresponding optimization set is the optimization set that was not used when the initial model was trained; and determine, through a preset evaluation rule based on the type prediction score, whether the corresponding two-class label data is error label data.
Some embodiments of the present application provide a non-volatile computer storage medium corresponding to the pseudo label-based two-class label data optimization method of Fig. 1, having stored thereon computer-executable instructions configured to:
divide the two-class label data set to be optimized into a preset number of optimization sets;
determine a preset number of training sets based on the preset number of optimization sets, and train the model to be trained to obtain a preset number of initial models, wherein each training set comprises the preset number minus one of the optimization sets;
verify the corresponding optimization sets respectively through the preset number of initial models to determine type prediction scores of the two-class label data in the corresponding optimization sets, wherein the corresponding optimization set is the optimization set that was not used when the initial model was trained;
and determine, through a preset evaluation rule based on the type prediction score, whether the corresponding two-class label data is error label data.
The embodiments in the present application are described in a progressive manner; for parts that are the same or similar among the embodiments, reference may be made from one to another, and each embodiment focuses on its differences from the other embodiments. In particular, since the device and medium embodiments are substantially similar to the method embodiments, their description is relatively brief, and for relevant points reference may be made to the corresponding parts of the method embodiments.
The device and the medium provided by the embodiments of the present application correspond one to one with the method, so the device and the medium also have beneficial technical effects similar to those of the corresponding method.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media, such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (10)

1. A pseudo label-based two-class label data optimization method, characterized by comprising the following steps:
dividing a two-class label data set to be optimized into a preset number of optimization sets;
determining a preset number of training sets based on the preset number of optimization sets, and training a model to be trained on each training set to obtain a preset number of initial models; wherein each training set comprises the preset number minus one of the optimization sets;
verifying the corresponding optimization set with each of the preset number of initial models to determine type prediction scores of the two-class label data in the corresponding optimization set; wherein the corresponding optimization set is the optimization set that was not used when training the initial model;
and determining, based on the type prediction score and through a preset evaluation rule, whether the corresponding two-class label data is wrong label data.
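The steps of claim 1 resemble out-of-fold scoring in K-fold cross-validation. The following is a minimal, non-authoritative Python sketch under stated assumptions: each sample is a hypothetical `(feature, label)` pair with a single numeric feature, and `train_centroid_model` is an illustrative stand-in for the model to be trained, which the claims do not specify. All function names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[float, int]        # (feature, binary label); the 1-D feature is an assumption
Model = Callable[[float], float]  # maps a feature to a type prediction score in [0, 1]


def split_into_optimization_sets(data: Sequence[Sample], k: int) -> List[List[Sample]]:
    """Deal the samples round-robin into k optimization sets."""
    return [list(data[i::k]) for i in range(k)]


def train_centroid_model(train: Sequence[Sample]) -> Model:
    """Illustrative stand-in model: score by relative distance to the class means.

    Assumes the training set contains at least one sample of each class.
    """
    zeros = [x for x, y in train if y == 0]
    ones = [x for x, y in train if y == 1]
    mean0 = sum(zeros) / len(zeros)
    mean1 = sum(ones) / len(ones)

    def score(x: float) -> float:
        d0, d1 = abs(x - mean0), abs(x - mean1)
        if d0 + d1 == 0:
            return 0.5
        return d0 / (d0 + d1)  # close to the "1" mean gives a score close to 1

    return score


def out_of_fold_scores(data: Sequence[Sample], k: int) -> List[Tuple[float, int, float]]:
    """Score every sample with the model that never saw it during training."""
    folds = split_into_optimization_sets(data, k)
    results = []
    for held_out_index, held_out in enumerate(folds):
        # Training set = the other k - 1 optimization sets.
        train = [s for j, fold in enumerate(folds) if j != held_out_index for s in fold]
        model = train_centroid_model(train)
        results.extend((x, y, model(x)) for x, y in held_out)
    return results
```

Each returned triple holds the feature, the original label, and the score produced by the one initial model whose training set excluded that sample.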
2. The pseudo label-based two-class label data optimization method according to claim 1, wherein determining, based on the type prediction score and through a preset evaluation rule, whether the corresponding two-class label data is wrong label data specifically comprises:
generating a pseudo label for the two-class label data based on the type prediction score and a preset score classification threshold; wherein the pseudo label is a binary label;
judging whether the pseudo label is the same as the label of the two-class label data;
and in the case that the pseudo label differs from the label of the two-class label data, determining whether the two-class label data is wrong label data based on a preset first label score deviation threshold and a preset second label score deviation threshold.
3. The pseudo label-based two-class label data optimization method according to claim 2, wherein generating a pseudo label for the two-class label data based on the type prediction score and a preset score classification threshold specifically comprises:
determining whether the type prediction score is less than the score classification threshold;
generating a pseudo label of a first label type for the two-class label data if the type prediction score is less than the score classification threshold; wherein the first label type is a "0" label;
generating a pseudo label of a second label type for the two-class label data if the type prediction score is not less than the score classification threshold; wherein the second label type is a "1" label.
4. The pseudo label-based two-class label data optimization method according to claim 2, wherein determining whether the two-class label data is wrong label data based on a preset first label score deviation threshold and a preset second label score deviation threshold specifically comprises:
in the case that the label of the two-class label data is of a first label type and the pseudo label is of a second label type, determining that the two-class label data is wrong label data if the type prediction score is greater than the first label score deviation threshold;
and in the case that the label of the two-class label data is of the second label type and the pseudo label is of the first label type, determining that the two-class label data is wrong label data if the type prediction score is smaller than the second label score deviation threshold.
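The evaluation rule of claims 2 to 4 can be sketched as two small functions. This is an illustration only: the function names and the default threshold values (0.5, 0.9, 0.1) are assumptions chosen for the example, since the claims only require that the thresholds be preset.

```python
def make_pseudo_label(score: float, classification_threshold: float) -> int:
    """Return the "0" label if the score is below the threshold, else the "1" label."""
    return 0 if score < classification_threshold else 1


def is_wrong_label(
    label: int,
    score: float,
    classification_threshold: float = 0.5,    # assumed value
    first_deviation_threshold: float = 0.9,   # assumed value, used for label "0"
    second_deviation_threshold: float = 0.1,  # assumed value, used for label "1"
) -> bool:
    """Flag a sample as wrongly labeled only when the model disagrees strongly."""
    pseudo_label = make_pseudo_label(score, classification_threshold)
    if pseudo_label == label:
        return False  # pseudo label agrees with the given label
    if label == 0:
        # Given label "0", pseudo label "1": require a score above the first threshold.
        return score > first_deviation_threshold
    # Given label "1", pseudo label "0": require a score below the second threshold.
    return score < second_deviation_threshold
```

With these assumed thresholds, a sample labeled "0" that scores 0.95 is flagged, while one that scores 0.7 is left alone even though its pseudo label disagrees. The two deviation thresholds therefore act as a confidence margin around the classification threshold.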
5. The pseudo label-based two-class label data optimization method according to claim 1, wherein dividing the two-class label data set to be optimized into a preset number of optimization sets specifically comprises:
determining a first quantity corresponding to first label type data and a second quantity corresponding to second label type data in the two-class label data set to be optimized, and determining whether the absolute value of the difference between the first quantity and the second quantity is smaller than a first preset threshold;
in the case that the absolute value of the difference between the first quantity and the second quantity is smaller than the first preset threshold, dividing the two-class label data set to be optimized into a preset number of optimization sets each containing the same quantity of label data; wherein, in each optimization set, the absolute value of the difference between a third quantity of first label type data and a fourth quantity of second label type data is smaller than a second preset threshold.
6. The pseudo label-based two-class label data optimization method according to claim 5, further comprising:
in the case that the absolute value of the difference between the first quantity and the second quantity is not smaller than the first preset threshold, adjusting the quantity of first label type data in the two-class label data set to be optimized, or adjusting the quantity of second label type data in the two-class label data set to be optimized, so that the absolute value of the difference between the first quantity corresponding to the first label type data and the second quantity corresponding to the second label type data is smaller than the first preset threshold.
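The balance check and division of claims 5 and 6 can be sketched as follows. Several points are assumptions rather than part of the claims: the adjustment is done by randomly dropping majority-class samples (the claims only say the quantity is adjusted), samples are hypothetical `(id, label)` pairs, and the function names are illustrative.

```python
import random
from typing import List, Sequence, Tuple

Sample = Tuple[str, int]  # (sample id, binary label); the id field is illustrative


def balance_classes(data: Sequence[Sample], first_threshold: int, seed: int = 0) -> List[Sample]:
    """Drop random majority-class samples until |n0 - n1| < first_threshold."""
    if first_threshold < 1:
        raise ValueError("first_threshold must be at least 1")
    zeros = [s for s in data if s[1] == 0]
    ones = [s for s in data if s[1] == 1]
    rng = random.Random(seed)
    while abs(len(zeros) - len(ones)) >= first_threshold:
        majority = zeros if len(zeros) > len(ones) else ones
        majority.pop(rng.randrange(len(majority)))
    # Returning the samples sorted by class lets round-robin dealing
    # spread both classes evenly over the optimization sets.
    return zeros + ones


def split_into_optimization_sets(
    data: Sequence[Sample], k: int, first_threshold: int
) -> List[List[Sample]]:
    """Balance the classes, then deal the samples round-robin into k sets."""
    balanced = balance_classes(data, first_threshold)
    return [balanced[i::k] for i in range(k)]
```

In this sketch the optimization sets have exactly equal sizes only when the balanced sample count is divisible by `k`; otherwise sizes differ by at most one, so a stricter implementation would need to trim or pad the data.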
7. The pseudo label-based two-class label data optimization method according to claim 1, wherein determining a preset number of training sets based on the preset number of optimization sets specifically comprises:
taking any one of the preset number of optimization sets as the corresponding optimization set, and forming the optimization sets other than the corresponding optimization set into the training set corresponding to that optimization set;
and traversing the preset number of optimization sets to obtain the preset number of training sets, one corresponding to each optimization set.
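The traversal in claim 7 amounts to a leave-one-set-out construction. A minimal sketch, with a hypothetical function name and generic element type:

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def build_training_sets(optimization_sets: Sequence[Sequence[T]]) -> List[List[T]]:
    """For each optimization set i, merge every other set into training set i."""
    return [
        [sample for j, other in enumerate(optimization_sets) if j != i for sample in other]
        for i in range(len(optimization_sets))
    ]
```

Training set `i` pairs with optimization set `i`, so a model trained on it has never seen the samples it later verifies.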
8. The pseudo label-based two-class label data optimization method according to claim 1, wherein before training the model to be trained, the method further comprises:
adding a verification module to the model to be trained, so that the initial model obtained through training can determine the type prediction score of each piece of two-class label data when verifying the corresponding optimization set.
9. A pseudo label-based two-class label data optimization device, characterized in that the device comprises:
a processor;
and a memory having executable code stored thereon, which, when executed, causes the processor to perform the method according to any one of claims 1-8.
10. A non-transitory computer storage medium for pseudo label-based two-class label data optimization, storing computer-executable instructions configured to:
divide a two-class label data set to be optimized into a preset number of optimization sets;
determine a preset number of training sets based on the preset number of optimization sets, and train a model to be trained on each training set to obtain a preset number of initial models; wherein each training set comprises the preset number minus one of the optimization sets;
verify the corresponding optimization set with each of the preset number of initial models to determine type prediction scores of the two-class label data in the corresponding optimization set; wherein the corresponding optimization set is the optimization set that was not used when training the initial model;
and determine, based on the type prediction score and through a preset evaluation rule, whether the corresponding two-class label data is wrong label data.
CN202111663474.0A 2021-12-30 2021-12-30 Pseudo tag-based binary label data optimization method, equipment and medium Active CN114330618B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111663474.0A CN114330618B (en) 2021-12-30 2021-12-30 Pseudo tag-based binary label data optimization method, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111663474.0A CN114330618B (en) 2021-12-30 2021-12-30 Pseudo tag-based binary label data optimization method, equipment and medium

Publications (2)

Publication Number Publication Date
CN114330618A true CN114330618A (en) 2022-04-12
CN114330618B CN114330618B (en) 2024-07-02

Family

ID=81021795

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111663474.0A Active CN114330618B (en) 2021-12-30 2021-12-30 Pseudo tag-based binary label data optimization method, equipment and medium

Country Status (1)

Country Link
CN (1) CN114330618B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111666502A (en) * 2020-07-08 2020-09-15 腾讯科技(深圳)有限公司 Abnormal user identification method and device based on deep learning and storage medium
CN112149733A (en) * 2020-09-23 2020-12-29 北京金山云网络技术有限公司 Model training method, model training device, quality determining method, quality determining device, electronic equipment and storage medium
CN112541542A (en) * 2020-12-11 2021-03-23 第四范式(北京)技术有限公司 Method and device for processing multi-classification sample data and computer readable storage medium
US20210089964A1 (en) * 2019-09-20 2021-03-25 Google Llc Robust training in the presence of label noise
KR20210085158A (en) * 2019-12-30 2021-07-08 한국과학기술원 Method and apparatus for recognizing named entity considering context
CN113378895A (en) * 2021-05-24 2021-09-10 成都欧珀通信科技有限公司 Classification model generation method and device, storage medium and electronic equipment
CN113420707A (en) * 2021-07-05 2021-09-21 神思电子技术股份有限公司 Video target detection method based on weak supervised learning
CN113822374A (en) * 2021-10-29 2021-12-21 平安科技(深圳)有限公司 Model training method, system, terminal and storage medium based on semi-supervised learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Isaac Triguero et al.: "On the characterization of noise filters for self-training semi-supervised in nearest neighbor classification", Neurocomputing, vol. 132, 12 November 2013 (2013-11-12), pages 30 - 41, XP028662780, DOI: 10.1016/j.neucom.2013.05.055 *
Deng Wei; Xing Yuhan; Li Yifan; Li Zhenhua; Wang Guoyin: "A survey of research on fairness in machine learning", CAAI Transactions on Intelligent Systems, no. 03, 29 July 2020 (2020-07-29), pages 172 - 180 *

Also Published As

Publication number Publication date
CN114330618B (en) 2024-07-02

Similar Documents

Publication Publication Date Title
CN110348580B (en) Method and device for constructing GBDT model, and prediction method and device
CN111080304B (en) Credible relationship identification method, device and equipment
CN110119860A (en) A kind of rubbish account detection method, device and equipment
US20210263979A1 (en) Method, system and device for identifying crawler data
CN110633989A (en) Method and device for determining risk behavior generation model
CN111401766B (en) Model, service processing method, device and equipment
CN110968689A (en) Training method of criminal name and law bar prediction model and criminal name and law bar prediction method
CN111368163B (en) Crawler data identification method, system and equipment
CN108229564B (en) Data processing method, device and equipment
CN116151236A (en) Training method of text processing model, text processing method and related equipment
CN111258905B (en) Defect positioning method and device, electronic equipment and computer readable storage medium
CN117609781A (en) Training method of text evaluation model, text evaluation method and device
CN117409419A (en) Image detection method, device and storage medium
CN114338413A (en) Method and device for determining topological relation of equipment in network and storage medium
CN115712866A (en) Data processing method, device and equipment
CN114077859A (en) Abnormal sample detection method and device, electronic device and storage medium
CN114254588B (en) Data tag processing method and device
CN110163470B (en) Event evaluation method and device
CN114330618B (en) Pseudo tag-based binary label data optimization method, equipment and medium
CN116028626A (en) Text matching method and device, storage medium and electronic equipment
CN111741526B (en) Positioning method, positioning device, electronic equipment and computer storage medium
CN114912513A (en) Model training method, information identification method and device
CN110458393B (en) Method and device for determining risk identification scheme and electronic equipment
CN110543549B (en) Semantic equivalence judgment method and device
CN114036283A (en) Text matching method, device, equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant