CN112749150A

CN112749150A - Method, device and medium for identifying error marking data

Info

Publication number: CN112749150A
Application number: CN201911055046.2A
Authority: CN
Inventors: 刘睿; 靳丁南; 罗欢; 权圣
Original assignee: Beijing Zhongguancun Kejin Technology Co Ltd
Current assignee: Beijing Zhongguancun Kejin Technology Co Ltd
Priority date: 2019-10-31
Filing date: 2019-10-31
Publication date: 2021-05-04
Anticipated expiration: 2039-10-31
Also published as: CN112749150B

Abstract

The application discloses a method, a device and a storage medium for identifying error labeling data. The method comprises: acquiring to-be-audited labeled data of a current batch, and determining error labeling data in the to-be-audited labeled data according to a preset category statistical index and a preset labeled data audit model, wherein the labeled data audit model is obtained by update training using the labeled data of the batch preceding the current batch. Through this embodiment, the efficiency of quality inspection of labeled data can be improved.

Description

Method, device and medium for identifying error marking data

Technical Field

The present application relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, and a medium for identifying mislabeled data.

Background

With the development of artificial intelligence technology, demand keeps growing for labeled data, which is often called the "grain" of the artificial intelligence field. After annotators label the data, professionals must audit it manually as a quality inspection step, and the labeled data counts as qualified only when the quality inspection shows that it reaches the required accuracy.

In the existing quality inspection procedure, a portion of a batch of labeled data is sampled for auditing, and the accuracy of the audited sample is calculated. If the accuracy does not reach the standard, the whole batch fails quality inspection, and the annotators must re-label the entire batch until it passes. This requires considerable extra work and makes quality inspection of labeled data inefficient.

The embodiment of the disclosure provides a method, a device and a medium for identifying wrong labeling data, so as to improve the quality inspection working efficiency of the labeling data.

Disclosure of Invention

The embodiment of the disclosure provides a method and a device for identifying error labeling data and a storage medium, which can improve the quality inspection working efficiency of the labeling data.

To solve the above technical problem, the embodiment of the present invention is implemented as follows:

in a first aspect, an embodiment of the present disclosure provides a method for identifying error marked data, including:

acquiring to-be-audited marking data of a current batch; and

determining error marking data in the to-be-audited marking data according to a preset category statistical index and a preset marking data auditing model, wherein the marking data auditing model is obtained by update training using the marked data of the batch preceding the current batch.

In a second aspect, an embodiment of the present disclosure further provides a storage medium including a stored program, wherein, when the program runs, a processor executes the method for identifying error marking data according to the first aspect.

In a third aspect, there is further provided an apparatus for identifying incorrectly labeled data according to an embodiment of the present disclosure, including:

the marking data acquisition module is used for acquiring marking data to be audited of the current batch; and

the error data determining module is used for determining error marking data in the to-be-audited marking data according to a preset category statistical index and a preset marking data auditing model, wherein the marking data auditing model is obtained by update training using the marked data of the batch preceding the current batch.

In a fourth aspect, an embodiment of the present disclosure further provides an apparatus for identifying incorrectly labeled data, including:

a processor; and

a memory coupled to the processor and configured to provide the processor with instructions for the following processing steps:

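acquiring to-be-audited marking data of a current batch; and

determining error marking data in the to-be-audited marking data according to a preset category statistical index and a preset marking data auditing model, wherein the marking data auditing model is obtained by update training using the marked data of the batch preceding the current batch.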

In the embodiment of the invention, to-be-audited labeling data of a current batch is acquired, and error labeling data is determined in the to-be-audited labeling data according to a preset category statistical index and a preset labeling data audit model, where the labeling data audit model is obtained by update training using the labeled data of the batch preceding the current batch. Because the error labeling data is identified within the to-be-audited labeling data by means of the preset category statistical index and the preset labeling data audit model, annotators can directly re-label only the identified error labeling data, which improves the efficiency of quality inspection of the labeling data.

Drawings

The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the disclosure and together with the description serve to explain the disclosure and not to limit the disclosure. In the drawings:

FIG. 1 is a block diagram of a hardware structure of a computing device for implementing a method for identifying mislabeled data according to an embodiment of the present disclosure;

FIG. 2 is a flowchart illustrating a method for identifying incorrect annotation data according to an embodiment of the disclosure;

FIG. 3 is a schematic diagram of an apparatus for identifying mislabeling data according to an embodiment of the present disclosure;

FIG. 4 is a schematic diagram of an apparatus for identifying mislabeling data according to another embodiment of the present disclosure.

Detailed Description

In order to make those skilled in the art better understand the technical solutions of the present disclosure, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure. It is to be understood that the described embodiments are merely exemplary of some, and not all, of the present disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.

It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

Example 1

According to the present embodiment, an embodiment of a method for identifying mislabeled data is provided. It should be noted that the steps illustrated in the flowchart of the drawings may be performed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is illustrated in the flowchart, in some cases the steps illustrated or described may be performed in an order different from the one shown here.

The method embodiments provided by the present embodiment may be executed in a mobile terminal, a computer terminal, a server or a similar computing device. Fig. 1 shows a hardware block diagram of a computing device for implementing a method for identifying mislabeled data. As shown in fig. 1, the computing device may include one or more processors (which may include, but are not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)), a memory for storing data, and a transmission device for communication functions. In addition, the computing device may further include: a display, an input/output interface (I/O interface), a Universal Serial Bus (USB) port (which may be included as one of the ports of the I/O interface), a network interface, a power source, and/or a camera. It will be understood by those skilled in the art that the structure shown in fig. 1 is only an illustration and is not intended to limit the structure of the electronic device. For example, the computing device may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.

It should be noted that the one or more processors and/or other data processing circuitry described above may be referred to generally herein as "data processing circuitry". The data processing circuitry may be embodied in whole or in part in software, hardware, firmware, or any combination thereof. Further, the data processing circuitry may be a single, stand-alone processing module, or incorporated in whole or in part into any of the other elements in the computing device. As referred to in the disclosed embodiments, the data processing circuit acts as a processor control (e.g., selection of a variable resistance termination path connected to the interface).

The memory may be used to store software programs and modules of application software, such as program instructions/data storage devices corresponding to the identification method of error marking data in the embodiments of the present disclosure, and the processor executes various functional applications and data processing by running the software programs and modules stored in the memory, that is, implementing the above-mentioned identification method of error marking data of the application program. The memory may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memory may further include memory located remotely from the processor, which may be connected to the computing device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The transmission device is used for receiving or transmitting data via a network. Specific examples of such a network may include a wireless network provided by the communication provider of the computing device. In one example, the transmission device includes a network interface controller (NIC) that can be connected to other network devices through a base station so as to communicate with the internet. In one example, the transmission device may be a radio frequency (RF) module, which is used for communicating with the internet wirelessly.

The display may be, for example, a touch screen type Liquid Crystal Display (LCD) that may enable a user to interact with a user interface of the computing device.

It should be noted here that in some alternative embodiments, the computing device shown in fig. 1 described above may include hardware elements (including circuitry), software elements (including computer code stored on a computer-readable medium), or a combination of both hardware and software elements. It should be noted that FIG. 1 is only one specific example and is intended to illustrate the types of components that may be present in the computing device described above.

In the above operating environment, the embodiment provides a method for identifying error marked data. Fig. 2 is a schematic flow chart of a method for identifying incorrect annotation data according to an embodiment of the disclosure, and referring to fig. 2, the method includes:

S202: acquiring to-be-audited marking data of a current batch;

S204: determining error marking data in the to-be-audited marking data according to a preset category statistical index and a preset marking data auditing model, wherein the marking data auditing model is obtained by update training using the marked data of the batch preceding the current batch.

In the step S202, the to-be-audited annotation data of the current batch is acquired. The to-be-audited annotation data is data that annotators have already labeled and that must now be audited to confirm that it reaches the accuracy required by quality inspection. Its data type includes image, video, audio, text, and the like, and may also be another type of data; no special limitation is imposed here. For example, if two pieces of audio data are labeled as a zebra's call and a lion's roar, the two labeled pieces of audio data are to-be-audited annotation data.

In the above operation S204, error annotation data is determined in the to-be-audited annotation data according to the preset category statistical index and the preset annotation data audit model, where the annotation data audit model is obtained by update training using the annotation data of the batch preceding the current batch. In a preferred embodiment, the labeling work is carried out on data of the same type, and when the labeled data of the current batch is audited, both the preset category statistical index and the preset annotation data audit model are obtained from the labeled data of the previous batch. This makes the category statistical index a more useful reference and makes the preset annotation data audit model more accurate, which in turn improves the efficiency of quality inspection of the labeled data.
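The batch-to-batch arrangement described for S204 can be sketched as a simple loop. This is only an illustration under stated assumptions, not the implementation of the disclosure: `audit_in_batches`, `flag_errors`, `train_model` and `initial_model` are hypothetical names, and the disclosure does not say what model is used before the first batch, so an initial model is simply passed in.

```python
from typing import Callable, List, Sequence, TypeVar

Item = TypeVar("Item")
Model = TypeVar("Model")


def audit_in_batches(
    batches: Sequence[List[Item]],
    initial_model: Model,
    flag_errors: Callable[[Model, List[Item]], List[Item]],
    train_model: Callable[[Model, List[Item]], Model],
) -> List[List[Item]]:
    """Audit each batch with a model updated on the batch before it."""
    model = initial_model  # assumption: some starting model is available
    flagged_per_batch = []
    for batch in batches:
        # The current batch is audited with the model from the previous batch.
        flagged_per_batch.append(flag_errors(model, batch))
        # The model is then updated with this batch, ready for the next one.
        model = train_model(model, batch)
    return flagged_per_batch


# Toy stand-ins: the "model" is a set of labels seen to be wrong so far,
# and each item is (label, was_wrong_after_audit).
batches = [[("cat", False), ("dgo", True)], [("dgo", True), ("cat", False)]]
flag = lambda model, batch: [item for item in batch if item[0] in model]
train = lambda model, batch: model | {label for label, wrong in batch if wrong}
flagged = audit_in_batches(batches, set(), flag, train)
```

In this toy run nothing is flagged in the first batch, because the starting model is empty, while the second batch is audited with what was learned from the first.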

Further, determining error labeling data in the to-be-verified labeling data according to a preset category statistical index and a preset labeling data verification model, including:

(a1) acquiring class marking error rate of the data to be checked; wherein the class labeling error rate is determined according to the labeling data labeled in the previous batch;

(a2) determining the estimated error rate of the marked data to be checked according to a preset marked data checking model;

(a3) determining the error probability value of the labeled data to be audited according to the class label error rate and the estimated error rate;

(a4) and determining the error labeling data in the labeling data to be audited according to the error probability value and the first preset threshold value.

In the action (a1), the class labeling error rate of the to-be-audited data is acquired. The class labeling error rate is determined according to the labeled data of the previous batch, where the previous batch is defined relative to the current batch of the acquired to-be-audited data. Alternatively, it may be determined according to a historical labeling data set covering all batches labeled before the current batch: the historical labeling data set is classified, the labeling error rate found for each class during quality inspection and auditing is acquired as the class labeling error rate of that class, and the class labeling error rate of the class that matches the class of the to-be-audited data is taken as the class labeling error rate of the to-be-audited data.

For example, classifying the historical labeling data set yields three classes of labeled data, A, B, and C, whose labeling error rates are A1, B1, and C1, respectively. If the to-be-audited data belongs to class A, its class labeling error rate is A1.

In the above action (a2), the estimated error rate of the to-be-audited labeled data is determined according to the preset labeled data audit model: the to-be-audited labeled data is input into the preset model, which outputs the estimated error rate. For example, inputting the labeled data of a picture into the preset labeled data audit model yields an estimated error rate of 0.6 for the picture.
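A minimal sketch of how per-class labeling error rates such as A1, B1 and C1 might be computed from audited history is shown here. The record layout `(class_name, was_wrong)` and the function name `class_error_rates` are assumptions made for illustration; the disclosure does not prescribe a data format.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def class_error_rates(history: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Return the fraction of wrongly labeled items per class.

    Each history record is (class_name, was_wrong), where was_wrong is True
    when the audit found the annotation to be incorrect.
    """
    totals = defaultdict(int)
    wrong = defaultdict(int)
    for class_name, was_wrong in history:
        totals[class_name] += 1
        if was_wrong:
            wrong[class_name] += 1
    return {name: wrong[name] / totals[name] for name in totals}


# Hypothetical audited history for three classes A, B and C.
history = [
    ("A", True), ("A", False), ("A", False), ("A", False),
    ("B", True), ("B", True),
    ("C", False), ("C", False),
]
rates = class_error_rates(history)
# The rate for to-be-audited data of class A is then looked up as rates["A"].
```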

In the above-mentioned action (a3), the error probability value of the labeled data to be audited is determined according to the class label error rate and the estimated error rate. That is to say, the error probability value of the labeled data to be audited is determined by the class labeling error rate of the labeled data to be audited and the corresponding estimated error rate.

In the action (a4), the error marked data in the marked data to be checked is determined according to the error probability value and the first preset threshold, and the marked data to be checked, which has the error probability value greater than the first preset threshold, is determined as the error marked data, where the first preset threshold may be set to 0.5 or 0.8, and no special limitation is made here. For example, if the error probability value of the to-be-audited labeled data is 0.9 and the first preset threshold value is 0.8, it is determined that the to-be-audited labeled data is the error labeled data.
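Actions (a1) to (a4) can be sketched as one pass over the to-be-audited items. This is a hedged illustration: the record layout, `find_mislabeled`, `estimate_error` and `combine` are hypothetical names, and the rule for combining the two rates is passed in as a parameter because this part of the description does not fix it (a plain product is used in the demo only).

```python
from typing import Callable, Dict, List, Sequence, Tuple

Item = Tuple[str, str]  # (item_id, class_name); a hypothetical record layout


def find_mislabeled(
    items: Sequence[Item],
    class_rates: Dict[str, float],
    estimate_error: Callable[[Item], float],
    combine: Callable[[float, float], float],
    first_threshold: float,
) -> List[str]:
    """Return the ids of items whose error probability exceeds the threshold."""
    flagged = []
    for item in items:
        item_id, class_name = item
        class_rate = class_rates[class_name]          # (a1) class labeling error rate
        estimated = estimate_error(item)              # (a2) model's estimated error rate
        probability = combine(class_rate, estimated)  # (a3) error probability value
        if probability > first_threshold:             # (a4) compare with first threshold
            flagged.append(item_id)
    return flagged


# Toy inputs: a lookup table stands in for the audit model.
items = [("img-1", "A"), ("img-2", "B")]
rates = {"A": 0.9, "B": 0.2}
estimates = {"img-1": 1.0, "img-2": 0.6}
flagged = find_mislabeled(
    items, rates, lambda it: estimates[it[0]], lambda c, e: c * e, 0.8
)
```

Here `img-1` has an error probability of 0.9 against a threshold of 0.8, matching the numeric example in the text, so only `img-1` is returned for re-labeling.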

Further, determining the error probability value of the to-be-audited annotation data according to the class annotation error rate and the estimated error rate includes:

(b1) normalizing the class labeling error rate, and determining a class labeling error index according to the processed class labeling error rate;

(b2) and determining the product of the class labeling error index and the estimated error rate as the error probability value of the to-be-audited labeling data.

In the above operation (b1), the class labeling error rate is normalized, and the class labeling error index is determined according to the processed class labeling error rate. Specifically, among the class labeling error rates of the classes of historical labeling data obtained in action (a1), the maximum value $E_{\max}$ and the minimum value $E_{\min}$ are found. If the class labeling error rate of the to-be-audited labeled data is $E$, its normalized value $X$ is calculated as $X = (E - E_{\min}) / (E_{\max} - E_{\min})$. The class labeling error index is then determined according to the processed class labeling error rate.

In the above-mentioned action (b2), the product of the class labeling error index and the estimated error rate is determined as the error probability value of the labeling data to be checked. For example, if the class label error index is 0.8 and the estimated error rate is 0.6, the error probability value of the labeled data to be audited is 0.8 × 0.6 = 0.48.
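The normalization in (b1) and the product in (b2) can be written out as two small functions. This is a sketch; the function names are hypothetical, and returning 0.0 when all class rates are equal is an assumption added to avoid division by zero, a case the description does not address.

```python
from typing import Sequence


def normalize_rate(rate: float, all_rates: Sequence[float]) -> float:
    """Min-max normalize one class labeling error rate against all class rates."""
    lowest, highest = min(all_rates), max(all_rates)
    if highest == lowest:
        # Assumption: with identical rates the formula is undefined, so use 0.0.
        return 0.0
    return (rate - lowest) / (highest - lowest)


def error_probability(class_error_index: float, estimated_error_rate: float) -> float:
    """Multiply the class labeling error index by the model's estimated error rate."""
    return class_error_index * estimated_error_rate


# Hypothetical class rates; the class under audit has E = 0.25.
all_rates = [0.25, 1.0, 0.0]
x = normalize_rate(0.25, all_rates)

# The numeric example from the text: index 0.8 and estimated error rate 0.6,
# which gives approximately 0.48 in floating-point arithmetic.
p = error_probability(0.8, 0.6)
```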

Further, the step of normalizing the class labeling error rate and determining a class labeling error index according to the processed class labeling error rate includes:

(c1) if the processed class labeling error rate is greater than a second preset threshold, determining the processed class labeling error rate as a class labeling error index; or

(c2) And if the processed class marking error rate is less than or equal to a second preset threshold, determining the second preset threshold as a class marking error index.

In the above action (c1), if the processed (that is, normalized) class labeling error rate is greater than the second preset threshold, the processed class labeling error rate is determined as the class labeling error index. The value range of the second preset threshold is (0, 1); in a preferred embodiment, the second preset threshold is 0.5.

In the above operation (c2), if the processed class label error rate is less than or equal to the second preset threshold, the second preset threshold is determined as the class label error index, for example, if the second threshold is 0.6, and the normalized class label error rate is 0.5, the class label error index is 0.6.
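Rules (c1) and (c2) together amount to using the second preset threshold as a lower bound on the index. A short sketch follows; `class_error_index` is a hypothetical name, and the default of 0.5 is taken from the preferred embodiment mentioned in the text.

```python
def class_error_index(normalized_rate: float, second_threshold: float = 0.5) -> float:
    """Return the normalized rate, but never less than the second threshold."""
    if normalized_rate > second_threshold:
        return normalized_rate   # (c1): rate above the threshold is used as is
    return second_threshold      # (c2): otherwise the threshold itself is used


# The example from the text: threshold 0.6 and normalized rate 0.5.
low = class_error_index(0.5, second_threshold=0.6)
high = class_error_index(0.9, second_threshold=0.6)
```

One possible reading of this rule is that it keeps a class with a very low historical error rate from driving the product in (b2) to zero regardless of what the audit model estimates; the description itself does not state the motivation.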

Further, determining error labeling data in the to-be-audited labeling data according to the error probability value and a first preset threshold value, wherein the method comprises the following steps:

(d1) and determining the to-be-audited marking data with the error probability value larger than the first preset threshold value as error marking data.

In the above action (d1), the to-be-audited annotation data with the error probability value greater than the first preset threshold is determined as the error annotation data, where a value range of the first preset threshold is (0,1), and in a preferred embodiment, a value of the first preset threshold is 0.6.

Further, before acquiring the annotation data to be audited, the method includes:

(e1) acquiring historical checked and labeled data, and classifying the historical checked and labeled data;

(e2) determining a class center vector corresponding to each type of audited data according to the classified historical audited marking data;

(e3) and training the constructed annotation data audit model according to the historical audited annotation data and the corresponding class center vector to obtain a preset annotation data audit model.

In the above-mentioned operation (e1), the history checked and labeled data is acquired, and the history checked and labeled data is classified. The historical checked and labeled data may be the checked and labeled data of the last batch of the labeled data to be checked and labeled of the current batch, or may be the checked and labeled data preset in the database, and no special limitation is made here. And classifying the historical checked and labeled data, namely classifying the historical checked and labeled data according to certain attributes, for example, classifying a batch of checked and labeled data into animal labeled data and plant labeled data according to category attributes.

In the above action (e2), a class center vector corresponding to each type of audited data is determined according to the classified historical audited labeled data; the class center vector may be determined according to a binary classification method.

In the action (e3), the constructed annotation data audit model is trained according to the historical audited annotation data and the corresponding class center vector, so as to obtain the preset annotation data audit model. The constructed annotation data audit model is itself obtained by training a supervised machine learning module on preset annotation data. Specifically, a part of the preset annotation data is selected and audited, that is, audit labels are added to the preset annotation data; the audited part then includes both annotation information and audit information, which are fed into the supervised machine learning module for training to obtain the constructed annotation data audit model.
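The description does not spell out how a class center vector is computed. One common choice, shown here purely as an assumption and not as the method of the disclosure, is the element-wise mean of the vectors belonging to each class; `class_centers` and the sample layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple


def class_centers(
    samples: Iterable[Tuple[str, Sequence[float]]]
) -> Dict[str, List[float]]:
    """Return the mean vector of each class (an assumed definition of 'center')."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = defaultdict(int)
    for class_name, vector in samples:
        if class_name not in sums:
            sums[class_name] = [0.0] * len(vector)
        for i, value in enumerate(vector):
            sums[class_name][i] += value
        counts[class_name] += 1
    return {
        name: [total / counts[name] for total in sums[name]] for name in sums
    }


# Hypothetical vectorized audited data, classified as animal or plant.
samples = [
    ("animal", [1.0, 0.0]),
    ("animal", [3.0, 2.0]),
    ("plant", [0.0, 4.0]),
]
centers = class_centers(samples)
```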

Specifically, the supervised machine learning module may be one or more of a convolutional neural network, a binary vector algorithm, a deep neural network, and a logistic regression algorithm, and is not particularly limited herein.

Further, training the constructed annotation data audit model according to the historical audited annotation data and the corresponding class center vector to obtain a preset annotation data audit model, and the method comprises the following steps:

(f1) acquiring the marking information and the auditing information of the historical audited marking data, determining the same information or different information of the historical audited marking data according to the marking information and the auditing information, and using the same information or different information as a first data group;

(f2) vectorizing the historical checked and labeled data, and taking the processed historical checked and labeled data and the class center vector corresponding to the historical checked and labeled data as a second data group;

(f3) and substituting the first data group and the second data group into the constructed labeled data auditing model for training to obtain a preset labeled data auditing model.

In the above operation (f1), the labeling information and the audit information of the historical audited labeled data are obtained, the same information or different information of the historical audited labeled data is determined according to the labeling information and the audit information, and the same information or different information is used as the first data group. For example, if the labeling information of a movie clip is "war" and the audit information is "science fiction", there is no same information; the different information is "war" and "science fiction", and this different information is used as the first data group. As another example, if the labeling information of a picture is "elephant" and the audit information is also "elephant", there is no different information, and the same information is "elephant".

In the above actions (f2) and (f3), the historical audited labeled data is vectorized, the vectorized data and the corresponding class center vector are used as a second data group, and the first data group and the second data group are input into the constructed labeled data audit model for training to obtain the preset labeled data audit model. For example, if the vectorized historical audited labeled data is A1 and its corresponding class center vector is A, the second data group is (A1, A); if the first data group is B, then (A1, A) and B are input together as data groups into the constructed labeled data audit model for training to obtain the preset labeled data audit model. The output computed by the preset labeled data audit model is the estimated error rate of the labeled data.
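The comparison in (f1) can be sketched as a function that returns either the shared label or the pair of differing labels. The dictionary layout and the name `compare_labels` are assumptions made for illustration; the description only says that the same or different information forms the first data group.

```python
from typing import Dict, Tuple


def compare_labels(annotation: str, audit: str) -> Dict[str, Tuple[str, ...]]:
    """Split an (annotation, audit) pair into same and different information."""
    if annotation == audit:
        return {"same": (annotation,), "different": ()}
    return {"same": (), "different": (annotation, audit)}


# The two examples from the text.
movie = compare_labels("war", "science fiction")
picture = compare_labels("elephant", "elephant")
```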
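A minimal training sketch for (f2) and (f3) follows, using logistic regression, one of the model types the description lists. Everything else is an assumption made for illustration: the feature layout (item vector concatenated with its class center vector), the target (1 when annotation and audit differ), plain batch gradient descent, and all function names. It is a toy, not the disclosed training procedure.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def make_features(vector: Sequence[float], center: Sequence[float]) -> List[float]:
    # Second data group: the item's vector paired with its class center vector.
    return list(vector) + list(center)


def train_audit_model(
    features: Sequence[Sequence[float]],
    is_wrong: Sequence[int],
    epochs: int = 500,
    learning_rate: float = 0.5,
) -> Tuple[List[float], float]:
    """Fit logistic regression by batch gradient descent; returns (weights, bias)."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, is_wrong):
            error = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            for i, v in enumerate(x):
                grad_w[i] += error * v
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


def estimated_error_rate(model: Tuple[List[float], float], x: Sequence[float]) -> float:
    weights, bias = model
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)


# Toy history: one-dimensional item vectors and a class center of [0.1].
# Target 1 means annotation and audit disagreed (derived from the first data group).
center = [0.1]
vectors = [[0.0], [0.2], [0.9], [1.0]]
targets = [0, 0, 1, 1]
model = train_audit_model([make_features(v, center) for v in vectors], targets)

far_item = estimated_error_rate(model, make_features([1.0], center))
near_item = estimated_error_rate(model, make_features([0.0], center))
```

In this toy data the items far from the class center are the ones that were mislabeled, so the trained model gives `far_item` a higher estimated error rate than `near_item`.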

In the embodiment of the invention, to-be-audited labeling data of a current batch is acquired, and error labeling data is determined in the to-be-audited labeling data according to a preset category statistical index and a preset labeling data audit model, where the labeling data audit model is obtained by update training using the labeled data of the batch preceding the current batch. Because the error labeling data is identified within the to-be-audited labeling data by means of the preset category statistical index and the preset labeling data audit model, annotators can directly re-label only the identified error labeling data, which improves the efficiency of quality inspection of the labeling data.

Further, referring to fig. 1, according to a second aspect of the present embodiment, there is provided a storage medium. The storage medium includes a stored program, wherein the method for identifying the error marking data is executed by a processor when the program runs.

Therefore, according to the embodiment, to-be-audited labeling data of the current batch is acquired, and error labeling data is determined in the to-be-audited labeling data according to the preset category statistical index and the preset labeling data audit model, where the labeling data audit model is obtained by update training using the labeled data of the batch preceding the current batch. Because the error labeling data is identified within the to-be-audited labeling data by means of the preset category statistical index and the preset labeling data audit model, annotators can directly re-label only the identified error labeling data, which improves the efficiency of quality inspection of the labeling data.

The storage medium provided by the embodiment of the present application can implement the processes in the foregoing method embodiments, and achieve the same functions and effects, which are not repeated here.

It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.

Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.

Example 2

Fig. 3 is a schematic diagram of an apparatus for identifying error marked data according to an embodiment of the present disclosure, where the apparatus 300 corresponds to a method for identifying error marked data according to embodiment 1. Referring to fig. 3, the apparatus 300 includes:

the annotation data acquisition module 301 is configured to acquire to-be-audited annotation data of a current batch;

an error data determination module 302, configured to determine error annotation data in the to-be-audited annotation data according to a preset category statistical indicator and a preset annotation data audit model, wherein the annotation data audit model is obtained by update training using the annotation data of the batch preceding the current batch.

Optionally, the error data determination module 302 is specifically configured to:

acquire the class annotation error rate of the to-be-audited annotation data, wherein the class annotation error rate is determined according to historical audited annotation data;

determine the estimated error rate of the to-be-audited annotation data according to the preset annotation data audit model;

determine the error probability value of the to-be-audited annotation data according to the class annotation error rate and the estimated error rate; and

determine the error annotation data in the to-be-audited annotation data according to the error probability value and a first preset threshold.

Optionally, the error data determination module 302 is further specifically configured to:

normalize the class annotation error rate, and determine a class annotation error index according to the normalized class annotation error rate, wherein if the normalized class annotation error rate is greater than a second preset threshold, the normalized class annotation error rate is taken as the class annotation error index, and if the normalized class annotation error rate is less than or equal to the second preset threshold, the second preset threshold is taken as the class annotation error index;

determine the product of the class annotation error index and the estimated error rate as the error probability value of the to-be-audited annotation data; and

determine the to-be-audited annotation data whose error probability value is greater than the first preset threshold as error annotation data.
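The scoring performed by module 302 can be sketched as follows. This is only an illustrative sketch, not the patented implementation: the names `Item`, `class_error_index`, and `flag_errors` are hypothetical, the normalization is assumed to be division by the largest class error rate (the text does not fix a normalization method), and classes absent from the history are assumed to fall back to the second preset threshold.

```python
from dataclasses import dataclass


@dataclass
class Item:
    item_id: str
    label: str              # class assigned by the annotator
    estimated_error: float  # estimated error rate from the audit model, in [0, 1]


def class_error_index(class_error_rates, second_threshold):
    """Normalize per-class error rates and floor them at second_threshold."""
    peak = max(class_error_rates.values(), default=0.0)
    index = {}
    for label, rate in class_error_rates.items():
        # Assumption: normalize by dividing by the largest class error rate.
        normalized = rate / peak if peak > 0 else 0.0
        if normalized > second_threshold:
            index[label] = normalized
        else:
            index[label] = second_threshold
    return index


def flag_errors(items, class_error_rates, first_threshold, second_threshold):
    """Return (item_id, error probability value) for items above first_threshold."""
    index = class_error_index(class_error_rates, second_threshold)
    flagged = []
    for item in items:
        # Assumption: a class unseen in the history uses the floor value.
        weight = index.get(item.label, second_threshold)
        probability = weight * item.estimated_error
        if probability > first_threshold:
            flagged.append((item.item_id, probability))
    return flagged


# Hypothetical data: per-class error rates from historical audited annotations.
rates = {"refund": 0.20, "greeting": 0.02}
batch = [
    Item("a", "refund", 0.6),
    Item("b", "greeting", 0.9),
    Item("c", "refund", 0.4),
]
suspects = flag_errors(batch, rates, first_threshold=0.5, second_threshold=0.3)
```

Flooring the index at the second preset threshold keeps a class with a very low historical error rate from suppressing a high model estimate entirely.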

Optionally, the apparatus further includes an audit model module configured to, before the to-be-audited annotation data is acquired:

acquire historical audited annotation data, and classify the historical audited annotation data;

determine a class center vector corresponding to each class of audited data according to the classified historical audited annotation data; and

train the constructed annotation data audit model according to the historical audited annotation data and the corresponding class center vectors to obtain the preset annotation data audit model.
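As a minimal sketch of the class center step, the example below assumes that a class center vector is the element-wise mean of the feature vectors belonging to that class; this passage does not state how the center is computed, so the mean is an assumption, and the function name `class_centers` is hypothetical.

```python
from collections import defaultdict


def class_centers(vectors, labels):
    """Return {label: mean vector} for equally sized feature vectors."""
    grouped = defaultdict(list)
    for vector, label in zip(vectors, labels):
        grouped[label].append(vector)

    centers = {}
    for label, members in grouped.items():
        # zip(*members) walks the vectors dimension by dimension.
        centers[label] = [sum(dim) / len(members) for dim in zip(*members)]
    return centers


# Hypothetical two-dimensional vectors for three audited items.
centers = class_centers(
    [[1.0, 3.0], [3.0, 5.0], [0.0, 2.0]],
    ["a", "a", "b"],
)
```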

Optionally, the audit model module is specifically configured to:

obtain the annotation information and the audit information of the historical audited annotation data, determine, according to the annotation information and the audit information, same-information or different-information of the historical audited annotation data, and take the same-information or different-information as a first data group;

vectorize the historical audited annotation data, and take the vectorized historical audited annotation data and the class center vector corresponding to the historical audited annotation data as a second data group; and

substitute the first data group and the second data group into the constructed annotation data audit model for training to obtain the preset annotation data audit model.
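The construction of the two data groups might look like the sketch below. Several points are assumptions made only for illustration: the same-or-different information is encoded as 0 (annotation agrees with the audit result) or 1 (annotation differs), the class center is looked up by the annotated class, and `vectorize` is a caller-supplied placeholder because the vectorization method is not specified here. The function name `build_training_groups` is hypothetical.

```python
def build_training_groups(records, vectorize, centers):
    """Build the two data groups from (text, annotated_label, audited_label) records."""
    first_group = []   # 0 = annotation matches the audit result, 1 = differs
    second_group = []  # (item vector, class center vector) pairs
    for text, annotated, audited in records:
        first_group.append(0 if annotated == audited else 1)
        # Assumption: the center is looked up by the annotated class.
        second_group.append((vectorize(text), centers[annotated]))
    return first_group, second_group


# Hypothetical inputs; the toy vectorizer just measures text length.
records = [
    ("pay back", "refund", "refund"),
    ("hello", "refund", "greeting"),
]
toy_centers = {"refund": [1.0, 0.0]}
targets, features = build_training_groups(
    records, lambda text: [float(len(text))], toy_centers
)
```

Under these assumptions, the first data group would serve as the training target and the second as the model input, so the trained model estimates how likely an annotation is to disagree with an auditor.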

The apparatus for identifying error annotation data provided by this embodiment of the application can implement each process of the foregoing method embodiment and achieve the same functions and effects, which are not repeated here.

Example 3

Fig. 4 is a schematic diagram of an apparatus for identifying error annotation data according to another embodiment of the disclosure, wherein the apparatus 400 corresponds to the method according to the first aspect of Embodiment 1. Referring to Fig. 4, the apparatus 400 includes: a processor 410; and a memory 420 coupled to the processor 410 and configured to provide the processor 410 with instructions for performing the following processing steps:

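acquiring to-be-audited annotation data of a current batch; and

determining error annotation data in the to-be-audited annotation data according to a preset category statistical index and a preset annotation data audit model, wherein the annotation data audit model is obtained by update training with the annotation data of the batch preceding the current batch.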

Optionally, determining error annotation data in the annotation data to be audited according to a preset category statistical index and a preset annotation data audit model, including:

Optionally, determining an error probability value of the to-be-audited annotation data according to the class annotation error rate and the estimated error rate includes:

Optionally, normalizing the class labeling error rate, and determining a class labeling error index according to the processed class labeling error rate, including:

Optionally, determining error annotation data in the to-be-audited annotation data according to the error probability value and a first preset threshold, including:

Optionally, before the obtaining of the annotation data to be audited, the method includes:

Optionally, training the constructed annotation data audit model according to the historical audited annotation data and the corresponding class center vector to obtain the preset annotation data audit model, including:

The device for identifying the wrong labeling data provided by the embodiment of the application can realize each process in the method embodiment and achieve the same function and effect, and the process is not repeated here.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.

In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and various other media capable of storing program code.

The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and refinements without departing from the principle of the present invention, and these modifications and refinements should also be regarded as falling within the protection scope of the present invention.

Claims

1. A method for identifying mislabeled data, comprising:

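acquiring to-be-audited annotation data of a current batch; and

determining error annotation data in the to-be-audited annotation data according to a preset category statistical index and a preset annotation data audit model, wherein the annotation data audit model is obtained by update training with the annotation data of the batch preceding the current batch.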

2. The method according to claim 1, wherein determining the error labeled data in the labeled data to be checked according to a preset category statistical index and a preset labeled data checking model comprises:

acquiring the class annotation error rate of the to-be-audited annotation data, wherein the class annotation error rate is determined according to the annotation data of the previous batch;

3. The method of claim 2, wherein determining the error probability value of the to-be-audited annotation data according to the class annotation error rate and the estimated error rate comprises:

4. The method of claim 3, wherein normalizing the class labeling error rate and determining a class labeling error index according to the processed class labeling error rate comprises:

5. The method according to claim 2, wherein determining the error annotation data in the annotation data to be audited according to the error probability value and a first preset threshold comprises:

6. The method according to claim 1, wherein before acquiring the annotation data to be reviewed, the method comprises:

7. The method of claim 6, wherein training the constructed annotation data audit model according to the historical audited annotation data and the corresponding class center vector to obtain the preset annotation data audit model comprises:

8. A storage medium comprising a stored program, wherein the method of identifying mislabelling data as claimed in any of claims 1 to 7 is performed by a processor when the program is run.

9. An apparatus for identifying mislabeled data, comprising:

10. An apparatus for identifying mislabeled data, comprising:

a processor; and

acquiring to-be-audited marking data of a current batch; and