CN112749150A - Method, device and medium for identifying error marking data - Google Patents

Publication number
CN112749150A
Authority
CN
China
Prior art keywords
data
error
audited
marking
class
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911055046.2A
Other languages
Chinese (zh)
Other versions
CN112749150B (en
Inventor
刘睿
靳丁南
罗欢
权圣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhongguancun Kejin Technology Co Ltd
Original Assignee
Beijing Zhongguancun Kejin Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhongguancun Kejin Technology Co Ltd
Priority to CN201911055046.2A
Publication of CN112749150A
Application granted
Publication of CN112749150B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Automatic Analysis And Handling Materials Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application discloses a method, a device and a storage medium for identifying mislabeled data. The method comprises: acquiring the to-be-audited annotation data of a current batch, and determining the mislabeled data among the to-be-audited annotation data according to a preset category statistical index and a preset annotation data audit model, wherein the annotation data audit model is obtained by update training on the annotation data of the batch preceding the current batch. These embodiments can improve the efficiency of quality inspection of annotation data.

Description

Method, device and medium for identifying error marking data
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, and a medium for identifying mislabeled data.
Background
With the development of artificial intelligence, demand keeps growing for annotation data, which is regarded as the "grain" of the artificial intelligence field. After annotators label the data, professionals must manually audit it as a quality inspection step, and the labeled data is deemed qualified only if the quality inspection shows that it reaches the required accuracy.
In the existing quality inspection step, a portion of a batch of annotation data is sampled for auditing and the accuracy of the sampled data is calculated. If the accuracy does not reach the standard, the whole batch is judged to have failed quality inspection, and the annotators must re-label the batch until it passes. This requires considerable extra work and makes quality inspection of annotation data inefficient.
The embodiment of the disclosure provides a method, a device and a medium for identifying wrong labeling data, so as to improve the quality inspection working efficiency of the labeling data.
Disclosure of Invention
The embodiment of the disclosure provides a method and a device for identifying error labeling data and a storage medium, which can improve the quality inspection working efficiency of the labeling data.
To solve the above technical problem, the embodiment of the present invention is implemented as follows:
in a first aspect, an embodiment of the present disclosure provides a method for identifying error marked data, including:
acquiring to-be-audited annotation data of a current batch; and
determining mislabeled data among the to-be-audited annotation data according to a preset category statistical index and a preset annotation data audit model, wherein the annotation data audit model is obtained by update training on the annotation data of the batch preceding the current batch.
In a second aspect, the disclosed embodiments further provide a storage medium that includes a stored program, wherein, when the program runs, a processor executes the method for identifying mislabeled data according to the first aspect.
In a third aspect, there is further provided an apparatus for identifying incorrectly labeled data according to an embodiment of the present disclosure, including:
an annotation data acquisition module, used for acquiring to-be-audited annotation data of a current batch; and
an error data determining module, used for determining mislabeled data among the to-be-audited annotation data according to a preset category statistical index and a preset annotation data audit model, wherein the annotation data audit model is obtained by update training on the annotation data of the batch preceding the current batch.
In a fourth aspect, an embodiment of the present disclosure further provides an apparatus for identifying incorrectly labeled data, including:
a processor; and
a memory coupled to the processor for providing the processor with instructions for the following processing steps:
acquiring to-be-audited annotation data of a current batch; and
determining mislabeled data among the to-be-audited annotation data according to a preset category statistical index and a preset annotation data audit model, wherein the annotation data audit model is obtained by update training on the annotation data of the batch preceding the current batch.
In the embodiment of the invention, the to-be-audited annotation data of a current batch is acquired, and mislabeled data is determined among it according to a preset category statistical index and a preset annotation data audit model, where the annotation data audit model is obtained by update training on the annotation data of the batch preceding the current batch. Because the mislabeled data is identified directly by the preset category statistical index and the preset annotation data audit model, annotators can re-label just the identified mislabeled data, which improves the efficiency of quality inspection of annotation data.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the disclosure and together with the description serve to explain the disclosure and not to limit the disclosure. In the drawings:
FIG. 1 is a block diagram of a hardware structure of a computing device for implementing a method for identifying mislabeled data according to an embodiment of the present disclosure;
FIG. 2 is a flowchart illustrating a method for identifying incorrect annotation data according to an embodiment of the disclosure;
FIG. 3 is a schematic diagram of an apparatus for identifying mislabeling data according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of an apparatus for identifying mislabeled data according to another embodiment of the present disclosure.
Detailed Description
In order to make those skilled in the art better understand the technical solutions of the present disclosure, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure. It is to be understood that the described embodiments are merely exemplary of some, and not all, of the present disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
According to the present embodiment, an embodiment of a method for identifying mislabeled data is provided. It should be noted that the steps illustrated in the flowchart of the drawings may be performed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is illustrated in the flowchart, in some cases the steps may be performed in an order different from the one shown or described here.
The method embodiments provided by the present embodiment may be executed in a mobile terminal, a computer terminal, a server or a similar computing device. FIG. 1 shows a hardware block diagram of a computing device for implementing a method for identifying mislabeled data. As shown in FIG. 1, the computing device may include one or more processors (which may include, but are not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)), a memory for storing data, and a transmission device for communication functions. In addition, the computing device may include: a display, an input/output interface (I/O interface), a Universal Serial Bus (USB) port (which may be included as one of the ports of the I/O interface), a network interface, a power source, and/or a camera. It will be understood by those skilled in the art that the structure shown in FIG. 1 is only an illustration and is not intended to limit the structure of the electronic device. For example, the computing device may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.
It should be noted that the one or more processors and/or other data processing circuitry described above may be referred to generally herein as "data processing circuitry". The data processing circuitry may be embodied in whole or in part in software, hardware, firmware, or any combination thereof. Further, the data processing circuitry may be a single, stand-alone processing module, or incorporated in whole or in part into any of the other elements in the computing device. As referred to in the disclosed embodiments, the data processing circuit acts as a processor control (e.g., selection of a variable resistance termination path connected to the interface).
The memory may be used to store software programs and modules of application software, such as program instructions/data storage devices corresponding to the identification method of error marking data in the embodiments of the present disclosure, and the processor executes various functional applications and data processing by running the software programs and modules stored in the memory, that is, implementing the above-mentioned identification method of error marking data of the application program. The memory may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memory may further include memory located remotely from the processor, which may be connected to the computing device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device is used for receiving or transmitting data via a network. Specific examples of such networks may include wireless networks provided by communication providers of the computing devices. In one example, the transmission device includes a Network adapter (NIC) that can be connected to other Network devices through a base station to communicate with the internet. In one example, the transmission device may be a Radio Frequency (RF) module, which is used for communicating with the internet in a wireless manner.
The display may be, for example, a touch screen type Liquid Crystal Display (LCD) that may enable a user to interact with a user interface of the computing device.
It should be noted here that in some alternative embodiments, the computing device shown in FIG. 1 described above may include hardware elements (including circuitry), software elements (including computer code stored on a computer-readable medium), or a combination of both hardware and software elements. It should be noted that FIG. 1 is only one specific example and is intended to illustrate the types of components that may be present in a computing device as described above.
In the above operating environment, the embodiment provides a method for identifying mislabeled data. FIG. 2 is a schematic flow chart of a method for identifying mislabeled data according to an embodiment of the disclosure. Referring to FIG. 2, the method includes:
S202: acquiring to-be-audited annotation data of a current batch;
S204: determining mislabeled data among the to-be-audited annotation data according to a preset category statistical index and a preset annotation data audit model, wherein the annotation data audit model is obtained by update training on the annotation data of the batch preceding the current batch.
In the embodiment of the invention, the to-be-audited annotation data of a current batch is acquired, and mislabeled data is determined among it according to a preset category statistical index and a preset annotation data audit model, where the annotation data audit model is obtained by update training on the annotation data of the batch preceding the current batch. Because the mislabeled data is identified directly by the preset category statistical index and the preset annotation data audit model, annotators can re-label just the identified mislabeled data, which improves the efficiency of quality inspection of annotation data.
In step S202, the to-be-audited annotation data of the current batch is acquired. To-be-audited annotation data is data of any kind that annotators have labeled and that must then be audited to confirm that it reaches the accuracy required by quality inspection. Its data type may be image, video, audio, text, or another type; no special limitation is imposed here. For example, if two pieces of audio data are labeled as a zebra's cry and a lion's call, those two labeled pieces of audio data are to-be-audited annotation data.
In step S204, mislabeled data is determined among the to-be-audited annotation data according to the preset category statistical index and the preset annotation data audit model, where the model is obtained by update training on the annotation data of the batch preceding the current batch. In a preferred embodiment, annotation work is carried out on data of the same type, and both the preset category statistical index and the preset annotation data audit model used to audit the current batch are obtained from the annotation data of the previous batch. This makes the category statistical index a better reference and improves the accuracy of the preset annotation data audit model, which in turn improves the efficiency of quality inspection of annotation data.
Further, determining error labeling data in the to-be-verified labeling data according to a preset category statistical index and a preset labeling data verification model, including:
(a1) acquiring class marking error rate of the data to be checked; wherein the class labeling error rate is determined according to the labeling data labeled in the previous batch;
(a2) determining the estimated error rate of the marked data to be checked according to a preset marked data checking model;
(a3) determining the error probability value of the labeled data to be audited according to the class label error rate and the estimated error rate;
(a4) and determining the error labeling data in the labeling data to be audited according to the error probability value and the first preset threshold value.
In action (a1), the class labeling error rate of the data to be audited is acquired. The class labeling error rate is determined from the annotation data of the previous batch, where "previous" is relative to the current batch containing the data to be audited. Alternatively, it may be determined from the historical annotation data set covering all batches labeled before the current batch: the historical annotation data set is classified, the labeling error rate found for each class during quality inspection and auditing is taken as that class's labeling error rate, and the error rate of the class that matches the class of the data to be audited is used as the class labeling error rate of the data to be audited.
For example, the historical annotation data set is classified into three classes of annotation data, A, B and C, whose labeling error rates are A1, B1 and C1 respectively. If the data to be audited belongs to class A, its class labeling error rate is A1.
In action (a2), the estimated error rate of the to-be-audited annotation data is determined according to the preset annotation data audit model: the to-be-audited annotation data is fed into the model, which outputs its estimated error rate. For example, feeding the annotation data of a picture into the preset annotation data audit model may yield an estimated error rate of 0.6 for that picture.
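The per-class error rate described above can be sketched as follows. This is a minimal illustration, assuming each historical record is a hypothetical `(category, annotation, audit_result)` tuple and that a record counts as an error when the annotation differs from the audit result; the record layout and the function name `class_error_rates` are not from the source.

```python
from collections import defaultdict

def class_error_rates(history):
    """Return {category: error rate} for audited historical records.

    history: iterable of (category, annotation, audit_result) tuples
    (a hypothetical record layout chosen for this sketch).
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for category, annotation, audit_result in history:
        totals[category] += 1
        # Assumption: a record is an error when the auditor disagreed.
        if annotation != audit_result:
            errors[category] += 1
    return {c: errors[c] / totals[c] for c in totals}

history = [
    ("A", "cat", "cat"), ("A", "dog", "cat"),
    ("B", "oak", "oak"), ("B", "elm", "elm"),
    ("C", "car", "bus"),
]
rates = class_error_rates(history)
# A to-be-audited item of class "A" takes rates["A"] as its class labeling error rate.
```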
In the above-mentioned action (a3), the error probability value of the labeled data to be audited is determined according to the class label error rate and the estimated error rate. That is to say, the error probability value of the labeled data to be audited is determined by the class labeling error rate of the labeled data to be audited and the corresponding estimated error rate.
In the action (a4), the error marked data in the marked data to be checked is determined according to the error probability value and the first preset threshold, and the marked data to be checked, which has the error probability value greater than the first preset threshold, is determined as the error marked data, where the first preset threshold may be set to 0.5 or 0.8, and no special limitation is made here. For example, if the error probability value of the to-be-audited labeled data is 0.9 and the first preset threshold value is 0.8, it is determined that the to-be-audited labeled data is the error labeled data.
Further, determining the error probability value of the to-be-audited annotation data according to the class annotation error rate and the estimated error rate includes:
(b1) normalizing the class labeling error rate, and determining a class labeling error index according to the processed class labeling error rate;
(b2) and determining the product of the class labeling error index and the estimated error rate as the error probability value of the to-be-audited labeling data.
In action (b1), the class labeling error rate is normalized, and the class labeling error index is determined from the normalized class labeling error rate. Specifically, among the class labeling error rates of all classes of historical annotation data obtained in action (a1), the maximum value $E_{\max}$ and the minimum value $E_{\min}$ are found. If the class labeling error rate of the to-be-audited annotation data is $E$, its normalized value $X$ is calculated as $X = (E - E_{\min}) / (E_{\max} - E_{\min})$. The class labeling error index is then determined from the normalized class labeling error rate.
In action (b2), the product of the class labeling error index and the estimated error rate is determined as the error probability value of the to-be-audited annotation data. For example, if the class labeling error index is 0.8 and the estimated error rate is 0.6, the error probability value of the to-be-audited annotation data is 0.8 × 0.6 = 0.48.
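The normalization in action (b1) is ordinary min-max scaling and might be written as below. The function name `normalize_error_rate` and the sample rates are illustrative assumptions; the source also does not say what to do when all classes share the same error rate, so returning 0.0 in that case is a choice made only for this sketch.

```python
def normalize_error_rate(e, all_rates):
    """Min-max normalize one class error rate against all class error rates."""
    all_rates = list(all_rates)
    e_min, e_max = min(all_rates), max(all_rates)
    if e_max == e_min:
        # Assumption: the source does not cover this degenerate case.
        return 0.0
    return (e - e_min) / (e_max - e_min)

# Hypothetical per-class error rates from historical audits.
rates = {"A": 0.10, "B": 0.30, "C": 0.50}
x = normalize_error_rate(rates["B"], rates.values())
```

With these sample values the class with the lowest rate maps to 0, the highest to 1, and class B lands halfway between them.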
Further, the step of normalizing the class labeling error rate and determining a class labeling error index according to the processed class labeling error rate includes:
(c1) if the processed class labeling error rate is greater than a second preset threshold, determining the processed class labeling error rate as a class labeling error index; or
(c2) And if the processed class marking error rate is less than or equal to a second preset threshold, determining the second preset threshold as a class marking error index.
In action (c1), if the normalized class labeling error rate is greater than the second preset threshold, the normalized class labeling error rate is determined as the class labeling error index. The second preset threshold takes a value in the range (0, 1); in a preferred embodiment it is 0.5.
In action (c2), if the normalized class labeling error rate is less than or equal to the second preset threshold, the second preset threshold itself is determined as the class labeling error index. For example, if the second preset threshold is 0.6 and the normalized class labeling error rate is 0.5, the class labeling error index is 0.6.
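Taken together, branches (c1) and (c2) use the second preset threshold as a floor on the normalized rate. A minimal sketch, with the function name `class_error_index` and the default of 0.5 (the preferred value mentioned in the text) chosen for illustration:

```python
def class_error_index(normalized_rate, second_threshold=0.5):
    """Return the class labeling error index per branches (c1)/(c2)."""
    if normalized_rate > second_threshold:
        return normalized_rate   # (c1): keep the normalized rate
    return second_threshold      # (c2): floor at the threshold

index_low = class_error_index(0.5, second_threshold=0.6)   # (c2) branch
index_high = class_error_index(0.8, second_threshold=0.5)  # (c1) branch
```

The two branches are equivalent to `max(normalized_rate, second_threshold)`; one effect of the floor is that a class with a low historical error rate never drives the product in action (b2) all the way to zero.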
Further, determining error labeling data in the to-be-audited labeling data according to the error probability value and a first preset threshold value, wherein the method comprises the following steps:
(d1) and determining the to-be-audited marking data with the error probability value larger than the first preset threshold value as error marking data.
In the above action (d1), the to-be-audited annotation data with the error probability value greater than the first preset threshold is determined as the error annotation data, where a value range of the first preset threshold is (0,1), and in a preferred embodiment, a value of the first preset threshold is 0.6.
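The product rule of action (b2) and the comparison of action (d1) can be combined into a short filter. This is a sketch only: the item identifiers, the tuple layout and the function names are hypothetical, and the default first preset threshold of 0.6 is the preferred value given in the text.

```python
def error_probability(class_index, estimated_error_rate):
    """Action (b2): product of class error index and model estimate."""
    return class_index * estimated_error_rate

def find_mislabeled(items, first_threshold=0.6):
    """Action (d1): keep items whose error probability exceeds the threshold.

    items: list of (item_id, class_index, estimated_error_rate) tuples
    (a hypothetical layout for this sketch).
    """
    return [
        item_id
        for item_id, class_index, estimate in items
        if error_probability(class_index, estimate) > first_threshold
    ]

batch = [
    ("img-1", 0.8, 0.6),   # about 0.48, below the threshold
    ("img-2", 1.0, 0.9),   # 0.9, above the threshold
    ("img-3", 0.5, 0.95),  # about 0.475, below the threshold
]
flagged = find_mislabeled(batch)
```

Only the flagged items would be handed back to annotators for re-labeling, which is the efficiency gain the description claims over re-labeling a whole batch.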
Further, before acquiring the annotation data to be audited, the method includes:
(e1) acquiring historical checked and labeled data, and classifying the historical checked and labeled data;
(e2) determining a class center vector corresponding to each type of audited data according to the classified historical audited marking data;
(e3) and training the constructed annotation data audit model according to the historical audited annotation data and the corresponding class center vector to obtain a preset annotation data audit model.
In the above-mentioned operation (e1), the history checked and labeled data is acquired, and the history checked and labeled data is classified. The historical checked and labeled data may be the checked and labeled data of the last batch of the labeled data to be checked and labeled of the current batch, or may be the checked and labeled data preset in the database, and no special limitation is made here. And classifying the historical checked and labeled data, namely classifying the historical checked and labeled data according to certain attributes, for example, classifying a batch of checked and labeled data into animal labeled data and plant labeled data according to category attributes.
In action (e2), a class center vector corresponding to each class of audited data is determined from the classified historical audited annotation data, using a binary classification method.
In action (e3), the constructed annotation data audit model is trained on the historical audited annotation data and the corresponding class center vectors to obtain the preset annotation data audit model. The constructed annotation data audit model is itself obtained by training a supervised machine learning module on preset annotation data. Specifically, part of the preset annotation data is selected and given audit labels, so that this audited part carries both annotation information and audit information; the annotation information and audit information are then fed into the supervised machine learning module for training, which yields the constructed annotation data audit model.
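The source does not spell out how a class center vector is computed. One common reading, used here purely as an assumption, is the centroid: the element-wise mean of the vectors of all audited items in a class. The function name `class_centers` and the sample vectors are illustrative.

```python
from collections import defaultdict

def class_centers(samples):
    """Return {category: centroid vector}.

    samples: iterable of (category, vector) pairs with equal-length vectors.
    Assumption: the class center is the element-wise mean of the class.
    """
    sums = {}
    counts = defaultdict(int)
    for category, vector in samples:
        if category not in sums:
            sums[category] = [0.0] * len(vector)
        sums[category] = [s + v for s, v in zip(sums[category], vector)]
        counts[category] += 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}

samples = [
    ("animal", [1.0, 0.0]),
    ("animal", [3.0, 2.0]),
    ("plant", [0.0, 4.0]),
]
centers = class_centers(samples)
```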
Specifically, the supervised machine learning module may be one or more of a convolutional neural network, a binary vector algorithm, a deep neural network, and a logistic regression algorithm, and is not particularly limited herein.
Further, training the constructed annotation data audit model according to the historical audited annotation data and the corresponding class center vector to obtain a preset annotation data audit model, and the method comprises the following steps:
(f1) acquiring the marking information and the auditing information of the historical audited marking data, determining the same information or different information of the historical audited marking data according to the marking information and the auditing information, and using the same information or different information as a first data group;
(f2) vectorizing the historical audited annotation data, and taking the vectorized historical audited annotation data and the class center vector corresponding to it as a second data group;
(f3) and substituting the first data group and the second data group into the constructed labeled data auditing model for training to obtain a preset labeled data auditing model.
In action (f1), the annotation information and the audit information of the historical audited annotation data are acquired, the information that is the same or different between the two is determined, and that same or different information is used as the first data group. For example, if the annotation information of a movie clip is "war" and the audit information is "science fiction", there is no same information, and the different information, "war" and "science fiction", is used as the first data group. As another example, if both the annotation information and the audit information of a picture are "elephant", there is no different information, and the same information is "elephant".
In actions (f2) and (f3), the historical audited annotation data is vectorized, the vectorized data and its corresponding class center vector are used as the second data group, and the first data group and the second data group are fed into the constructed annotation data audit model for training to obtain the preset annotation data audit model. For example, if the vectorized historical audited annotation data is A1 and its corresponding class center vector is A, the second data group is (A1, A); if the first data group is B, then (A1, A) and B are fed together into the constructed annotation data audit model for training. The output computed by the resulting preset annotation data audit model is the estimated error rate of the annotation data.
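A minimal sketch of building the first data group from one audited record, assuming annotation and audit information are single comparable labels. The dictionary layout with `"same"` and `"different"` keys and the function name `first_data_group` are illustrative choices, not something the source specifies.

```python
def first_data_group(annotation, audit):
    """Split one audited record into same / different information."""
    if annotation == audit:
        return {"same": [annotation], "different": []}
    return {"same": [], "different": [annotation, audit]}

movie = first_data_group("war", "science fiction")
picture = first_data_group("elephant", "elephant")
```

An empty `"different"` list means the auditor agreed with the annotator, which is the signal a supervised model could later use as its training label.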
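As a rough illustration of the training step, the sketch below fits a one-feature logistic regression in plain Python. Everything beyond the general idea is an assumption: the model input is reduced to the Euclidean distance between an item's vector and its class center (second data group), the training label is 1 when annotation and audit disagreed (derived from the first data group), and logistic regression stands in for whichever supervised learner is actually used. The function names and sample data are hypothetical.

```python
import math

def distance(vector, center):
    """Euclidean distance between an item vector and its class center."""
    return math.sqrt(sum((v - c) ** 2 for v, c in zip(vector, center)))

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def train(samples, epochs=2000, lr=0.5):
    """Fit weight and bias by batch gradient descent on log loss.

    samples: list of (vector, class_center, is_error), is_error in {0, 1}.
    """
    w, b = 0.0, 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for vector, center, is_error in samples:
            x = distance(vector, center)
            p = sigmoid(w * x + b)
            grad_w += (p - is_error) * x
            grad_b += p - is_error
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

def estimated_error_rate(model, vector, center):
    """Model output, read as the estimated error rate of one item."""
    w, b = model
    return sigmoid(w * distance(vector, center) + b)

center = [0.0, 0.0]
training = [
    ([0.1, 0.0], center, 0), ([0.0, 0.2], center, 0), ([0.3, 0.1], center, 0),
    ([2.0, 1.5], center, 1), ([1.8, 2.2], center, 1), ([2.5, 0.5], center, 1),
]
model = train(training)
near = estimated_error_rate(model, [0.1, 0.1], center)
far = estimated_error_rate(model, [2.0, 2.0], center)
```

On this toy data, items close to their class center receive a low estimate and items far from it a high one; a real system would use richer features than a single distance.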
In the embodiment of the invention, the to-be-audited annotation data of a current batch is acquired, and mislabeled data is determined among it according to a preset category statistical index and a preset annotation data audit model, where the annotation data audit model is obtained by update training on the annotation data of the batch preceding the current batch. Because the mislabeled data is identified directly by the preset category statistical index and the preset annotation data audit model, annotators can re-label just the identified mislabeled data, which improves the efficiency of quality inspection of annotation data.
Further, referring to FIG. 1, according to a second aspect of the present embodiment, a storage medium is provided. The storage medium includes a stored program, wherein, when the program runs, a processor executes the method for identifying mislabeled data described above.
Therefore, according to the embodiment, the to-be-audited annotation data of the current batch is acquired, and mislabeled data is determined among it according to the preset category statistical index and the preset annotation data audit model, where the annotation data audit model is obtained by update training on the annotation data of the batch preceding the current batch. Because the mislabeled data is identified directly by the preset category statistical index and the preset annotation data audit model, annotators can re-label just the identified mislabeled data, which improves the efficiency of quality inspection of annotation data.
The storage medium provided by the embodiment of the present application can implement the processes in the foregoing method embodiments, and achieve the same functions and effects, which are not repeated here.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
Example 2
Fig. 3 is a schematic diagram of an apparatus 300 for identifying mislabeled data according to an embodiment of the present disclosure; the apparatus 300 corresponds to the method for identifying mislabeled data according to embodiment 1. Referring to fig. 3, the apparatus 300 includes:
an annotation data acquisition module 301, configured to acquire annotation data to be audited of a current batch; and
an error data determination module 302, configured to determine mislabeled data in the annotation data to be audited according to a preset category statistical index and a preset annotation data audit model, wherein the annotation data audit model is obtained by update training with the annotation data of the batch preceding the current batch.
Optionally, the error data determination module 302 is specifically configured to:
acquire the class annotation error rate of the annotation data to be audited, wherein the class annotation error rate is determined according to historical audited annotation data;
determine the estimated error rate of the annotation data to be audited according to the preset annotation data audit model;
determine the error probability value of the annotation data to be audited according to the class annotation error rate and the estimated error rate; and
determine the mislabeled data in the annotation data to be audited according to the error probability value and a first preset threshold.
Optionally, the error data determination module 302 is further specifically configured to:
normalize the class annotation error rate, and determine a class annotation error index according to the normalized class annotation error rate; and
determine the product of the class annotation error index and the estimated error rate as the error probability value of the annotation data to be audited.
Optionally, the error data determination module 302 is further specifically configured to:
if the normalized class annotation error rate is greater than a second preset threshold, determine the normalized class annotation error rate as the class annotation error index; or
if the normalized class annotation error rate is less than or equal to the second preset threshold, determine the second preset threshold as the class annotation error index.
Optionally, the error data determination module 302 is further specifically configured to:
determine the annotation data to be audited whose error probability value is greater than the first preset threshold as mislabeled data.
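The decision logic of module 302 can be sketched as below. The normalization scheme is not specified in this passage, so min-max scaling across classes is assumed here; the function names, the tuple layout of `items`, and the threshold values in the usage lines are likewise hypothetical.

```python
def normalize(class_rates):
    """Min-max scale per-class error rates to [0, 1] (assumed scheme)."""
    lo, hi = min(class_rates.values()), max(class_rates.values())
    if hi == lo:
        return {cls: 0.0 for cls in class_rates}
    return {cls: (r - lo) / (hi - lo) for cls, r in class_rates.items()}


def class_error_index(normalized_rate, second_threshold):
    """Use the normalized rate if it exceeds the second preset threshold,
    otherwise fall back to the threshold itself."""
    if normalized_rate > second_threshold:
        return normalized_rate
    return second_threshold


def find_mislabeled(items, class_rates, first_threshold, second_threshold):
    """items: iterable of (item_id, class_name, estimated_error_rate).

    Returns the ids whose error probability value (class error index
    times estimated error rate) exceeds the first preset threshold.
    """
    normalized = normalize(class_rates)
    flagged = []
    for item_id, cls, estimated in items:
        index = class_error_index(normalized[cls], second_threshold)
        if index * estimated > first_threshold:
            flagged.append(item_id)
    return flagged


rates = {"cat": 0.05, "dog": 0.25, "bird": 0.15}
batch = [("x1", "cat", 0.9), ("x2", "dog", 0.4),
         ("x3", "bird", 0.5), ("x4", "bird", 0.8)]
flagged = find_mislabeled(batch, rates, first_threshold=0.3, second_threshold=0.2)
```

The floor imposed by the second preset threshold keeps items in a historically reliable class from being scored at zero regardless of what the audit model estimates.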
Optionally, the apparatus further includes an audit model module configured to, before the annotation data to be audited is acquired:
acquire historical audited annotation data, and classify the historical audited annotation data;
determine, according to the classified historical audited annotation data, a class center vector corresponding to each class of audited annotation data; and
train the constructed annotation data audit model according to the historical audited annotation data and the corresponding class center vector to obtain the preset annotation data audit model.
Optionally, the audit model module is specifically configured to:
obtain the annotation information and the audit information of the historical audited annotation data, determine the same information or the different information of the historical audited annotation data according to the annotation information and the audit information, and use the same information or the different information as a first data group;
vectorize the historical audited annotation data, and use the vectorized historical audited annotation data and its corresponding class center vector as a second data group; and
feed the first data group and the second data group into the constructed annotation data audit model for training to obtain the preset annotation data audit model.
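How the class center vector is computed is not stated in this passage. A common choice, assumed here purely for illustration, is the element-wise mean of the vectors in each class; the function name and input layout are hypothetical.

```python
from collections import defaultdict


def class_center_vectors(samples):
    """Compute one center vector per class as the element-wise mean.

    samples: iterable of (class_name, vector) pairs, where all vectors
    have the same length. Using the mean as the center is an assumption.
    """
    sums = {}
    counts = defaultdict(int)
    for cls, vec in samples:
        if cls not in sums:
            sums[cls] = list(vec)
        else:
            sums[cls] = [s + x for s, x in zip(sums[cls], vec)]
        counts[cls] += 1
    return {cls: [s / counts[cls] for s in total]
            for cls, total in sums.items()}


centers = class_center_vectors([
    ("a", [1.0, 2.0]),
    ("a", [3.0, 4.0]),
    ("b", [0.0, 0.0]),
])
```

Each vectorized sample can then be paired with the center of its own class to form the second data group.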
In the embodiment of the present invention, the annotation data to be audited of the current batch is obtained, and mislabeled data is determined in the annotation data to be audited according to a preset category statistical index and a preset annotation data audit model; the annotation data audit model is obtained by update training with the annotation data of the batch preceding the current batch. Because the mislabeled data is identified in the annotation data to be audited by means of the preset category statistical index and the preset annotation data audit model, annotators can directly re-annotate the identified mislabeled data, which improves the efficiency of quality inspection of the annotation data.
The apparatus for identifying mislabeled data provided by this embodiment of the present application can implement each process of the foregoing method embodiment and achieve the same functions and effects, which are not repeated here.
Example 3
Fig. 4 is a schematic diagram of an apparatus 400 for identifying mislabeled data according to another embodiment of the present disclosure; the apparatus 400 corresponds to the method according to the first aspect of embodiment 1. Referring to fig. 4, the apparatus 400 includes: a processor 410; and a memory 420 coupled to the processor 410 and configured to provide the processor 410 with instructions for the following processing steps:
acquiring annotation data to be audited of a current batch; and
determining mislabeled data in the annotation data to be audited according to a preset category statistical index and a preset annotation data audit model, wherein the annotation data audit model is obtained by update training with the annotation data of the batch preceding the current batch.
Optionally, determining mislabeled data in the annotation data to be audited according to a preset category statistical index and a preset annotation data audit model includes:
acquiring the class annotation error rate of the annotation data to be audited, wherein the class annotation error rate is determined according to historical audited annotation data;
determining the estimated error rate of the annotation data to be audited according to the preset annotation data audit model;
determining the error probability value of the annotation data to be audited according to the class annotation error rate and the estimated error rate; and
determining the mislabeled data in the annotation data to be audited according to the error probability value and a first preset threshold.
Optionally, determining the error probability value of the annotation data to be audited according to the class annotation error rate and the estimated error rate includes:
normalizing the class annotation error rate, and determining a class annotation error index according to the normalized class annotation error rate; and
determining the product of the class annotation error index and the estimated error rate as the error probability value of the annotation data to be audited.
Optionally, normalizing the class annotation error rate and determining a class annotation error index according to the normalized class annotation error rate includes:
if the normalized class annotation error rate is greater than a second preset threshold, determining the normalized class annotation error rate as the class annotation error index; or
if the normalized class annotation error rate is less than or equal to the second preset threshold, determining the second preset threshold as the class annotation error index.
Optionally, determining the mislabeled data in the annotation data to be audited according to the error probability value and the first preset threshold includes:
determining the annotation data to be audited whose error probability value is greater than the first preset threshold as mislabeled data.
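The class annotation error rate is described only as being determined from historical audited annotation data. One straightforward reading, assumed here, is the fraction of items in each class whose annotation disagreed with the audit result; the record layout and function name are hypothetical.

```python
from collections import Counter


def class_error_rates(history):
    """Per-class share of historical items whose annotation was overturned.

    history: iterable of (class_name, annotation_info, audit_info).
    Treating any disagreement as one error is an assumption of this sketch.
    """
    totals, errors = Counter(), Counter()
    for cls, annotation_info, audit_info in history:
        totals[cls] += 1
        if annotation_info != audit_info:
            errors[cls] += 1
    return {cls: errors[cls] / totals[cls] for cls in totals}


history = [
    ("animal", "elephant", "elephant"),
    ("animal", "cat", "dog"),
    ("film", "war", "science fiction"),
    ("film", "war", "war"),
    ("film", "comedy", "comedy"),
    ("film", "drama", "drama"),
]
rates = class_error_rates(history)
```

Recomputing these rates after each audited batch would keep the statistical index aligned with the most recent annotation quality.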
Optionally, before acquiring the annotation data to be audited, the steps include:
acquiring historical audited annotation data, and classifying the historical audited annotation data;
determining, according to the classified historical audited annotation data, a class center vector corresponding to each class of audited annotation data; and
training the constructed annotation data audit model according to the historical audited annotation data and the corresponding class center vector to obtain the preset annotation data audit model.
Optionally, training the constructed annotation data audit model according to the historical audited annotation data and the corresponding class center vector to obtain the preset annotation data audit model includes:
obtaining the annotation information and the audit information of the historical audited annotation data, determining the same information or the different information of the historical audited annotation data according to the annotation information and the audit information, and using the same information or the different information as a first data group;
vectorizing the historical audited annotation data, and using the vectorized historical audited annotation data and its corresponding class center vector as a second data group; and
feeding the first data group and the second data group into the constructed annotation data audit model for training to obtain the preset annotation data audit model.
In the embodiment of the present invention, the annotation data to be audited of the current batch is obtained, and mislabeled data is determined in the annotation data to be audited according to a preset category statistical index and a preset annotation data audit model; the annotation data audit model is obtained by update training with the annotation data of the batch preceding the current batch. Because the mislabeled data is identified in the annotation data to be audited by means of the preset category statistical index and the preset annotation data audit model, annotators can directly re-annotate the identified mislabeled data, which improves the efficiency of quality inspection of the annotation data.
The apparatus for identifying mislabeled data provided by this embodiment of the present application can implement each process of the foregoing method embodiment and achieve the same functions and effects, which are not repeated here.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and various other media capable of storing program code.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and refinements without departing from the principle of the present invention, and these modifications and refinements should also be regarded as falling within the protection scope of the present invention.

Claims (10)

1. A method for identifying mislabeled data, comprising:
acquiring annotation data to be audited of a current batch; and
determining mislabeled data in the annotation data to be audited according to a preset category statistical index and a preset annotation data audit model, wherein the annotation data audit model is obtained by update training with the annotation data of the batch preceding the current batch.
2. The method according to claim 1, wherein determining the mislabeled data in the annotation data to be audited according to the preset category statistical index and the preset annotation data audit model comprises:
acquiring the class annotation error rate of the annotation data to be audited, wherein the class annotation error rate is determined according to the annotation data of the previous batch;
determining the estimated error rate of the annotation data to be audited according to the preset annotation data audit model;
determining the error probability value of the annotation data to be audited according to the class annotation error rate and the estimated error rate; and
determining the mislabeled data in the annotation data to be audited according to the error probability value and a first preset threshold.
3. The method according to claim 2, wherein determining the error probability value of the annotation data to be audited according to the class annotation error rate and the estimated error rate comprises:
normalizing the class annotation error rate, and determining a class annotation error index according to the normalized class annotation error rate; and
determining the product of the class annotation error index and the estimated error rate as the error probability value of the annotation data to be audited.
4. The method according to claim 3, wherein normalizing the class annotation error rate and determining a class annotation error index according to the normalized class annotation error rate comprises:
if the normalized class annotation error rate is greater than a second preset threshold, determining the normalized class annotation error rate as the class annotation error index; or
if the normalized class annotation error rate is less than or equal to the second preset threshold, determining the second preset threshold as the class annotation error index.
5. The method according to claim 2, wherein determining the mislabeled data in the annotation data to be audited according to the error probability value and the first preset threshold comprises:
determining the annotation data to be audited whose error probability value is greater than the first preset threshold as mislabeled data.
6. The method according to claim 1, wherein before acquiring the annotation data to be audited, the method comprises:
acquiring historical audited annotation data, and classifying the historical audited annotation data;
determining, according to the classified historical audited annotation data, a class center vector corresponding to each class of audited annotation data; and
training the constructed annotation data audit model according to the historical audited annotation data and the corresponding class center vector to obtain the preset annotation data audit model.
7. The method according to claim 6, wherein training the constructed annotation data audit model according to the historical audited annotation data and the corresponding class center vector to obtain the preset annotation data audit model comprises:
obtaining the annotation information and the audit information of the historical audited annotation data, determining the same information or the different information of the historical audited annotation data according to the annotation information and the audit information, and using the same information or the different information as a first data group;
vectorizing the historical audited annotation data, and using the vectorized historical audited annotation data and its corresponding class center vector as a second data group; and
feeding the first data group and the second data group into the constructed annotation data audit model for training to obtain the preset annotation data audit model.
8. A storage medium comprising a stored program, wherein the method for identifying mislabeled data according to any one of claims 1 to 7 is executed by a processor when the program runs.
9. An apparatus for identifying mislabeled data, comprising:
an annotation data acquisition module, configured to acquire annotation data to be audited of a current batch; and
an error data determination module, configured to determine mislabeled data in the annotation data to be audited according to a preset category statistical index and a preset annotation data audit model, wherein the annotation data audit model is obtained by update training with the annotation data of the batch preceding the current batch.
10. An apparatus for identifying mislabeled data, comprising:
a processor; and
a memory coupled to the processor and configured to provide the processor with instructions for the following processing steps:
acquiring annotation data to be audited of a current batch; and
determining mislabeled data in the annotation data to be audited according to a preset category statistical index and a preset annotation data audit model, wherein the annotation data audit model is obtained by update training with the annotation data of the batch preceding the current batch.
CN201911055046.2A 2019-10-31 2019-10-31 Error labeling data identification method, device and medium Active CN112749150B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911055046.2A CN112749150B (en) 2019-10-31 2019-10-31 Error labeling data identification method, device and medium

Publications (2)

Publication Number Publication Date
CN112749150A true CN112749150A (en) 2021-05-04
CN112749150B CN112749150B (en) 2023-11-03

Family

ID=75645548

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911055046.2A Active CN112749150B (en) 2019-10-31 2019-10-31 Error labeling data identification method, device and medium

Country Status (1)

Country Link
CN (1) CN112749150B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114565360A (en) * 2022-03-01 2022-05-31 北京鉴智科技有限公司 Method and device for auditing labeled data, electronic equipment and readable storage medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102662930A (en) * 2012-04-16 2012-09-12 乐山师范学院 Corpus tagging method and corpus tagging device
US20180032900A1 (en) * 2016-07-27 2018-02-01 International Business Machines Corporation Greedy Active Learning for Reducing Labeled Data Imbalances
CN108171335A (en) * 2017-12-06 2018-06-15 东软集团股份有限公司 Choosing method, device, storage medium and the electronic equipment of modeling data
CN108228557A (en) * 2016-12-14 2018-06-29 北京国双科技有限公司 A kind of method and device of sequence labelling
CN108536662A (en) * 2018-04-16 2018-09-14 苏州大学 A kind of data mask method and device
CN108734296A (en) * 2017-04-21 2018-11-02 北京京东尚科信息技术有限公司 Optimize method, apparatus, electronic equipment and the medium of the training data of supervised learning
CN109389275A (en) * 2017-08-08 2019-02-26 北京图森未来科技有限公司 A kind of image labeling method and device
CN109635110A (en) * 2018-11-30 2019-04-16 北京百度网讯科技有限公司 Data processing method, device, equipment and computer readable storage medium
CN109993315A (en) * 2019-03-29 2019-07-09 联想(北京)有限公司 A kind of data processing method, device and electronic equipment

Also Published As

Publication number Publication date
CN112749150B (en) 2023-11-03

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant