CN112749725A - Method, device and medium for processing labeled data - Google Patents

Publication number
CN112749725A
Authority
CN
China
Prior art keywords
data
error
labeled
marking
marked
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911056493.XA
Other languages
Chinese (zh)
Inventor
刘睿
靳丁南
罗欢
权圣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhongguancun Kejin Technology Co Ltd
Original Assignee
Beijing Zhongguancun Kejin Technology Co Ltd
Application filed by Beijing Zhongguancun Kejin Technology Co Ltd
Priority to CN201911056493.XA
Publication of CN112749725A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/40 Software arrangements specially adapted for pattern recognition, e.g. user interfaces or toolboxes therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification

Abstract

The application discloses a method, a device, and a storage medium for processing annotation data. The method comprises: acquiring annotation data annotated by an annotator; determining erroneous annotation data in the annotation data by using a pre-constructed annotation data audit model, wherein the audit model is trained on audited annotation data; and sending the erroneous annotation data to the annotator. This embodiment can improve both the utilization of audited annotation data and the efficiency of re-annotating the annotation data.

Description

Method, device and medium for processing labeled data
Technical Field
The present application relates to the field of big data, and in particular, to a method, an apparatus, and a medium for processing annotation data.
Background
With the development of communication technology, demand for annotated data in fields such as artificial intelligence is increasing day by day, and high annotation accuracy is required in both image recognition and text classification.
At present, data is annotated manually by annotators. A portion of each batch of annotated data is then sampled for auditing, and the accuracy of the batch is calculated from the audit. If the accuracy does not reach the standard, the batch is judged unqualified and the annotators are required to re-annotate the whole batch until its accuracy is qualified.
This process has the following problems. First, the audited annotation data is not put to any further use, so data utilization is low. Second, when a batch with unqualified accuracy is returned to the annotators, they do not know which annotations are wrong and can only re-annotate all of the data, so re-annotation is inefficient.
Embodiments of the present disclosure provide a method, an apparatus, and a medium for processing annotation data, so as to improve the utilization of audited annotation data and the efficiency of re-annotating the annotation data.
Disclosure of Invention
The embodiment of the disclosure provides a processing method and device of labeled data and a storage medium, which can improve the utilization rate of the labeled data after being audited and the labeling efficiency of labeling the labeled data.
To solve the above technical problem, the embodiment of the present invention is implemented as follows:
In a first aspect, an embodiment of the present disclosure provides a method for processing annotation data, including:
acquiring annotation data annotated by an annotator;
determining erroneous annotation data in the annotation data by using a pre-constructed annotation data audit model, wherein the audit model is trained on audited annotation data;
and sending the erroneous annotation data to the annotator.
In a second aspect, the disclosed embodiments further provide a storage medium, which includes a stored program, wherein the processing method of the annotation data as described in the first aspect above is executed by a processor when the program runs.
In a third aspect, there is also provided an apparatus for processing annotation data according to an embodiment of the present disclosure, including:
an annotation data acquisition module, used for acquiring annotation data annotated by an annotator;
an error data determining module, used for determining erroneous annotation data in the annotation data by using a pre-constructed annotation data audit model, wherein the audit model is trained on audited annotation data;
and an annotation data sending module, used for sending the erroneous annotation data to the annotator.
In a fourth aspect, an embodiment of the present disclosure further provides a device for processing annotation data, including:
a processor;
a memory coupled to the processor and providing the processor with instructions for the following processing steps:
acquiring annotation data annotated by an annotator;
determining erroneous annotation data in the annotation data by using a pre-constructed annotation data audit model, wherein the audit model is trained on audited annotation data;
and sending the erroneous annotation data to the annotator.
In the embodiments of the present disclosure, annotation data annotated by an annotator is acquired; erroneous annotation data in it is determined by using a pre-constructed annotation data audit model, which is trained on audited annotation data; and the erroneous annotation data is sent to the annotator. Because the audit model identifies which annotations are wrong, the annotator only needs to re-annotate the erroneous annotation data, which improves the efficiency of re-annotation. Because the audit model is trained on audited annotation data, the utilization of the audited annotation data is also improved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the disclosure and together with the description serve to explain the disclosure and not to limit the disclosure. In the drawings:
fig. 1 is a block diagram of a hardware structure of a computing device for implementing a method for processing annotation data according to an embodiment of the present disclosure;
fig. 2 is a schematic flowchart of a processing method of annotation data according to an embodiment of the disclosure;
FIG. 3 is a schematic diagram of a device for processing annotation data according to an embodiment of the disclosure;
fig. 4 is a schematic diagram of a device for processing annotation data according to another embodiment of the disclosure.
Detailed Description
In order to make those skilled in the art better understand the technical solutions of the present disclosure, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure. It is to be understood that the described embodiments are merely exemplary of some, and not all, of the present disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
This embodiment provides a method for processing annotation data. It should be noted that the steps shown in the flowchart may be executed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is shown in the flowchart, in some cases the steps shown or described may be executed in a different order.
The method embodiments provided by the present embodiment may be executed in a mobile terminal, a computer terminal, a server or a similar computing device. Fig. 1 shows a hardware block diagram of a computing device for implementing a method for processing annotation data. As shown in fig. 1, the computing device may include one or more processors (which may include, but are not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA), a memory for storing data, and a transmission device for communication functions. Besides, the method can also comprise the following steps: a display, an input/output interface (I/O interface), a Universal Serial Bus (USB) port (which may be included as one of the ports of the I/O interface), a network interface, a power source, and/or a camera. It will be understood by those skilled in the art that the structure shown in fig. 1 is only an illustration and is not intended to limit the structure of the electronic device. For example, the computing device may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.
It should be noted that the one or more processors and/or other data processing circuitry described above may be referred to generally herein as "data processing circuitry". The data processing circuitry may be embodied in whole or in part in software, hardware, firmware, or any combination thereof. Further, the data processing circuitry may be a single, stand-alone processing module, or incorporated in whole or in part into any of the other elements in the computing device. As referred to in the disclosed embodiments, the data processing circuit acts as a processor control (e.g., selection of a variable resistance termination path connected to the interface).
The memory can be used for storing software programs and modules of application software, such as program instructions/data storage devices corresponding to the processing method of the annotation data in the embodiment of the disclosure, and the processor executes various functional applications and data processing by running the software programs and modules stored in the memory, that is, the processing method of the annotation data of the application program is realized. The memory may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memory may further include memory located remotely from the processor, which may be connected to the computing device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device is used for receiving or transmitting data via a network. Specific examples of such networks may include wireless networks provided by communication providers of the computing devices. In one example, the transmission device includes a Network adapter (NIC) that can be connected to other Network devices through a base station to communicate with the internet. In one example, the transmission device may be a Radio Frequency (RF) module, which is used for communicating with the internet in a wireless manner.
The display may be, for example, a touch screen type Liquid Crystal Display (LCD) that may enable a user to interact with a user interface of the computing device.
It should be noted here that in some alternative embodiments, the computing device shown in fig. 1 may include hardware elements (including circuitry), software elements (including computer code stored on a computer-readable medium), or a combination of both. FIG. 1 is only one specific example and is intended to illustrate the types of components that may be present in such a computing device.
In the above operating environment, the present embodiment provides a method for processing annotation data. Fig. 2 is a schematic flow chart of a processing method of annotation data according to an embodiment of the present disclosure, and referring to fig. 2, the method includes:
S202: acquiring annotation data annotated by an annotator;
S204: determining erroneous annotation data in the annotation data by using a pre-constructed annotation data audit model, wherein the audit model is trained on audited annotation data;
S206: sending the erroneous annotation data to the annotator.
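Steps S202 to S206 can be sketched in Python as below. This is a minimal illustration, not the patented implementation: the function names, the dictionary fields, the `toy_audit_model` stand-in, and the use of a fixed probability threshold as the judgment condition are all assumptions made for the example.

```python
from typing import Callable, Dict, List


def process_annotations(
    annotations: List[Dict],
    audit_model: Callable[[Dict], float],
    threshold: float,
    send: Callable[[List[Dict]], None],
) -> List[Dict]:
    """Run S204 and S206 on a batch already acquired in S202."""
    # S204: keep the items whose estimated error probability exceeds the threshold.
    flagged = [item for item in annotations if audit_model(item) > threshold]
    # S206: send only the flagged items back to the annotator.
    send(flagged)
    return flagged


# Hypothetical stand-in for a trained audit model: it just reads a stored score.
def toy_audit_model(item: Dict) -> float:
    return item["p_error"]


# S202: a batch of annotated items (illustrative values only).
batch = [
    {"id": 1, "label": "cat", "p_error": 0.1},
    {"id": 2, "label": "dog", "p_error": 0.7},
    {"id": 3, "label": "rabbit", "p_error": 0.9},
]
outbox: List[Dict] = []  # stands in for the channel back to the annotator
flagged = process_annotations(batch, toy_audit_model, 0.5, outbox.extend)
print([item["id"] for item in flagged])
```

Passing the model and the sending channel as callables keeps the sketch independent of any particular learning library or messaging mechanism.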
In step S202, annotation data annotated by the annotator is acquired. The annotation data may be labels applied to pictures, text, audio, video, or other data, and is not specially limited here. For example, after five photos are annotated, the result is: the first is labeled cat, the second dog, the third rabbit, the fourth dog, and the fifth panda.
In the step S204, the error annotation data in the annotation data is determined according to the annotation data by using a pre-constructed annotation data audit model, wherein the annotation data audit model is obtained by training with the audited annotation data.
Specifically, when a batch of annotation data is determined to be unqualified data and needs to be returned to the annotating personnel for re-annotation, in order to improve the efficiency of re-annotation, the annotating personnel needs to be informed of which annotation data are wrong. Alternatively, a pre-audit may be required prior to a manual audit, preferably automatically by machine.
In order to achieve the purpose, in this embodiment, the annotation data is substituted into a pre-constructed annotation data auditing model, and the error annotation data in the annotation data is determined, where the annotation data auditing model is obtained by using the audited annotation data for training. The checked labeled data may be checked labeled data in a preselected labeled data sample, or may be checked labeled data obtained by screening part of the labeled data from the labeled data, and no special limitation is imposed here.
In step S206, the erroneous annotation data is sent to the annotator, so that the annotator re-annotates only the erroneous items. This improves the quality of the annotation data and the efficiency of re-annotation.
Further, determining error annotation data in the annotation data by using a pre-constructed annotation data auditing model according to the annotation data, wherein the method comprises the following steps:
(a1) substituting the labeled data into a pre-constructed labeled data auditing model to obtain the error probability of the labeled data;
(a2) and determining the error labeling data in the labeling data according to the error probability of the labeling data and a preset judgment condition.
In actions (a1) and (a2), the annotation data is input into the pre-constructed annotation data audit model to obtain an error probability for each piece of annotation data, and the erroneous annotation data is determined according to these error probabilities and a preset judgment condition. The judgment condition may use a proportion threshold, which may be set according to the accuracy that the annotation data needs to reach or set to another value; no special limitation is made here. For example, five pieces of annotation data a, b, c, d, and e are input into the audit model, the error probabilities obtained are 0.5, 0.2, 0.4, 0.6, and 0.8 respectively, and the proportion threshold is 0.1; the erroneous annotation data is then determined from these error probabilities and the preset judgment condition.
Further, determining the error labeling data in the labeling data according to the error probability of the labeling data and a preset judgment condition, including:
(b1) arranging the annotation data in order of decreasing error probability;
(b2) calculating a preset number of erroneous annotations in the annotation data according to a preset ratio;
(b3) determining the first preset number of items in the arranged annotation data as erroneous annotation data.
In the above-mentioned actions (b1) and (b2), the labeled data are arranged in the order of decreasing error probability, and the preset number of the error labeled data in the labeled data is calculated according to the preset ratio, that is, the product of the preset ratio and the number of the labeled data is the preset number of the error labeled data in the labeled data.
In action (b3), the first preset number of items in the arranged annotation data are determined as erroneous annotation data. For example, if the arranged annotation data is A, B, C, D, E and the calculated preset number is 2, then A and B are determined as erroneous data.
In one embodiment, 10 pieces of annotation data a, b, c, d, e, f, g, h, i, j are arranged in order of decreasing error probability as g, j, f, i, h, a, d, e, c, b. The preset ratio is 0.2, so the preset number is 10 × 0.2 = 2, and g and j are determined as erroneous data.
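The ratio-based selection in (b1) to (b3) can be sketched as follows. The function name `flag_by_ratio` is hypothetical, and the individual probabilities are invented only so that they reproduce the ordering g, j, f, i, h, a, d, e, c, b from the embodiment. The source does not say how a non-integer product is handled; rounding to the nearest integer is an assumption here.

```python
from typing import Dict, List


def flag_by_ratio(error_probs: Dict[str, float], ratio: float) -> List[str]:
    """Return the items treated as erroneous under a preset ratio."""
    # (b1) arrange items in order of decreasing error probability.
    ranked = sorted(error_probs, key=error_probs.get, reverse=True)
    # (b2) preset number = preset ratio x number of items (rounding is an assumption).
    preset_number = round(len(ranked) * ratio)
    # (b3) the first preset_number items are determined as erroneous.
    return ranked[:preset_number]


# Illustrative probabilities chosen to give the order g, j, f, i, h, a, d, e, c, b.
probs = {
    "a": 0.50, "b": 0.10, "c": 0.20, "d": 0.40, "e": 0.30,
    "f": 0.80, "g": 0.95, "h": 0.60, "i": 0.70, "j": 0.90,
}
print(flag_by_ratio(probs, 0.2))
```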
Further, determining the error labeling data in the labeling data according to the error probability of the labeling data and a preset judgment condition, including:
(c1) screening a predetermined amount of labeled data from the labeled data for auditing to obtain the error rate of the labeled data;
(c2) determining the number of the error marked data needing to be checked and marked when the marked data reach the preset accuracy rate according to the error rate of the marked data and the preset accuracy rate of the marked data;
(c3) arranging the labeled data according to the sequence of the error probability from large to small;
(c4) and sequentially auditing and labeling the arranged labeled data until the quantity of the error labeled data is determined to reach the quantity of the error labeled data needing to be audited and labeled.
In actions (c1) and (c2), a predetermined number of pieces of annotation data are sampled from the annotation data and audited to obtain the error rate of the annotation data, and the number of erroneous annotations that must be audited and re-annotated for the annotation data to reach the preset accuracy is determined from that error rate and the preset accuracy. Here, auditing means judging whether an annotation is correct, and may include correcting an erroneous annotation. In one embodiment, 10 of 100 pieces of annotation data are sampled for auditing, and the audit finds that two of the 10 are erroneous, so the error rate is 20%. The preset accuracy required for this quality inspection is 100%, so the number of erroneous annotations among the other 90 pieces is calculated as 90 × 20% = 18.
In another embodiment, 50 of 200 pieces of annotation data are sampled for auditing, and the audit finds that 10 of the 50 are erroneous, so the error rate is 20%. The preset accuracy required for this quality inspection is 90%, so the number of erroneous annotations to be corrected among the other 150 pieces is calculated as 150 × 20% - 150 × (1 - 90%) = 30 - 15 = 15.
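The arithmetic of (c1) and (c2) in the two embodiments above can be checked with a short sketch. The function name and parameter names are hypothetical, and rounding the result to the nearest integer (and clamping it at zero) is an assumption that the source does not state.

```python
def errors_to_correct(total: int, sampled: int, sampled_errors: int,
                      target_accuracy: float) -> int:
    """Estimate how many erroneous annotations must be found and re-annotated."""
    # (c1) error rate observed in the audited sample.
    error_rate = sampled_errors / sampled
    remaining = total - sampled
    # Expected number of errors among the items that were not sampled.
    expected_errors = remaining * error_rate
    # Errors that may remain while still meeting the preset accuracy.
    tolerated_errors = remaining * (1 - target_accuracy)
    # (c2) errors that must be corrected; rounding and clamping are assumptions.
    return max(0, round(expected_errors - tolerated_errors))


print(errors_to_correct(100, 10, 2, 1.0))   # first embodiment
print(errors_to_correct(200, 50, 10, 0.9))  # second embodiment
```

The two calls correspond to 90 × 20% = 18 and 150 × 20% - 150 × (1 - 90%) = 15 respectively.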
In actions (c3) and (c4), the annotation data is arranged in order of decreasing error probability, and the arranged annotation data is audited in sequence until the number of erroneous annotations found reaches the number that needs to be audited and re-annotated. Alternatively, after the annotation data is arranged in order of decreasing error probability, the first items of the arranged data, equal in number to the calculated number of erroneous annotations, may be determined directly as erroneous annotation data. For example, 10 pieces of annotation data are arranged in order of decreasing error probability as a, b, c, d, e, f, g, h, i, and j, and the calculated number of erroneous annotations is 2. The arranged data is audited in sequence until 2 erroneous annotations have been found, and those two are re-annotated so that the preset accuracy is reached; or the first 2 pieces of the arranged data may be determined directly as erroneous annotation data, that is, a and b.
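The sequential audit in (c3) and (c4) can be sketched as a loop that stops once enough erroneous annotations have been found. The function `audit_until` and the `is_wrong` callback (standing in for the auditor's judgment) are hypothetical, and the set of truly wrong items is invented for illustration.

```python
from typing import Callable, List, Tuple


def audit_until(ranked: List[str], is_wrong: Callable[[str], bool],
                needed: int) -> Tuple[List[str], int]:
    """Audit items in ranked order until `needed` erroneous annotations are found."""
    found: List[str] = []
    audited = 0
    for item in ranked:
        if len(found) >= needed:
            break  # (c4) stop once the required number has been reached
        audited += 1
        if is_wrong(item):
            found.append(item)
    return found, audited


# (c3) items already arranged in order of decreasing error probability.
ranked = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
truly_wrong = {"a", "c", "f"}  # illustrative ground truth, unknown in practice
found, audited = audit_until(ranked, lambda item: item in truly_wrong, needed=2)
print(found, audited)
```

Because the items most likely to be wrong come first, the auditor in this toy run inspects only three items to find the two required errors.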
Further, determining erroneous annotation data in the annotation data by using the pre-constructed annotation data audit model may further include: sampling a predetermined number of pieces of annotation data for auditing to obtain the error rate of the annotation data; determining, from that error rate and the preset accuracy, the number of erroneous annotations that must be audited and re-annotated for the annotation data to reach the preset accuracy; arranging the annotation data in order of decreasing error probability; and determining the first items of the arranged annotation data, equal in number to a preset multiple of the calculated number, as erroneous annotation data. For example, 10 pieces of annotation data are arranged in order of decreasing error probability as a, b, c, d, e, f, g, h, i, and j, the calculated number of erroneous annotations is 2, and the preset multiple is 2; the first 2 × 2 = 4 pieces of the arranged annotation data are determined as erroneous data, that is, a, b, c, and d.
Further, before obtaining the annotation data annotated by the annotated person, the method comprises:
(d1) acquiring audited annotation data, wherein the audited annotation data includes annotation information and audit information;
(d2) obtaining the annotation data audit model through a supervised machine learning module according to the annotation information and the audit information.
In action (d1), audited annotation data including annotation information and audit information is acquired. For example, the audited annotation data is the audit result for an annotated picture, where the annotation information is that the picture shows a willow tree and the audit information is that this annotation is wrong.
In action (d2), the annotation information and the audit information are input into the supervised machine learning module for training, and the annotation data audit model is obtained.
Further, obtaining the annotation data audit model through the supervised machine learning module according to the annotation information and the audit information includes:
(e1) determining the audit result of the audited annotation data according to the annotation information and the audit information in the audited annotation data;
(e2) inputting the annotation information, the audit information, and the audit result into the supervised machine learning module for training to obtain the annotation data audit model.
In actions (e1) and (e2), the audit result of the audited annotation data is determined from its annotation information and audit information, and the annotation information, audit information, and audit result are input into the supervised machine learning module for training to obtain the annotation data audit model. For example, the audited annotation data is a picture of a persimmon that was annotated as an apple, and the audit information is that the annotation is wrong. The picture, its annotation "apple", and the audit information indicating an annotation error can therefore serve as one sample. Alternatively, the audited annotation data is a picture of an apple, the annotation information is apple, and the audit information is that the annotation is correct. By analogy, a plurality of audited pictures and their corresponding audit results are taken as training samples to form a sample set. The sample set is then input into the supervised machine learning module, and the annotation data audit model is obtained by training on multiple groups of data.
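Assembling the sample set described in (e1) and (e2) might look like the sketch below. The `AuditedRecord` type, its field names, the file names, and the encoding of the audit result as 1 (annotation wrong) or 0 (annotation correct) are all assumptions for illustration; the training step itself is left out because the source does not fix a particular learner.

```python
from typing import List, NamedTuple, Tuple


class AuditedRecord(NamedTuple):
    content: str     # stand-in for the picture, e.g. a file name
    annotation: str  # label given by the annotator
    audit_info: str  # auditor's verdict: "correct" or "error"


Sample = Tuple[Tuple[str, str], int]


def build_samples(records: List[AuditedRecord]) -> List[Sample]:
    """Turn audited records into (features, target) pairs for supervised training."""
    samples: List[Sample] = []
    for record in records:
        # (e1) the audit result becomes the training target: 1 means "annotation wrong".
        target = 1 if record.audit_info == "error" else 0
        samples.append(((record.content, record.annotation), target))
    return samples


records = [
    AuditedRecord("persimmon.jpg", "apple", "error"),  # persimmon annotated as apple
    AuditedRecord("apple.jpg", "apple", "correct"),    # apple annotated as apple
]
samples = build_samples(records)
print(samples)
```

The resulting list of pairs is the kind of sample set that would then be passed to whichever supervised learner is chosen in (e2).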
Further, the supervised machine learning module is one or more of a convolutional neural network, a binary vector algorithm, a deep neural network and a logistic regression algorithm.
Specifically, for example, the convolutional neural network may be configured to generate a two-dimensional vector from input annotated image data, where the two elements of the vector indicate the probabilities that the image annotation is correct and wrong, respectively. When the probability of an annotation error is greater than a preset threshold, the image annotation can be judged to be wrong.
The convolutional layers, pooling layers, fully connected layers, and softmax classifier of the convolutional neural network may be configured according to actual needs and are not the focus of this embodiment. Any configuration may be used as long as it can classify an input picture as correctly or incorrectly annotated.
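The final decision step described above, a softmax over two outputs followed by a threshold test, can be sketched in plain Python. The network that produces the two raw scores is omitted; the function names, the example scores, and the default threshold of 0.5 are assumptions for illustration.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def is_mislabeled(scores: List[float], threshold: float = 0.5) -> bool:
    """scores = [score for 'annotation correct', score for 'annotation wrong']."""
    p_correct, p_wrong = softmax(scores)
    # The annotation is judged wrong when its error probability exceeds the threshold.
    return p_wrong > threshold


print(is_mislabeled([0.3, 1.2]))   # higher score on the "wrong" output
print(is_mislabeled([2.0, -1.0]))  # higher score on the "correct" output
```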
Further, referring to fig. 1, according to a second aspect of the present embodiment, a storage medium is provided. The storage medium includes a stored program, wherein, when the program runs, a processor executes the annotation data processing method described above.
The storage medium provided by the embodiment of the present application can implement the processes in the foregoing method embodiments, and achieve the same functions and effects, which are not repeated here.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
Example 2
Fig. 3 is a schematic diagram of an apparatus 300 for processing annotation data according to an embodiment of the disclosure, where the apparatus 300 corresponds to the method for processing annotation data according to embodiment 1. Referring to fig. 3, the apparatus 300 includes:
an annotation data acquisition module 301, configured to acquire annotation data annotated by an annotator;
an error data determination module 302, configured to determine erroneous annotation data in the annotation data by applying a pre-constructed annotation data auditing model to the annotation data, where the annotation data auditing model is trained on audited annotation data;
and an annotation data sending module 303, configured to send the erroneous annotation data to the annotator.
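The division into three modules can be sketched as a small class in which each module is a pluggable callable. This is only an illustration: the names `Annotation`, `AnnotationProcessor`, `acquire`, `audit_model`, and `send`, and the use of a simple probability threshold as the judgment condition, are assumptions introduced for the example and do not come from the embodiment.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Annotation:
    item_id: str
    label: str
    annotator: str


class AnnotationProcessor:
    """Wires together acquisition (301), error determination (302), sending (303)."""

    def __init__(
        self,
        acquire: Callable[[], List[Annotation]],
        audit_model: Callable[[Annotation], float],
        send: Callable[[str, Annotation], None],
        threshold: float = 0.5,
    ) -> None:
        self.acquire = acquire          # stands in for module 301
        self.audit_model = audit_model  # returns an error probability, used by module 302
        self.send = send                # stands in for module 303
        self.threshold = threshold      # assumed judgment condition

    def run(self) -> List[Annotation]:
        data = self.acquire()
        # Keep the items whose error probability exceeds the threshold.
        wrong = [a for a in data if self.audit_model(a) > self.threshold]
        # Return each suspected error to the annotator who produced it.
        for a in wrong:
            self.send(a.annotator, a)
        return wrong
```

Passing the modules in as callables keeps the sketch independent of any particular storage, model, or messaging backend.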
Optionally, the error data determination module 302 is specifically configured to:
input the annotation data into the pre-constructed annotation data auditing model to obtain an error probability for the annotation data;
and determine the erroneous annotation data in the annotation data according to the error probability and a preset judgment condition.
Optionally, the error data determination module 302 is further specifically configured to:
sort the annotation data in descending order of error probability;
calculate, according to a preset proportion, a preset number of items of erroneous annotation data in the annotation data;
and determine the first preset number of items of the sorted annotation data as the erroneous annotation data.
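Selection by preset proportion can be sketched as below. The function name `select_by_proportion` is hypothetical, and rounding the preset number up with `math.ceil` is an assumption; the text does not say how a fractional count is handled.

```python
import math
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def select_by_proportion(
    items: Sequence[T], error_probs: Sequence[float], proportion: float
) -> List[T]:
    """Return the top `proportion` of items ranked by error probability."""
    if len(items) != len(error_probs):
        raise ValueError("items and error_probs must have the same length")
    if not 0.0 <= proportion <= 1.0:
        raise ValueError("proportion must lie in [0, 1]")
    # Preset number = proportion x total count (rounded up; an assumption).
    preset_number = math.ceil(len(items) * proportion)
    # Sort in descending order of error probability.
    ranked = sorted(zip(items, error_probs), key=lambda pair: pair[1], reverse=True)
    return [item for item, _ in ranked[:preset_number]]
```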
Optionally, the error data determination module 302 is specifically configured to:
select a preset amount of the annotation data for auditing to obtain an error rate of the annotation data;
determine, according to the error rate of the annotation data and a preset correct rate, the number of items of erroneous annotation data that need to be audited and annotated for the annotation data to reach the preset correct rate;
sort the annotation data in descending order of error probability;
and audit and annotate the sorted annotation data in sequence until the number of items determined to be erroneous reaches the number that need to be audited and annotated.
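One way to read these steps is sketched below. The quota formula is an assumption, since the text here does not give one: with N items, a sampled error rate e, and a preset correct rate c, roughly e·N items are wrong and (1 − c)·N may remain wrong, so about N·(e − (1 − c)) errors must be found and corrected. The names `errors_to_fix`, `audit_until_quota`, and the `is_wrong` callback (standing in for a human auditor) are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def errors_to_fix(total: int, error_rate: float, target_correct_rate: float) -> int:
    """Assumed quota: errors present minus errors tolerated at the target rate."""
    excess = error_rate - (1.0 - target_correct_rate)
    # round() absorbs floating-point noise before rounding up.
    return max(0, math.ceil(round(total * excess, 9)))


def audit_until_quota(
    items: Sequence[T],
    error_probs: Sequence[float],
    is_wrong: Callable[[T], bool],
    quota: int,
) -> Tuple[List[T], int]:
    """Audit items from most to least suspicious until `quota` errors are found.

    Returns the confirmed errors and the number of items that were audited.
    """
    order = sorted(range(len(items)), key=lambda i: error_probs[i], reverse=True)
    found: List[T] = []
    audited = 0
    for i in order:
        if len(found) >= quota:
            break
        audited += 1
        if is_wrong(items[i]):
            found.append(items[i])
    return found, audited
```

Because the items are visited in descending order of error probability, a well-trained model lets the quota be reached after auditing only a fraction of the data.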
Optionally, the apparatus for processing annotation data further includes an auditing model training module, configured to, before the annotation data annotated by the annotator is acquired:
obtain audited annotation data, where the audited annotation data includes annotation information and audit information;
and obtain the annotation data auditing model through a supervised machine learning module according to the annotation information and the audit information.
Optionally, the auditing model training module is specifically configured to:
determine an audit result for the audited annotation data according to the annotation information and the audit information in the audited annotation data;
and input the annotation information, the audit information, and the audit result into the supervised machine learning module for training to obtain the annotation data auditing model.
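The training step can be illustrated with logistic regression, one of the model types the document lists. This sketch rests on several assumptions: the audit result is taken to be 1 (error) when the annotator's label differs from the auditor's label, the annotation information has already been turned into numeric feature vectors, and training uses plain batch gradient descent. All function names and the learning-rate and epoch defaults are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def audit_result(annotated_label: str, audited_label: str) -> int:
    """Assumed rule: 1 = annotation judged wrong by the auditor, 0 = correct."""
    return 0 if annotated_label == audited_label else 1


def sigmoid(z: float) -> float:
    # Numerically stable for large positive and negative inputs.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_audit_model(
    features: Sequence[Sequence[float]],
    targets: Sequence[int],
    lr: float = 0.5,
    epochs: int = 2000,
) -> Tuple[List[float], float]:
    """Fit logistic-regression weights and bias by batch gradient descent."""
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    m = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, targets):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = p - y
            for j in range(n_features):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / m for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / m
    return weights, bias


def error_probability(weights: Sequence[float], bias: float, x: Sequence[float]) -> float:
    """Error probability the trained model assigns to one annotation."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
```

In use, each audited record would contribute one feature vector and one target produced by `audit_result`; the returned weights and bias then serve as the auditing model that scores new annotation data.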
Optionally, the supervised machine learning module is one or more of a convolutional neural network, a binary vector algorithm, a deep neural network, and a logistic regression algorithm.
In this embodiment of the invention, annotation data annotated by an annotator is obtained; a pre-constructed annotation data auditing model (trained on audited annotation data) is applied to the annotation data to determine the erroneous annotation data within it; and the erroneous annotation data is sent to the annotator. Because the erroneous annotation data is identified by the pre-constructed auditing model, the annotator can re-annotate it directly, which improves the efficiency of re-annotation. In addition, because the auditing model is trained on audited annotation data, the utilization rate of the audited annotation data is improved. The technical solution of this embodiment therefore improves both the utilization rate of audited annotation data and the efficiency of re-annotating annotation data.
The processing method and device for the annotation data provided by the embodiment of the application can realize each process in the foregoing method embodiments, and achieve the same functions and effects, which are not repeated here.
Example 3
Fig. 4 is a schematic diagram of an apparatus 400 for processing annotation data according to another embodiment of the present disclosure, where the apparatus 400 corresponds to the method according to the first aspect of embodiment 1. Referring to fig. 4, the apparatus 400 includes: a processor 410; and a memory 420 coupled to the processor 410 and configured to provide the processor 410 with instructions for the following processing steps:
acquiring annotation data annotated by an annotator;
determining erroneous annotation data in the annotation data by applying a pre-constructed annotation data auditing model to the annotation data, where the annotation data auditing model is trained on audited annotation data;
and sending the erroneous annotation data to the annotator.
Optionally, determining the erroneous annotation data in the annotation data by applying the pre-constructed annotation data auditing model to the annotation data includes:
inputting the annotation data into the pre-constructed annotation data auditing model to obtain an error probability for the annotation data;
and determining the erroneous annotation data in the annotation data according to the error probability and a preset judgment condition.
Optionally, determining the erroneous annotation data in the annotation data according to the error probability of the annotation data and the preset judgment condition includes:
sorting the annotation data in descending order of error probability;
calculating, according to a preset proportion, a preset number of items of erroneous annotation data in the annotation data;
and determining the first preset number of items of the sorted annotation data as the erroneous annotation data.
Optionally, determining the erroneous annotation data in the annotation data according to the error probability of the annotation data and the preset judgment condition includes:
selecting a preset amount of the annotation data for auditing to obtain an error rate of the annotation data;
determining, according to the error rate of the annotation data and a preset correct rate, the number of items of erroneous annotation data that need to be audited and annotated for the annotation data to reach the preset correct rate;
sorting the annotation data in descending order of error probability;
and auditing and annotating the sorted annotation data in sequence until the number of items determined to be erroneous reaches the number that need to be audited and annotated.
Optionally, before the annotation data annotated by the annotator is acquired, the processing steps further include:
obtaining audited annotation data, where the audited annotation data includes annotation information and audit information;
and obtaining the annotation data auditing model through a supervised machine learning module according to the annotation information and the audit information.
Optionally, obtaining the annotation data auditing model through the supervised machine learning module according to the annotation information and the audit information includes:
determining an audit result for the audited annotation data according to the annotation information and the audit information in the audited annotation data;
and inputting the annotation information, the audit information, and the audit result into the supervised machine learning module for training to obtain the annotation data auditing model.
Optionally, the supervised machine learning module is one or more of a convolutional neural network, a binary vector algorithm, a deep neural network, and a logistic regression algorithm.
In this embodiment of the invention, annotation data annotated by an annotator is obtained; a pre-constructed annotation data auditing model (trained on audited annotation data) is applied to the annotation data to determine the erroneous annotation data within it; and the erroneous annotation data is sent to the annotator. Because the erroneous annotation data is identified by the pre-constructed auditing model, the annotator can re-annotate it directly, which improves the efficiency of re-annotation. In addition, because the auditing model is trained on audited annotation data, the utilization rate of the audited annotation data is improved. The technical solution of this embodiment therefore improves both the utilization rate of audited annotation data and the efficiency of re-annotating annotation data.
The processing device for the labeled data provided by the embodiment of the application can realize each process in the foregoing method embodiments, and achieve the same functions and effects, which are not repeated here.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and various other media capable of storing program code.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and refinements without departing from the principle of the present invention, and these modifications and refinements should also be regarded as falling within the protection scope of the present invention.

Claims (10)

1. A method for processing annotation data, comprising:
acquiring annotation data annotated by an annotator;
determining erroneous annotation data in the annotation data by applying a pre-constructed annotation data auditing model to the annotation data, wherein the annotation data auditing model is trained on audited annotation data;
and sending the erroneous annotation data to the annotator.
2. The method of claim 1, wherein determining the erroneous annotation data in the annotation data by applying the pre-constructed annotation data auditing model to the annotation data comprises:
inputting the annotation data into the pre-constructed annotation data auditing model to obtain an error probability for the annotation data;
and determining the erroneous annotation data in the annotation data according to the error probability and a preset judgment condition.
3. The method of claim 2, wherein determining the erroneous annotation data in the annotation data according to the error probability of the annotation data and the preset judgment condition comprises:
sorting the annotation data in descending order of error probability;
calculating, according to a preset proportion, a preset number of items of erroneous annotation data in the annotation data;
and determining the first preset number of items of the sorted annotation data as the erroneous annotation data.
4. The method of claim 2, wherein determining the erroneous annotation data in the annotation data according to the error probability of the annotation data and the preset judgment condition comprises:
selecting a preset amount of the annotation data for auditing to obtain an error rate of the annotation data;
determining, according to the error rate of the annotation data and a preset correct rate, the number of items of erroneous annotation data that need to be audited and annotated for the annotation data to reach the preset correct rate;
sorting the annotation data in descending order of error probability;
and auditing and annotating the sorted annotation data in sequence until the number of items determined to be erroneous reaches the number that need to be audited and annotated.
5. The method of claim 1, further comprising, before acquiring the annotation data annotated by the annotator:
obtaining audited annotation data, wherein the audited annotation data comprises annotation information and audit information;
and obtaining the annotation data auditing model through a supervised machine learning module according to the annotation information and the audit information.
6. The method of claim 5, wherein obtaining the annotation data auditing model through the supervised machine learning module according to the annotation information and the audit information comprises:
determining an audit result for the audited annotation data according to the annotation information and the audit information in the audited annotation data;
and inputting the annotation information, the audit information, and the audit result into the supervised machine learning module for training to obtain the annotation data auditing model.
7. The method of claim 5 or 6, wherein the supervised machine learning module is one or more of a convolutional neural network, a binary vector algorithm, a deep neural network, and a logistic regression algorithm.
8. A storage medium, comprising a stored program, wherein, when the program runs, a processor executes the method for processing annotation data according to any one of claims 1 to 7.
9. A device for processing annotation data, comprising:
an annotation data acquisition module, configured to acquire annotation data annotated by an annotator;
an error data determination module, configured to determine erroneous annotation data in the annotation data by applying a pre-constructed annotation data auditing model to the annotation data, wherein the annotation data auditing model is trained on audited annotation data;
and an annotation data sending module, configured to send the erroneous annotation data to the annotator.
10. A device for processing annotation data, comprising:
a processor;
and a memory coupled to the processor and configured to provide the processor with instructions for the following processing steps:
acquiring annotation data annotated by an annotator;
determining erroneous annotation data in the annotation data by applying a pre-constructed annotation data auditing model to the annotation data, wherein the annotation data auditing model is trained on audited annotation data;
and sending the erroneous annotation data to the annotator.
CN201911056493.XA 2019-10-31 2019-10-31 Method, device and medium for processing labeled data Pending CN112749725A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911056493.XA CN112749725A (en) 2019-10-31 2019-10-31 Method, device and medium for processing labeled data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911056493.XA CN112749725A (en) 2019-10-31 2019-10-31 Method, device and medium for processing labeled data

Publications (1)

Publication Number Publication Date
CN112749725A true CN112749725A (en) 2021-05-04

Family

ID=75645667

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911056493.XA Pending CN112749725A (en) 2019-10-31 2019-10-31 Method, device and medium for processing labeled data

Country Status (1)

Country Link
CN (1) CN112749725A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109389275A (en) * 2017-08-08 2019-02-26 北京图森未来科技有限公司 A kind of image labeling method and device
CN109635110A (en) * 2018-11-30 2019-04-16 北京百度网讯科技有限公司 Data processing method, device, equipment and computer readable storage medium
CN110263322A (en) * 2019-05-06 2019-09-20 平安科技(深圳)有限公司 Audio for speech recognition corpus screening technique, device and computer equipment

Similar Documents

Publication Publication Date Title
CN112380859A (en) Public opinion information recommendation method and device, electronic equipment and computer storage medium
CN111209931A (en) Data processing method, platform, terminal device and storage medium
CN107590460A (en) Face classification method, apparatus and intelligent terminal
CN114398560B (en) Marketing interface setting method, device, equipment and medium based on WEB platform
CN114491047A (en) Multi-label text classification method and device, electronic equipment and storage medium
CN112906361A (en) Text data labeling method and device, electronic equipment and storage medium
CN113890712A (en) Data transmission method and device, electronic equipment and readable storage medium
CN114722091A (en) Data processing method, data processing device, storage medium and processor
CN111625567A (en) Data model matching method, device, computer system and readable storage medium
CN113268665A (en) Information recommendation method, device and equipment based on random forest and storage medium
CN112633988A (en) User product recommendation method and device, electronic equipment and readable storage medium
CN106651408B (en) Data analysis method and device
CN112749725A (en) Method, device and medium for processing labeled data
CN114168565B (en) Backtracking test method, device and system of business rule model and decision engine
CN111026946A (en) Page information extraction method, device, medium and equipment
CN108572948A (en) The processing method and processing device of doorplate information
CN112749150B (en) Error labeling data identification method, device and medium
CN110728138A (en) News text recognition method and device and storage medium
CN108510071B (en) Data feature extraction method and device and computer readable storage medium
CN114723488B (en) Course recommendation method and device, electronic equipment and storage medium
CN109522210A (en) Interface testing parameters analysis method, device, electronic device and storage medium
CN110826582A (en) Image feature training method, device and system
CN113434775B (en) Method and device for determining search content
CN113392105B (en) Service data processing method and terminal equipment
CN115934685A (en) Data checking and correcting method, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20210504