
CN109033220B - Automatic selection method, system, equipment and storage medium of labeled data

Info

Publication number: CN109033220B
Application number: CN201810712204.6A
Authority: CN
Inventors: 王科; 郭鹏
Original assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Current assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Priority date: 2018-06-29
Filing date: 2018-06-29
Publication date: 2022-09-06
Anticipated expiration: 2038-06-29
Also published as: CN109033220A

Abstract

The invention discloses an automatic selection method, system, device and storage medium for annotation data. The automatic selection method comprises the following steps: acquiring data to be annotated; judging whether the data to be annotated is structured data or unstructured data; if it is structured data, acquiring a plurality of pieces of annotated data obtained by annotating the data to be annotated multiple times, and selecting the annotated data that is repeated most often among them as target annotation data; if it is unstructured data, acquiring the annotated data obtained by annotating the data to be annotated, judging whether the annotated data passes an audit according to a reference annotation database in which a plurality of pieces of reference annotation data are stored, and, if the annotated data passes the audit, selecting the annotated data as the target annotation data. For structured and unstructured data to be annotated, the invention uses different modes to automatically select the annotated data that meets a preset rule as the target annotation data, thereby saving cost and improving efficiency and quality.

Description

Automatic selection method, system, equipment and storage medium of label data

Technical Field

The present invention relates to the field of data processing technologies, and in particular, to an automatic selection method, system, device, and storage medium for labeled data.

Background

Currently, the mainstream way to review and select annotated data is to submit the original data after manual annotation and have personnel on the data-receiving side review its quality item by item. Specifically, the mainstream method comprises the following steps: preparing an annotation tool; logging in to the annotation tool and starting manual annotation; submitting the annotated data for audit after annotation is finished; and having the data receiver audit the data manually, item by item.

With this method, manual item-by-item auditing consumes a large amount of labor in simple audit scenarios, while in complex audit scenarios (for example, annotating the words and parts of speech of a sentence) it is difficult for ordinary auditors to guarantee the quality of the annotated data. In addition, because the annotated data is ultimately supplied to a model algorithm, it is needed in large quantities, so manual item-by-item auditing costs a great deal of time and labor.

Disclosure of Invention

The invention aims to overcome the defect of the prior art that selecting annotated data by manual item-by-item review is time-consuming and labor-intensive, and provides an automatic selection method, system, device and storage medium for annotation data.

The invention solves the technical problems through the following technical scheme:

an automatic selection method of labeled data is characterized in that the automatic selection method comprises the following steps:

acquiring data to be marked;

judging whether the data to be marked is structured data or unstructured data;

if the data is structured data, acquiring multiple marked data after marking the data to be marked for multiple times;

selecting the marked data with the most repetition times in the plurality of marked data as target marked data;

if the data to be marked is the unstructured data, marked data after marking the data to be marked are obtained;

judging whether the labeled data passes the examination or not according to a reference label database, wherein a plurality of reference label data are stored in the reference label database;

and if the data passes the verification, selecting the labeled data as target labeled data.

Preferably, the step of judging whether the labeled data passes the audit according to the reference labeling database includes:

judging whether the similarity between the labeled data and the reference labeled data is within a first threshold value range;

and if so, the marked data passes the audit.

Preferably, when the data to be labeled is unstructured data, the step of obtaining labeled data after labeling the data to be labeled includes:

extracting the reference marking data from the reference marking database;

acquiring reference data to be annotated before the reference annotation data is annotated;

acquiring marked data obtained after marking the data to be marked and re-marking the reference data to be marked;

the step of judging whether the similarity between the labeled data and the preset reference labeled data is within a first threshold value range specifically includes:

and judging whether the similarity between the marked data obtained by re-marking the reference data to be marked and the reference marked data is within the first threshold range.

Preferably, when the similarity is within the first threshold range, the automatic selection method further includes:

judging whether the similarity is in a second threshold range, wherein the minimum value of the second threshold range is not smaller than the minimum value of the first threshold range;

and if so, adding the target annotation data serving as new reference annotation data into the reference annotation database.

Preferably, the automatic selection method further includes:

storing the target labeling data to a training database;

the training database is used for providing training data for an algorithm model, wherein the algorithm model comprises: at least one of a text recognition model, an image recognition model, a voice recognition model, a video recognition model.

An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements any one of the above methods for automatically selecting labeled data when executing the computer program.

A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when being executed by a processor, performs the steps of any one of the above methods for automatically selecting annotation data.

An automatic annotation data selection system, comprising:

the first acquisition module is used for acquiring data to be marked;

the first judging module is used for judging whether the data to be marked is structured data or unstructured data;

the second acquisition module is used for acquiring multiple marked data after marking the data to be marked for multiple times when the data to be marked is structured data;

the first selection module is used for selecting the marked data with the most repetition times from the multiple marked data as target marked data when the data to be marked is structured data;

a third obtaining module, configured to obtain labeled data after labeling the data to be labeled when the data to be labeled is unstructured data;

the second judging module is used for judging whether the marked data pass the examination and verification according to a reference marking database when the data to be marked are unstructured data, and the reference marking database stores a plurality of reference marking data;

and the second selection module is used for selecting the marked data as the target marking data when the judgment result of the second judging module is yes.

Preferably, the second determining module is specifically configured to: judging whether the similarity between the labeled data and the reference labeled data is within a first threshold value range;

and if so, the marked data passes the audit.

Preferably, the third obtaining module includes:

the extracting unit is used for extracting the reference marking data from the reference marking database;

a first obtaining unit, configured to obtain the reference data to be marked, that is, the data as it was before the reference marking data was marked;

the second acquisition unit is used for acquiring the marked data obtained by marking the data to be marked and re-marking the reference data to be marked;

the second judging module is specifically configured to: judge whether the similarity between the marked data obtained by re-marking the reference data to be marked and the reference marked data is within the first threshold range.

Preferably, the automatic selection system further includes:

a third judging module, configured to judge, when the second judging module judges that the similarity is within the first threshold range, whether the similarity is within a second threshold range, wherein the minimum value of the second threshold range is not smaller than the minimum value of the first threshold range;

and the expansion module is used for adding the target annotation data, as new reference annotation data, to the reference annotation database when the judgment result of the third judging module is yes.

Preferably, the automatic selection system further comprises:

the storage module is used for storing the target marking data to a training database;

The positive effects of the invention are as follows: depending on whether the data to be annotated is structured data or unstructured data, the invention uses different modes to automatically select the annotated data that meets the preset rule as the target annotation data. This not only saves the cost of manually reviewing and selecting annotated data and improves the efficiency of doing so, but also avoids operating errors in manual review and selection and improves the quality of the selected annotated data.

Drawings

Fig. 1 is a flowchart of the automatic selection method of annotation data according to embodiment 1 of the present invention.

Fig. 2 is a flowchart of step S5 of the automatic selection method of annotation data according to embodiment 1 of the present invention.

Fig. 3 is a schematic diagram of the hardware structure of the electronic device according to embodiment 2 of the present invention.

Fig. 4 is a block diagram of the automatic selection system of annotation data according to embodiment 4 of the present invention.

Detailed Description

The invention is further illustrated by the following examples, which are not intended to limit the invention thereto.

Example 1

This embodiment provides an automatic selection method for annotation data, which automatically selects, from manually annotated data, the annotated data that meets the audit requirement as the final target annotation data. Fig. 1 shows a flowchart of this embodiment. Referring to Fig. 1, the automatic selection method of this embodiment includes:

S1, acquiring data to be annotated;

Specifically, the data to be annotated may include, but is not limited to, text data, image data, voice data, and video data.

S2, judging whether the data to be annotated is structured data or unstructured data;

if the data is structured data, go to step S3; if the data is unstructured, go to step S5;

Specifically, structured data refers to data that can be represented and stored in two-dimensional table form using a relational database. For example, the structured data to be annotated may be attributes such as the category of an object. Unstructured data refers to data that has no fixed structure. For example, the unstructured data to be annotated may be face points, road conditions, and the like.
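The text does not state how step S2 makes its judgment. As one hedged illustration only, the sketch below assumes that a record counts as structured when it is a flat mapping of scalar fields (something that would fit one row of a relational table) and that everything else is unstructured. The names `is_structured` and `next_step` and the scalar test itself are assumptions made for this sketch, not part of the disclosure.

```python
from collections.abc import Mapping
from typing import Any

# Assumption: these are the value types that fit a single relational table cell.
SCALAR_TYPES = (str, int, float, bool, type(None))


def is_structured(item: Any) -> bool:
    """Hypothetical S2 check: a flat mapping of scalar fields is structured."""
    return isinstance(item, Mapping) and all(
        isinstance(value, SCALAR_TYPES) for value in item.values()
    )


def next_step(item: Any) -> str:
    """Return the step that S2 hands the item to: S3 (structured) or S5."""
    return "S3" if is_structured(item) else "S5"
```

Under this assumption a product record such as `{"title": "running shoes", "category": None}` is routed to step S3, while raw image bytes or a record holding a nested list of points is routed to step S5.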

S3, acquiring multiple marked data after marking the data to be marked for multiple times;

S4, selecting the annotated data that is repeated most often among the plurality of pieces of annotated data as the target annotation data, and going to step S8;

In steps S3 and S4, when the data to be annotated is structured data, it is handed to a plurality of people for manual annotation, so that a plurality of pieces of annotated data are obtained for it. For example, when the data to be annotated is the category attribute of an object, the annotation may be clothing, luggage, beauty, digital home appliances, and the like. Specifically, when the data to be annotated is the category attribute of a pair of sports shoes, the 11 pieces of annotated data (the number of pieces being an odd number greater than or equal to 3) may include 8 annotations of "shoes" and 3 annotations of "sports and outdoors"; the annotated data that is repeated most often among the 11 pieces, namely "shoes", is selected as the target annotation data.
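Steps S3 and S4 amount to a majority vote over the annotations collected for one item. A minimal sketch follows, assuming each annotation is a hashable label such as a category string; the function name `select_target_annotation` is hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def select_target_annotation(annotations: Sequence[Hashable]) -> Hashable:
    """Return the annotation that occurs most often (step S4)."""
    if not annotations:
        raise ValueError("at least one annotation is required")
    # most_common(1) yields a single (label, count) pair for the top label.
    label, _count = Counter(annotations).most_common(1)[0]
    return label


# The example from the text: 11 annotators, 8 say "shoes", 3 say "sports and outdoors".
votes = ["shoes"] * 8 + ["sports and outdoors"] * 3
print(select_target_annotation(votes))  # prints: shoes
```

The text suggests an odd number of annotations, which rules out a tie between two labels but not among three or more. In this sketch a tie is resolved in favor of the label that was seen first, which is a property of `Counter.most_common` and not a rule taken from the disclosure.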

S5, obtaining the marked data after marking the data to be marked;

S6, judging whether the annotated data passes the audit according to a reference annotation database, wherein the reference annotation database stores a plurality of pieces of reference annotation data;

if the audit is passed, go to step S7;

S7, selecting the annotated data as the target annotation data, and going to step S8;

In steps S5 to S7, when the data to be annotated is unstructured data, it is handed to a single person for manual annotation, so that a single piece of annotated data is obtained for the data to be annotated. For example, when the data to be annotated is face points, a set number of points may be annotated on a face image in a set order to obtain the annotated data. Whether the annotated data passes the audit is then judged against the reference annotation database, which stores a plurality of pieces of reference annotation data, to decide whether the annotated data is used as the target annotation data.

In step S6 of this embodiment, whether the annotated data passes the audit can be determined by judging whether the similarity between the annotated data and the reference annotation data is within a first threshold range. Specifically, referring to Fig. 2, step S5 may include:

S51, extracting the reference annotation data from the reference annotation database;

S52, acquiring the reference data to be annotated, that is, the data as it was before the reference annotation data was annotated;

S53, acquiring the annotated data obtained by annotating the data to be annotated and re-annotating the reference data to be annotated;

Step S6 may include: judging whether the similarity between the annotated data obtained by re-annotating the reference data to be annotated and the reference annotation data is within the first threshold range;

For example, suppose the data to be annotated is 90 face images. Ten annotated face images can be extracted from the reference annotation database as the reference annotation data for the 90 face images, and the original, unannotated images of those 10 face images are acquired as the reference data to be annotated. The 90 face images and the 10 original images are then handed to one person for manual annotation. The annotated data obtained by re-annotating the 10 original images is compared with the extracted reference annotation data to judge whether their similarity is within the first threshold range (which can be set according to actual needs, for example 70% to 100%). If so, the annotated data of the 90 face images passes the audit and is selected as the target annotation data of the 90 face images.
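The audit in steps S51 to S53 and S6 can be sketched as follows. The text does not define how similarity is computed, so this sketch assumes a simple metric for face points: the fraction of re-annotated points that lie within a pixel tolerance of the corresponding reference point, averaged over the reference images mixed into the batch. The metric, the `tolerance` value and all function names are assumptions for illustration; only the 70% to 100% range comes from the example in the text.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def point_similarity(relabeled: List[Point], reference: List[Point],
                     tolerance: float = 3.0) -> float:
    """Assumed metric: share of points within `tolerance` pixels of the reference."""
    if len(relabeled) != len(reference) or not reference:
        return 0.0
    hits = sum(1 for p, q in zip(relabeled, reference) if math.dist(p, q) <= tolerance)
    return hits / len(reference)


def batch_similarity(relabeled: Dict[str, List[Point]],
                     reference: Dict[str, List[Point]]) -> float:
    """Mean similarity over the reference items that were re-annotated (S53)."""
    scores = [point_similarity(relabeled.get(key, []), pts)
              for key, pts in reference.items()]
    return sum(scores) / len(scores) if scores else 0.0


def passes_audit(similarity: float,
                 first_range: Tuple[float, float] = (0.70, 1.00)) -> bool:
    """Step S6: the batch passes if the similarity is within the first threshold range."""
    low, high = first_range
    return low <= similarity <= high


# Two reference images stand in for the 10 used in the text.
reference = {"ref_01": [(10.0, 10.0), (20.0, 20.0)], "ref_02": [(5.0, 5.0), (8.0, 9.0)]}
relabeled = {"ref_01": [(11.0, 10.0), (20.0, 22.0)], "ref_02": [(5.0, 5.0), (30.0, 30.0)]}
score = batch_similarity(relabeled, reference)
print(score, passes_audit(score))  # prints: 0.75 True
```

The design point carried over from the text is that the annotator's work on the new items is never scored directly; the score is computed only on the reference items, whose correct annotations are already known, and the result is applied to the whole batch.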

S8, storing the target labeling data to a training database;

wherein, the training database is used for providing training data for the algorithm model, and the algorithm model may include: at least one of a text recognition model, an image recognition model, a voice recognition model, a video recognition model. Specifically, the target labeling data in this embodiment may be used to label attributes such as the type of the object, and may also be applicable to the fields of face unlocking, face payment, access control, automatic driving, and the like.
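Step S8 can be sketched with any store; the text does not name a database product or schema. The example below assumes an SQLite table with hypothetical column names (`item_id`, `modality`, `annotation`) and serializes each annotation as JSON so that category labels and point lists can share one table.

```python
import json
import sqlite3
from typing import Any


def open_training_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the assumed training database."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS training_data ("
        "item_id TEXT PRIMARY KEY, modality TEXT NOT NULL, annotation TEXT NOT NULL)"
    )
    return conn


def store_target_annotation(conn: sqlite3.Connection, item_id: str,
                            modality: str, annotation: Any) -> None:
    """Step S8: persist one piece of target annotation data."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO training_data VALUES (?, ?, ?)",
            (item_id, modality, json.dumps(annotation)),
        )


conn = open_training_db()
store_target_annotation(conn, "sku-1", "text", "shoes")
store_target_annotation(conn, "face-7", "image", [[10.0, 10.0], [20.0, 20.0]])
```

A training job for a text or image recognition model could then read rows filtered by `modality`; that split is a convenience of this sketch and is not prescribed by the text.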

In this embodiment, the reference annotation data in the reference annotation database may be selected in advance based on experience, or may be automatically updated during the process of selecting the annotation data. Specifically, when the similarity is determined to be within the first threshold range in step S6, it may be further determined whether the similarity is within a second threshold range, where a minimum value of the second threshold range is not smaller than a minimum value of the first threshold range, and if yes, the target annotation data is added to the reference annotation database as new reference annotation data. Therefore, the target annotation data with higher quality is added to the reference annotation database, and the updating and the expansion of the reference annotation database are realized.
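The two-range rule described above can be sketched as follows. The first range of 70% to 100% is the example given in the text; the second range of 90% to 100% is an assumed value chosen only to satisfy the stated constraint that its minimum is not smaller than the minimum of the first range. The function names and the use of a plain list as the reference annotation database are also assumptions.

```python
from typing import Any, List, Tuple

Range = Tuple[float, float]


def in_range(value: float, bounds: Range) -> bool:
    low, high = bounds
    return low <= value <= high


def audit_and_expand(similarity: float, target_annotations: List[Any],
                     reference_db: List[Any],
                     first_range: Range = (0.70, 1.00),
                     second_range: Range = (0.90, 1.00)) -> bool:
    """Return True if the batch passes; also grow reference_db for high-quality batches."""
    if second_range[0] < first_range[0]:
        raise ValueError("second range minimum must not be below first range minimum")
    if not in_range(similarity, first_range):
        return False  # audit failed, nothing is selected or added
    if in_range(similarity, second_range):
        # High-quality batch: reuse its target annotations as new reference data.
        reference_db.extend(target_annotations)
    return True


reference_db: List[Any] = []
print(audit_and_expand(0.95, ["face_001", "face_002"], reference_db))  # prints: True
print(audit_and_expand(0.75, ["face_003"], reference_db))              # prints: True
print(reference_db)  # prints: ['face_001', 'face_002']
```

A batch scoring 0.75 passes the audit but is not promoted to reference data, which is the behavior the stricter second range is meant to produce.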

In this embodiment, depending on whether the data to be annotated is structured data or unstructured data, a different mode is used to automatically select the annotated data that meets the preset rule as the target annotation data. Automatic selection not only saves the cost of manually reviewing and selecting annotated data and improves the efficiency of doing so, but also avoids operating errors in manual review and selection and improves the quality of the selected annotated data.

Example 2

The present embodiment provides an electronic device, which may be represented in the form of a computing device (for example, may be a server device), and includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the automatic selection method of annotation data provided in embodiment 1.

Fig. 3 shows a schematic diagram of a hardware structure of the present embodiment, and as shown in fig. 3, the electronic device 9 specifically includes:

at least one processor 91, at least one memory 92, and a bus 93 for connecting the various system components (including the processor 91 and the memory 92), wherein:

the bus 93 includes a data bus, an address bus, and a control bus.

Memory 92 includes volatile memory, such as Random Access Memory (RAM) 921 and/or cache memory 922, and can further include Read Only Memory (ROM) 923.

Memory 92 also includes a program/utility 925 having a set (at least one) of program modules 924, such program modules 924 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.

The processor 91 executes various functional applications and data processing, such as the automatic selection method of annotation data provided in embodiment 1 of the present invention, by running the computer program stored in the memory 92.

The electronic device 9 may further communicate with one or more external devices 94 (e.g., a keyboard, a pointing device, etc.). Such communication may be through an input/output (I/O) interface 95. Also, the electronic device 9 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via the network adapter 96. The network adapter 96 communicates with the other modules of the electronic device 9 via the bus 93. It should be understood that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 9, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID (disk array) systems, tape drives, and data backup storage systems, to name a few.

It should be noted that although several units/modules or sub-units/modules of the electronic device are mentioned in the above detailed description, such a division is merely exemplary and not mandatory. Indeed, according to embodiments of the application, the features and functions of two or more of the units/modules described above may be embodied in one unit/module. Conversely, the features and functions of one unit/module described above may be further divided among, and embodied by, a plurality of units/modules.

Example 3

The present embodiment provides a computer-readable storage medium on which a computer program is stored, which when executed by a processor implements the steps of the automatic selection method of annotation data provided in embodiment 1.

More specific examples of the readable storage medium include, but are not limited to: a portable disk, a hard disk, a random access memory, a read-only memory, an erasable programmable read-only memory, an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

In a possible implementation manner, the present invention can also be implemented in the form of a program product, which includes program code for causing a terminal device to execute the steps of implementing the automatic selection method of annotation data in embodiment 1 when the program product runs on the terminal device.

The program code for carrying out the invention may be written in any combination of one or more programming languages, and may be executed entirely on the user device, partly on the user device, as a stand-alone software package, partly on the user device and partly on a remote device, or entirely on the remote device.

Example 4

The present embodiment provides an automatic selecting system for annotation data, which is configured to automatically select, from manually annotated annotation data, annotation data that meets an audit requirement as final target annotation data. Fig. 4 shows a schematic block diagram of the present embodiment, and referring to fig. 4, the automatic selection system of the present embodiment includes:

the first obtaining module 1 is configured to obtain data to be labeled.

The first judging module 2 is configured to judge whether the data to be annotated is structured data or unstructured data.

In particular, structured data refers to data that can be represented and stored in a two-dimensional form using a relational database. For example, the structured data to be labeled can be attributes such as the category of the object. Unstructured data refers to data that has no fixed structure. For example, the unstructured data to be labeled can be face points, road conditions, and the like.

The second obtaining module 3 is configured to obtain a plurality of pieces of annotated data obtained by annotating the data to be annotated multiple times when the first judging module 2 judges that the data to be annotated is structured data.

The first selecting module 4 is configured to select, when the first judging module 2 judges that the data to be annotated is structured data, the annotated data that is repeated most often among the plurality of pieces of annotated data as the target annotation data.

Specifically, when the data to be annotated is structured data, it is handed to a plurality of people for manual annotation, and the second obtaining module 3 obtains a plurality of pieces of annotated data for it. For example, when the data to be annotated is the category attribute of an object, the annotation may be clothing, bags, cosmetics, digital home appliances, and the like. Specifically, when the data to be annotated is the category attribute of a pair of sports shoes, the 11 pieces of annotated data (the number of pieces being an odd number greater than or equal to 3) may include 8 annotations of "shoes" and 3 annotations of "sports and outdoors"; the first selecting module 4 selects the annotated data that is repeated most often among the 11 pieces, namely "shoes", as the target annotation data.

The third obtaining module 5 is configured to obtain the annotated data obtained by annotating the data to be annotated when the first judging module 2 judges that the data to be annotated is unstructured data.

The second judging module 6 is configured to judge, when the first judging module 2 judges that the data to be annotated is unstructured data, whether the annotated data passes the audit according to a reference annotation database, in which a plurality of pieces of reference annotation data are stored.

The second selecting module 7 is configured to select the annotated data as the target annotation data when the judgment result of the second judging module 6 is yes.

Specifically, when the data to be annotated is unstructured data, it is handed to one person for manual annotation, and the third obtaining module 5 obtains a single piece of annotated data for it. For example, when the data to be annotated is face points, a set number of points may be annotated on a face image in a set order to obtain the annotated data. The second judging module 6 then judges whether the annotated data passes the audit against the reference annotation database, which stores a plurality of pieces of reference annotation data, to decide whether the annotated data is used as the target annotation data.

In this embodiment, the second judging module 6 may judge whether the labeled data passes the audit by judging whether the similarity between the labeled data and the reference labeled data is within a first threshold range. Specifically, the third obtaining module 5 may include:

an extracting unit 51 for extracting the reference annotation data from the reference annotation database;

a first obtaining unit 52, configured to obtain reference to-be-labeled data before the reference labeling data is labeled;

a second obtaining unit 53, configured to obtain the annotated data obtained by annotating the data to be annotated and re-annotating the reference data to be annotated.

The second judging module 6 may specifically be configured to: judge whether the similarity between the annotated data obtained by re-annotating the reference data to be annotated and the reference annotation data is within the first threshold range.

For example, suppose the data to be annotated is 90 face images. The extracting unit 51 may extract 10 annotated face images from the reference annotation database as the reference annotation data for the 90 face images, and the first obtaining unit 52 acquires the original, unannotated images of those 10 face images as the reference data to be annotated. The 90 face images and the 10 original images are then handed to one person for manual annotation. The second judging module 6 compares the annotated data obtained by re-annotating the 10 original images with the extracted reference annotation data and judges whether their similarity is within the first threshold range (which may be set according to actual needs, for example 70% to 100%). If so, the annotated data of the 90 face images passes the audit, and the second selecting module 7 selects it as the target annotation data of the 90 face images.

The storage module 8 is configured to store the target annotation data in the training database.

In this embodiment, the reference annotation data in the reference annotation database may be selected in advance based on experience, or may be automatically updated during the process of selecting the annotation data. Specifically, when the second determining module 6 determines that the similarity is within the first threshold range, the automatic selection system of this embodiment may further include a third determining module and an expanding module, where the third determining module is configured to determine whether the similarity is within the second threshold range (the minimum value of the second threshold range is not smaller than the minimum value of the first threshold range), and if so, the expanding module is configured to add the target annotation data as new reference annotation data to the reference annotation database. Therefore, the target annotation data with higher quality is added to the reference annotation database, and the updating and the expansion of the reference annotation database are realized.
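One way to picture how the modules of this embodiment fit together is to treat each module as a callable injected into a small pipeline object. The sketch below is illustrative only: the class name `SelectionSystem`, the callable signatures, the in-memory list used as the training database, and the choice to return `None` when the audit fails are all assumptions, since the text does not say what happens to data that fails the audit.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional


@dataclass
class SelectionSystem:
    is_structured: Callable[[Any], bool]       # first judging module 2
    annotate_many: Callable[[Any], List[Any]]  # second obtaining module 3
    annotate_once: Callable[[Any], Any]        # third obtaining module 5
    passes_audit: Callable[[Any], bool]        # second judging module 6
    training_db: List[Any] = field(default_factory=list)  # written by storage module 8

    def select(self, item: Any) -> Optional[Any]:
        """Run one item (from first obtaining module 1) through the pipeline."""
        if self.is_structured(item):
            # First selecting module 4: most repeated annotation wins.
            # Assumes annotate_many returns a non-empty list of hashable labels.
            target = Counter(self.annotate_many(item)).most_common(1)[0][0]
        else:
            candidate = self.annotate_once(item)
            if not self.passes_audit(candidate):
                return None  # assumption: rejected data is simply not selected
            target = candidate  # second selecting module 7
        self.training_db.append(target)  # storage module 8
        return target


system = SelectionSystem(
    is_structured=lambda item: isinstance(item, str),
    annotate_many=lambda item: ["shoes", "shoes", "bags"],
    annotate_once=lambda item: {"points": 5},
    passes_audit=lambda annotation: annotation["points"] >= 5,
)
print(system.select("sneaker listing"))  # prints: shoes
```

Injecting the callables keeps the routing logic separate from how annotations are collected and audited, which mirrors the note in the description that the module division is exemplary and may be merged or split.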

While specific embodiments of the invention have been described above, it will be understood by those skilled in the art that this is by way of example only, and that the scope of the invention is defined by the appended claims. Various changes and modifications to these embodiments may be made by those skilled in the art without departing from the spirit and scope of the invention, and these changes and modifications are within the scope of the invention.

Claims

1. An automatic selection method of labeled data is characterized in that the automatic selection method comprises the following steps:

acquiring data to be marked;

judging whether the data to be marked is structured data or unstructured data;

if the data is unstructured data, obtaining single marked data after marking the data to be marked;

judging whether the labeled data pass the examination or not according to a plurality of reference labeling data stored in a reference labeling database;

if the marked data pass the verification, selecting the marked data as target marked data;

the step of judging whether the marked data passes the audit according to the reference marking database comprises the following steps:

judging whether the similarity between the labeled data and the reference labeled data is within a first threshold value range or not;

if yes, the marked data pass the verification;

when the data to be labeled is unstructured data, the step of acquiring labeled data after labeling the data to be labeled comprises the following steps:

extracting the reference marking data from the reference marking database;

acquiring reference data to be marked before the reference marking data is marked;

2. The method of claim 1, wherein when the similarity is within the first threshold, the method further comprises:

3. The method of automatically selecting annotation data of claim 1, further comprising:

storing the target labeling data to a training database;

4. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method for automatic selection of annotation data according to any of claims 1 to 3 when executing the computer program.

5. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method for automatic selection of annotation data according to any of claims 1 to 3.

6. An automatic annotation data selection system, comprising:

the first acquisition module is used for acquiring data to be marked;

the third acquisition module is used for acquiring marked data after singly marking the data to be marked when the data to be marked is unstructured data;

the second judging module is used for judging whether the marked data pass the auditing according to a plurality of reference marking data stored in a reference marking database when the data to be marked are unstructured data;

the second selection module is used for selecting the marked data as target marking data when the judgment result of the second judging module is yes;

the second determination module is specifically configured to: judging whether the similarity between the labeled data and the reference labeled data is within a first threshold value range or not;

if yes, the marked data pass the verification;

the third obtaining module includes:

the first acquisition unit is used for acquiring the reference to-be-labeled data before the reference labeling data is labeled;

the second judgment module is specifically configured to: judge whether the similarity between the marked data obtained by re-marking the reference data to be marked and the reference marked data is within the first threshold range.

7. The system for automatically selecting annotation data according to claim 6, wherein when the second determination module determines yes, the system for automatically selecting annotation data further comprises:

a third determining module, configured to determine whether the similarity is within a second threshold range when the second determining module determines that the similarity is within the first threshold range, where a minimum value of the second threshold range is not smaller than a minimum value of the first threshold range;

and the expanding module is used for adding the target annotation data, as new reference annotation data, to the reference annotation database when the judgment result of the third determining module is yes.

8. The automatic annotation data selection system of claim 6, further comprising: