CN111104479A - Data labeling method and device - Google Patents

Data labeling method and device

Info

Publication number
CN111104479A
CN111104479A (application CN201911106007.0A)
Authority
CN
China
Prior art keywords
sample data
data
data set
labeling
threshold
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911106007.0A
Other languages
Chinese (zh)
Inventor
郭泽颖
林廷懋
钟伊妮
柯颖
陈铭新
李晓敦
赵世辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Construction Bank Corp
Original Assignee
China Construction Bank Corp
CCB Finetech Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Construction Bank Corp, CCB Finetech Co Ltd filed Critical China Construction Bank Corp
Priority to CN201911106007.0A
Publication of CN111104479A
Legal status: Pending


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data labeling method and device, relating to the field of computer technology. One embodiment of the method comprises: training an annotation model on a first data set in which each first sample datum carries one or more labels; acquiring a second data set to be labeled and, using the annotation model, determining the label of each second sample datum in the second data set and the position of that label within the sample, then marking the position to obtain a labeled third data set; checking the third sample data in the third data set and determining the matching degree of their labels against the truth labels; and correcting the third sample data whose matching degree is below a first threshold to obtain the labeling result corresponding to the second data set. This implementation saves the time and labor required for data labeling and improves labeling efficiency.

Description

Data labeling method and device
Technical Field
The invention relates to the technical field of computers, in particular to a data annotation method and device.
Background
The foundation of natural language processing is the labeled data set, i.e., a collection of data that has been labeled and classified according to the knowledge it contains.
Currently, data is usually labeled manually; for example, an annotator labels a document to be labeled according to an existing list of data labels.
In the process of implementing the invention, the inventor finds that at least the following problems exist in the prior art:
with the manual labeling approach, an annotator must first understand the meaning of the document to be labeled, then select suitable labels from the large number of labels in the data label list and use them to label the document. This process is time-consuming and labor-intensive.
Disclosure of Invention
In view of this, embodiments of the present invention provide a data annotation method and apparatus, which can save time and labor for data annotation and improve efficiency of data annotation.
To achieve the above object, according to an aspect of an embodiment of the present invention, a method for data annotation is provided.
The data labeling method of the embodiment of the invention comprises the following steps:
training to generate an annotation model according to a first data set, wherein the first sample data in the first data set has one or more labels;
acquiring a second data set to be labeled, labeling the second data set by using the labeling model, determining a label of second sample data in the second data set and a position of the label in the second sample data by using the labeling model, and labeling the position to obtain a labeled third data set;
checking third sample data in the third data set, and determining the matching degree of the label of the third sample data in the third data set compared with the true value label;
and correcting the third sample data with the matching degree smaller than a first threshold value to obtain a labeling result corresponding to the second data set.
Optionally,
the correcting the third sample data of which the matching degree is smaller than a first threshold value includes:
according to the truth label, correcting third sample data of which the matching degree is smaller than a first threshold and larger than a second threshold; wherein the second threshold is less than the first threshold.
Optionally, the method further comprises:
and updating the annotation model according to the corrected third sample data and the third sample data of which the matching degree is not less than the first threshold in the third data set.
Optionally,
the correcting the third sample data of which the matching degree is smaller than a first threshold value includes:
and correcting the third sample data with the matching degree smaller than the second threshold value by using the updated model.
Optionally,
the correcting the third sample data of which the matching degree is smaller than a first threshold value includes:
and when a plurality of third sample data with the matching degree smaller than a first threshold value exist, correcting the third sample data in sequence according to the similarity among the plurality of third sample data.
According to a second aspect of the embodiments of the present invention, there is provided an apparatus for data annotation, including: a model training module, a labeling module, and a correction module; wherein:
the model training module is used for generating a labeling model according to training of a first data set, wherein first sample data in the first data set is provided with one or more labels;
the labeling module is used for acquiring a second data set to be labeled, labeling the second data set by using the labeling model, determining a label of second sample data in the second data set and a position of the label in the second sample data by using the labeling model, and labeling the position to obtain a labeled third data set;
the correction module is configured to check third sample data in the third data set, determine a matching degree of a label of the third sample data in the third data set compared with a true label, and correct the third sample data whose matching degree is smaller than a first threshold, so as to obtain a labeling result corresponding to the second data set.
Optionally,
the correction module is configured to correct, according to the truth label, third sample data of which the matching degree is smaller than a first threshold and larger than a second threshold; wherein the second threshold is less than the first threshold.
Optionally,
the model training module is further configured to update the annotation model according to the corrected third sample data and the third sample data in the third data set, where the matching degree is not less than the first threshold.
Optionally,
the labeling module is configured to correct the third sample data whose matching degree is smaller than the second threshold by using the updated model.
Optionally,
the correction module is configured to, when there are a plurality of third sample data whose matching degree is smaller than the first threshold, correct the third sample data in sequence according to the similarity among them.
To achieve the above object, according to another aspect of the embodiments of the present invention, an electronic device for data annotation is provided.
An electronic device for data annotation according to an embodiment of the present invention includes: one or more processors; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the data annotation method of an embodiment of the present invention.
To achieve the above object, according to still another aspect of embodiments of the present invention, there is provided a computer-readable storage medium.
A computer-readable storage medium of an embodiment of the present invention stores thereon a computer program, which when executed by a processor implements a method of data annotation of an embodiment of the present invention.
One embodiment of the above invention has the following advantages or benefits: an annotation model is trained on a first data set composed of first sample data carrying one or more labels; the annotation model then labels the second sample data to be labeled, determining each sample's labels and the positions of those labels within the sample, and marking both. As a result, when the labeled third sample data are corrected, the marked positions can be inspected directly to check the matching degree against the truth labels, and the third sample data whose matching degree is below a first threshold can then be corrected. Because the annotation model marks the position corresponding to each label when labeling the second sample data, the marked positions can be checked directly during correction, which saves the labor and time consumed by correction and improves labeling efficiency. Furthermore, the annotation model can accurately mark all or some of the labels of the second sample data, so the user need not mark every label of every second sample datum one by one, further saving labeling labor and time and further improving labeling efficiency.
Further effects of the above optional implementations will be described below in connection with the specific embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIG. 1 is a schematic diagram of the main steps of a method for data annotation according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the main steps of another method for data annotation according to the embodiment of the present invention;
FIG. 3 is a schematic diagram of the main modules of a data annotation device according to an embodiment of the invention;
FIG. 4 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;
fig. 5 is a schematic block diagram of a computer system suitable for use in implementing a terminal device or server of an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
It should be noted that the embodiments of the present invention and the technical features of the embodiments may be combined with each other without conflict.
As shown in fig. 1, an embodiment of the present invention provides a method for data annotation, which may include the following steps:
step S101: an annotation model is generated by training according to a first data set, wherein the first sample data in the first data set has one or more labels.
In this step, an annotation model is trained on a first data set composed of first sample data that already carry one or more labels. For convenience of description, the first data set is denoted T0 and the trained annotation model M0. The labels of the first sample data may be truth labels marked manually.
Step S102: and acquiring a second data set to be labeled, labeling the second data set by using the labeling model, determining a label of second sample data in the second data set and a position of the label in the second sample data by using the labeling model, and labeling the position to obtain a labeled third data set.
The annotation model M0 labels the second sample data to be labeled; the third data set obtained after labeling is denoted T. It can be understood that the third sample data in the third data set T are the labeled second sample data. During labeling, the annotation model automatically determines one or more labels for each second sample datum and the positions of those labels within the sample, and marks the positions, so that when the third sample data are corrected later, whether the label at each marked position is correct can be checked directly, reducing the correction workload.
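A minimal sketch of labeling with positions, assuming a keyword lookup as a stand-in for the trained annotation model M0 (the label names, keywords, and sample text below are invented for the example, not taken from the patent):

```python
import re

def annotate(text, label_keywords):
    """Label a sample and record the position of each label in the text.

    label_keywords maps a label name to keywords that trigger it; this
    keyword lookup is only a stand-in for the trained annotation model M0.
    Returns a list of (label, start, end) triples, end exclusive.
    """
    annotations = []
    for label, keywords in label_keywords.items():
        for kw in keywords:
            for m in re.finditer(re.escape(kw), text):
                annotations.append((label, m.start(), m.end()))
    return annotations

sample = "The loan contract was signed at the Beijing branch."
keywords = {"FINANCE": ["loan contract"], "LOCATION": ["Beijing"]}
print(annotate(sample, keywords))  # each label comes with its span in the text
```

Recording the span alongside the label is what lets a corrector jump straight to the marked position instead of re-reading the whole sample.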
Step S103: and checking third sample data in the third data set, and determining the matching degree of the label of the third sample data in the third data set compared with the true value label.
Since the third sample data automatically labeled by the annotation model M0 may not be fully accurate, the third sample data may be further corrected according to the matching degree of their labels against the truth labels, improving labeling accuracy. A third sample datum's labels are those marked automatically by the annotation model M0, while the truth labels are marked manually by an annotator. The matching degree of a third sample datum against its truth labels is the proportion of the truth labels that were marked correctly: for example, if third sample datum A has 100 truth labels and the annotation model marked 70 of them correctly, its matching degree is 70%.
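The matching-degree computation described here can be expressed as a short function; representing each sample's labels as Python sets is an assumption made for the sketch:

```python
def matching_degree(predicted_labels, truth_labels):
    """Fraction of the truth labels that the model marked correctly.

    Both arguments are sets of labels for one sample. Following the
    example in the text: 100 truth labels, 70 marked correctly -> 0.70.
    """
    if not truth_labels:
        return 1.0  # nothing to match against
    correct = len(set(predicted_labels) & set(truth_labels))
    return correct / len(truth_labels)

truth = {f"label_{i}" for i in range(100)}
predicted = {f"label_{i}" for i in range(70)}  # 70 of the 100 are correct
print(matching_degree(predicted, truth))       # 0.7
```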
Step S104: and correcting the third sample data with the matching degree smaller than a first threshold value to obtain a labeling result corresponding to the second data set.
To improve the accuracy of data labeling, third sample data with a matching degree below 100% are corrected; that is, the first threshold may take the value 100%. Of course, the first threshold may also be adjusted according to actual requirements, so as to adjust the data labeling workload.
In order to improve the correction efficiency, in an embodiment of the present invention, the third sample data with the matching degree smaller than the first threshold and larger than the second threshold may be corrected according to the truth label; wherein the second threshold is less than the first threshold.
For example, when the first threshold value is 100% and the second threshold value is 80%, the third sample data in the third data set may be classified into three classes according to the matching degree of each third sample data in the third data set compared to the truth label, where one class is a set T1 of the third sample data with a matching degree of 100%, one class is a set T2 of the third sample data with a matching degree of less than 100% and not less than 80%, and the other class is a set T3 of the third sample data with a matching degree of less than 80%.
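The three-way split just described can be sketched as a small function; the function name and the (sample_id, matching_degree) pair representation are assumptions made for the example, not the patent's data structures:

```python
def partition_by_matching_degree(samples, first_threshold=1.0, second_threshold=0.8):
    """Split labeled samples into the sets T1/T2/T3 described above.

    samples is a list of (sample_id, matching_degree) pairs.
    T1: degree >= first_threshold  (accepted as-is)
    T2: second_threshold <= degree < first_threshold  (manual correction)
    T3: degree < second_threshold  (re-labeled by the updated model)
    """
    t1, t2, t3 = [], [], []
    for sample_id, degree in samples:
        if degree >= first_threshold:
            t1.append(sample_id)
        elif degree >= second_threshold:
            t2.append(sample_id)
        else:
            t3.append(sample_id)
    return t1, t2, t3

samples = [("a", 1.0), ("b", 0.9), ("c", 0.5), ("d", 0.8)]
print(partition_by_matching_degree(samples))  # (['a'], ['b', 'd'], ['c'])
```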
To improve the efficiency of data correction, the third sample data in the set T2 may be corrected manually: the correction process modifies inaccurate labels and adds missing labels so that they match the truth labels, i.e., the matching degree of the third sample data in T2 is corrected to 100%, after which the corrected T2 may be merged into the set T1.
In order to facilitate understanding of each third sample data by a annotating person in a manual correction process, so as to improve annotation efficiency, in an embodiment of the present invention, when a plurality of third sample data with matching degrees smaller than a first threshold are provided, the third sample data is corrected in sequence according to similarity between the plurality of third sample data.
For example, for each third sample datum Di in the set T2, its SimHash value Hi may be computed, and the data sorted by these values so that the datum following Di in the ordering is the Dk whose Hk differs least from Hi. The third sample data in T2 are thus presented to the annotator arranged in order of similarity, and the annotator corrects them in that order, which makes the text samples easier to understand and further improves labeling efficiency.
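One plausible reading of this ordering step is a SimHash fingerprint per sample plus a greedy nearest-neighbor chain. The patent does not specify the hash function or the exact sorting rule, so the md5-based SimHash over whitespace tokens and the greedy chaining below are illustrative choices only:

```python
import hashlib

def simhash(text, bits=64):
    """Simple SimHash fingerprint of a text over whitespace tokens."""
    v = [0] * bits
    for token in text.split():
        # 64-bit hash of the token via the first 8 bytes of its md5 digest
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

def order_by_similarity(texts):
    """Greedy chain: start from the first text, repeatedly append the
    remaining text whose SimHash is closest to the previous one."""
    remaining = [(t, simhash(t)) for t in texts]
    ordered = [remaining.pop(0)]
    while remaining:
        prev_hash = ordered[-1][1]
        nearest = min(range(len(remaining)),
                      key=lambda i: hamming(prev_hash, remaining[i][1]))
        ordered.append(remaining.pop(nearest))
    return [t for t, _ in ordered]

docs = ["loan contract for housing", "loan contract for a car",
        "weather report for Beijing", "loan agreement for housing"]
print(order_by_similarity(docs))
```

With this ordering, near-duplicate samples tend to land next to each other, so the annotator's understanding of one sample carries over to the next.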
In order to continuously optimize the annotation model, the annotation model may be continuously optimized according to the corrected data, that is, the annotation model may be updated according to the corrected third sample data and the third sample data in the third data set, where the matching degree is not less than the first threshold. For example, a new annotation model M1 may be generated by training based on a new training data set consisting of the first data set T0, the set T1, and the corrected T2.
Then, the updated model M1 may be used to correct the third sample data whose matching degree is smaller than the second threshold; that is, the third sample data in the set T3 (matching degree below 80%) are re-labeled, thereby correcting them and forming a new training data set.
It can be understood that after the set T3 is re-labeled with the new annotation model, the sample data in the resulting data set can again be divided into three classes: sample data with a matching degree not less than the first threshold (matching degree of 100%), sample data with a matching degree smaller than the first threshold but not smaller than the second threshold (matching degree below 100% and not below 80%), and sample data with a matching degree smaller than the second threshold (matching degree below 80%). The sample data with a matching degree below 100% and not below 80% can then again be corrected manually as described above, and the annotation model M1 further updated according to the corrected sample data and the sample data with a matching degree of 100%. In this loop, the sample data are continuously corrected and the annotation model is updated to improve labeling accuracy, until all the third sample data in T3 have been corrected or new sample data to be labeled are received.
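The overall correct-and-retrain loop can be sketched as control flow. The `run_labeling_loop` name, the `label_round` callback, and the dict-based truth lookup are assumptions for illustration; in the toy run below, the "model" simply labels more of each sample correctly in each round, standing in for retraining on the corrected data:

```python
def run_labeling_loop(sample_ids, label_round, truth,
                      first_threshold=1.0, second_threshold=0.8,
                      max_rounds=5):
    """Control-flow sketch of the correct-and-retrain loop above.

    label_round(sample_ids, round_no) stands in for labeling with the
    current model and returns {sample_id: predicted_label_set}; truth
    maps each sample to its truth label set. Both are assumed interfaces.
    """
    accepted, pending = {}, list(sample_ids)
    for round_no in range(max_rounds):
        if not pending:
            break
        predictions = label_round(pending, round_no)
        next_pending = []
        for sid in pending:
            pred, gold = predictions[sid], truth[sid]
            degree = len(pred & gold) / len(gold)
            if degree >= first_threshold:
                accepted[sid] = pred          # accepted as-is (like T1)
            elif degree >= second_threshold:
                accepted[sid] = gold          # manually corrected (like T2)
            else:
                next_pending.append(sid)      # re-label next round (like T3)
        pending = next_pending                # model would be retrained here
    return accepted, pending

# Toy run: the "model" labels more of each sample correctly every round.
truth = {"s1": {"a", "b"}, "s2": {"a", "b", "c", "d", "e"}}
def label_round(ids, round_no):
    return {sid: set(sorted(truth[sid])[: 2 + 3 * round_no]) for sid in ids}

accepted, pending = run_labeling_loop(["s1", "s2"], label_round, truth)
print(accepted, pending)
```

Each round shrinks the pending set: high-scoring samples are accepted, mid-scoring ones are fixed against the truth labels, and only the low scorers go back through the (notionally retrained) model.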
In summary, an embodiment of the present invention provides a method for data annotation, where the method may include the steps shown in fig. 2:
step S201: an annotation model is generated by training according to a first data set, wherein the first sample data in the first data set has one or more labels.
Step S202: and acquiring a second data set to be labeled, labeling the second data set by using the labeling model, determining a label of second sample data in the second data set and a position of the label in the second sample data by using the labeling model, and labeling the position to obtain a labeled third data set.
Step S203: and checking third sample data in the third data set, and determining the matching degree of the label of the third sample data in the third data set compared with the true value label.
Step S204: according to the truth label, correcting third sample data of which the matching degree is smaller than a first threshold and larger than a second threshold; wherein the second threshold is less than the first threshold.
Step S205: and updating the annotation model according to the corrected third sample data and the third sample data of which the matching degree is not less than the first threshold in the third data set.
Step S206: and correcting the third sample data with the matching degree smaller than the second threshold value by using the updated model.
As shown in fig. 3, an embodiment of the present invention provides an apparatus 300 for data annotation, including: a model training module 301, a labeling module 302, and a correction module 303; wherein:
the model training module 301 is configured to generate a labeling model according to training of a first data set, where the first sample data in the first data set has one or more labels;
the labeling module 302 is configured to obtain a second data set to be labeled, label the second data set by using the labeling model, determine a label of second sample data in the second data set and a position of the label in the second sample data by using the labeling model, and mark the position to obtain a labeled third data set;
the correcting module 303 is configured to verify third sample data in the third data set, determine a matching degree of a label of the third sample data in the third data set compared with a true label, and correct the third sample data whose matching degree is smaller than a first threshold, so as to obtain a labeling result corresponding to the second data set.
In an embodiment of the present invention, the correcting module 303 is configured to correct, according to the truth label, third sample data of which the matching degree is smaller than a first threshold and larger than a second threshold; wherein the second threshold is less than the first threshold.
In an embodiment of the present invention, the model training module 301 is further configured to update the annotation model according to the corrected third sample data and the third sample data in the third data set, where the matching degree is not smaller than the first threshold.
In an embodiment of the present invention, the labeling module 302 is configured to correct the third sample data with the matching degree smaller than the second threshold value by using the updated model.
In an embodiment of the present invention, the correcting module 303 is configured to, when a plurality of third sample data with the matching degree smaller than the first threshold are provided, correct the third sample data in sequence according to the similarity between the plurality of third sample data.
An embodiment of the present invention further provides an electronic device, including: one or more processors; storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to carry out a method according to any one of the preceding embodiments.
An embodiment of the present invention further provides a computer-readable medium, on which a computer program is stored, where the computer program is configured to, when executed by a processor, implement the method according to any one of the above embodiments.
FIG. 4 illustrates an exemplary system architecture 400 of a data annotation apparatus or method to which embodiments of the invention may be applied.
As shown in fig. 4, the system architecture 400 may include terminal devices 401, 402, 403, a network 404, and a server 405. The network 404 serves as a medium providing communication links between the terminal devices 401, 402, 403 and the server 405. Network 404 may include various types of connections, such as wired or wireless communication links or fiber-optic cables.
A user may use terminal devices 401, 402, 403 to interact with a server 405 over a network 404 to receive or send messages or the like. The terminal devices 401, 402, 403 may have various communication client applications installed thereon, such as shopping applications, web browser applications, search applications, instant messaging tools, mailbox clients, social platform software, and the like.
The terminal devices 401, 402, 403 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 405 may be a server that provides various services, such as a background management server that supports shopping websites browsed by users using the terminal devices 401, 402, and 403. The background management server may analyze and perform other processing on the received data such as the product information query request, and feed back a processing result (e.g., target push information and product information) to the terminal device.
It should be noted that the method for data annotation provided by the embodiment of the present invention is generally executed by the server 405, and accordingly, the apparatus for data annotation is generally disposed in the server 405.
It should be understood that the number of terminal devices, networks, and servers in fig. 4 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 5, shown is a block diagram of a computer system 500 suitable for use with a terminal device implementing an embodiment of the present invention. The terminal device shown in fig. 5 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 5, the computer system 500 includes a Central Processing Unit (CPU)501 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)502 or a program loaded from a storage section 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data necessary for the operation of the system 500 are also stored. The CPU 501, ROM 502, and RAM 503 are connected to each other via a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
The following components are connected to the I/O interface 505: an input portion 506 including a keyboard, a mouse, and the like; an output portion 507 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage portion 508 including a hard disk and the like; and a communication section 509 including a network interface card such as a LAN card, a modem, or the like. The communication section 509 performs communication processing via a network such as the internet. The driver 510 is also connected to the I/O interface 505 as necessary. A removable medium 511 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 510 as necessary, so that a computer program read out therefrom is mounted into the storage section 508 as necessary.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 509, and/or installed from the removable medium 511. The computer program performs the above-described functions defined in the system of the present invention when executed by the Central Processing Unit (CPU) 501.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present invention may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described as: a processor comprising a model training module, a labeling module, and a correction module. The names of these modules do not, in some cases, limit the modules themselves; for example, the model training module may also be described as "a module that trains and generates an annotation model".
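The three-module decomposition described above can be sketched as follows. This is a minimal illustration only: all class and method names are assumptions made for the example, the patent does not specify an API, and the "model" here is a toy lookup table standing in for a trained annotation model.

```python
# Hypothetical sketch of the processor's three modules; names are illustrative.

class ModelTrainingModule:
    def train(self, first_data_set):
        """Build a toy annotation model from (sample, labels) pairs."""
        return {sample: labels for sample, labels in first_data_set}

class LabelingModule:
    def label(self, model, second_data_set):
        """Label each second sample with the model, yielding the third data set."""
        return [(sample, model.get(sample, [])) for sample in second_data_set]

class CorrectionModule:
    def correct(self, third_data_set, truth_labels, first_threshold=1.0):
        """Replace labels whose match against the truth falls below the threshold."""
        corrected = []
        for sample, labels in third_data_set:
            truth = truth_labels.get(sample, labels)
            match = 1.0 if labels == truth else 0.0  # crude matching degree
            corrected.append((sample, truth if match < first_threshold else labels))
        return corrected
```

In this sketch the correction module compares each labeled sample directly against its truth labels, mirroring the checking step of the method, while training and labeling remain separate responsibilities.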
As another aspect, the present invention also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments, or may be separate and not incorporated into the apparatus. The computer readable medium carries one or more programs which, when executed by a device, cause the device to: train an annotation model on a first data set, wherein first sample data in the first data set has one or more labels; acquire a second data set to be labeled and label the second data set with the annotation model, determining a label of second sample data in the second data set and the position of the label in the second sample data, and marking that position, to obtain a labeled third data set; check third sample data in the third data set and determine the matching degree of the labels of the third sample data compared with the truth labels; and correct third sample data whose matching degree is smaller than a first threshold, to obtain a labeling result corresponding to the second data set.
According to the technical scheme of the embodiments of the present invention, a labeling model is generated by training on a first data set consisting of first sample data with one or more labels. The labeling model then labels the second sample data to be labeled: during labeling, both the labels of the second sample data and the positions of those labels within the second sample data are determined, and both the labels and their positions are marked. As a result, when the labeled third sample data is corrected, the marked positions can be checked directly to determine the matching degree against the truth labels, and third sample data whose matching degree is smaller than a first threshold can then be corrected. Because the labeling model marks the position corresponding to each label, the marked position can be inspected directly during correction, which saves the labor and time consumed by correction and improves labeling efficiency. Furthermore, the labeling model can accurately mark all or part of the labels of the second sample data, so a user does not need to mark each label of every second sample data one by one, further saving labeling labor and time and further improving labeling efficiency.
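The correction flow summarized above, refined by claims 2 and 4 into a two-threshold triage, can be sketched as follows. The concrete threshold values (0.9 and 0.5) and the overlap-based matching degree are assumptions made for the example; the patent does not fix a particular metric.

```python
# Illustrative two-threshold triage of labeled samples; thresholds are assumed.

FIRST_THRESHOLD = 0.9   # below this, a labeled sample needs correction
SECOND_THRESHOLD = 0.5  # below this, correction is deferred to the updated model

def matching_degree(predicted, truth):
    """Fraction of truth (label, position) pairs reproduced by the prediction."""
    if not truth:
        return 1.0
    return sum(1 for pair in truth if pair in predicted) / len(truth)

def triage(third_data_set):
    """Split labeled samples by matching degree, as in claims 1, 2 and 4."""
    accepted, correct_by_truth, correct_by_updated_model = [], [], []
    for predicted, truth in third_data_set:
        degree = matching_degree(predicted, truth)
        if degree >= FIRST_THRESHOLD:
            accepted.append(predicted)               # matching degree high enough
        elif degree >= SECOND_THRESHOLD:
            correct_by_truth.append(truth)           # corrected against truth labels
        else:
            correct_by_updated_model.append(predicted)  # re-labeled by updated model
    return accepted, correct_by_truth, correct_by_updated_model
```

Samples in the middle band are corrected against the truth labels, while samples below the second threshold are set aside to be re-labeled once the model has been updated with the corrected data.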
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method of data annotation, comprising:
training to generate an annotation model according to a first data set, wherein the first sample data in the first data set has one or more labels;
acquiring a second data set to be labeled, labeling the second data set by using the labeling model, determining a label of second sample data in the second data set and a position of the label in the second sample data by using the labeling model, and labeling the position to obtain a labeled third data set;
checking third sample data in the third data set, and determining the matching degree of the label of the third sample data in the third data set compared with the true value label;
and correcting the third sample data with the matching degree smaller than a first threshold value to obtain a labeling result corresponding to the second data set.
2. The method according to claim 1, wherein the correcting of third sample data whose matching degree is smaller than a first threshold comprises:
according to the truth label, correcting third sample data of which the matching degree is smaller than a first threshold and larger than a second threshold; wherein the second threshold is less than the first threshold.
3. The method of claim 2, further comprising:
and updating the annotation model according to the corrected third sample data and the third sample data of which the matching degree is not less than the first threshold in the third data set.
4. The method according to claim 3, wherein the correcting of third sample data whose matching degree is smaller than a first threshold comprises:
correcting third sample data with the matching degree smaller than the second threshold value by using the updated model;
and/or,
the correcting of the third sample data whose matching degree is smaller than a first threshold comprises:
and when a plurality of third sample data with the matching degree smaller than a first threshold value exist, correcting the third sample data in sequence according to the similarity among the plurality of third sample data.
5. An apparatus for annotating data, comprising: a model training module, a labeling module, and a correction module; wherein,
the model training module is used for training on a first data set to generate a labeling model, wherein first sample data in the first data set has one or more labels;
the labeling module is used for acquiring a second data set to be labeled, labeling the second data set by using the labeling model, determining a label of second sample data in the second data set and a position of the label in the second sample data by using the labeling model, and labeling the position to obtain a labeled third data set;
the correction module is configured to check third sample data in the third data set, determine a matching degree of a label of the third sample data in the third data set compared with a true label, and correct the third sample data whose matching degree is smaller than a first threshold, so as to obtain a labeling result corresponding to the second data set.
6. The apparatus of claim 5,
the correction module is configured to correct, according to the truth label, third sample data of which the matching degree is smaller than a first threshold and larger than a second threshold; wherein the second threshold is less than the first threshold.
7. The apparatus of claim 6,
the model training module is further configured to update the annotation model according to the corrected third sample data and the third sample data in the third data set, where the matching degree is not less than the first threshold.
8. The apparatus of claim 7,
the labeling module is configured to correct, by using the updated model, the third sample data of which the matching degree is smaller than the second threshold;
and/or,
the correction module is configured to, when there are a plurality of third sample data whose matching degree is smaller than the first threshold, correct the third sample data in sequence according to the similarity among the plurality of third sample data.
9. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-4.
10. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-4.
CN201911106007.0A 2019-11-13 2019-11-13 Data labeling method and device Pending CN111104479A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911106007.0A CN111104479A (en) 2019-11-13 2019-11-13 Data labeling method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911106007.0A CN111104479A (en) 2019-11-13 2019-11-13 Data labeling method and device

Publications (1)

Publication Number Publication Date
CN111104479A true CN111104479A (en) 2020-05-05

Family

ID=70420468

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911106007.0A Pending CN111104479A (en) 2019-11-13 2019-11-13 Data labeling method and device

Country Status (1)

Country Link
CN (1) CN111104479A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112163424A (en) * 2020-09-17 2021-01-01 中国建设银行股份有限公司 Data labeling method, device, equipment and medium
CN112183321A (en) * 2020-09-27 2021-01-05 深圳奇迹智慧网络有限公司 Method and device for optimizing machine learning model, computer equipment and storage medium
CN112445831A (en) * 2021-02-01 2021-03-05 南京爱奇艺智能科技有限公司 Data labeling method and device
CN112836732A (en) * 2021-01-25 2021-05-25 深圳市声扬科技有限公司 Data annotation verification method and device, electronic equipment and storage medium
CN112926621A (en) * 2021-01-21 2021-06-08 百度在线网络技术(北京)有限公司 Data labeling method and device, electronic equipment and storage medium
CN114548192A (en) * 2020-11-23 2022-05-27 千寻位置网络有限公司 Sample data processing method and device, electronic equipment and medium
CN114612699A (en) * 2022-03-10 2022-06-10 京东科技信息技术有限公司 Image data processing method and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190034823A1 (en) * 2017-07-27 2019-01-31 Getgo, Inc. Real time learning of text classification models for fast and efficient labeling of training data and customization
CN109582793A (en) * 2018-11-23 2019-04-05 深圳前海微众银行股份有限公司 Model training method, customer service system and data labeling system, readable storage medium storing program for executing
CN109886211A (en) * 2019-02-25 2019-06-14 北京达佳互联信息技术有限公司 Data mask method, device, electronic equipment and storage medium
CN109934227A (en) * 2019-03-12 2019-06-25 上海兑观信息科技技术有限公司 System for recognizing characters from image and method
CN110110811A (en) * 2019-05-17 2019-08-09 北京字节跳动网络技术有限公司 Method and apparatus for training pattern, the method and apparatus for predictive information
CN110378396A (en) * 2019-06-26 2019-10-25 北京百度网讯科技有限公司 Sample data mask method, device, computer equipment and storage medium

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112163424A (en) * 2020-09-17 2021-01-01 中国建设银行股份有限公司 Data labeling method, device, equipment and medium
CN112183321A (en) * 2020-09-27 2021-01-05 深圳奇迹智慧网络有限公司 Method and device for optimizing machine learning model, computer equipment and storage medium
CN114548192A (en) * 2020-11-23 2022-05-27 千寻位置网络有限公司 Sample data processing method and device, electronic equipment and medium
CN112926621A (en) * 2021-01-21 2021-06-08 百度在线网络技术(北京)有限公司 Data labeling method and device, electronic equipment and storage medium
CN112926621B (en) * 2021-01-21 2024-05-10 百度在线网络技术(北京)有限公司 Data labeling method, device, electronic equipment and storage medium
CN112836732A (en) * 2021-01-25 2021-05-25 深圳市声扬科技有限公司 Data annotation verification method and device, electronic equipment and storage medium
CN112836732B (en) * 2021-01-25 2024-04-19 深圳市声扬科技有限公司 Verification method and device for data annotation, electronic equipment and storage medium
CN112445831A (en) * 2021-02-01 2021-03-05 南京爱奇艺智能科技有限公司 Data labeling method and device
CN114612699A (en) * 2022-03-10 2022-06-10 京东科技信息技术有限公司 Image data processing method and device

Similar Documents

Publication Publication Date Title
CN111104479A (en) Data labeling method and device
CN107832045B (en) Method and apparatus for cross programming language interface conversion
CN108628830B (en) Semantic recognition method and device
CN109871311B (en) Method and device for recommending test cases
CN106919711B (en) Method and device for labeling information based on artificial intelligence
CN109359194B (en) Method and apparatus for predicting information categories
US9588952B2 (en) Collaboratively reconstituting tables
CN113377653B (en) Method and device for generating test cases
CN110705271B (en) System and method for providing natural language processing service
CN113626223A (en) Interface calling method and device
CN112988583A (en) Method and device for testing syntax compatibility of database
CN113760276A (en) Method and device for generating page code
CN113448869B (en) Method and device for generating test case, electronic equipment and computer readable medium
CN112818026A (en) Data integration method and device
CN112486482A (en) Page display method and device
CN113590756A (en) Information sequence generation method and device, terminal equipment and computer readable medium
CN109710634B (en) Method and device for generating information
CN107423271B (en) Document generation method and device
CN109871856B (en) Method and device for optimizing training sample
CN110647623B (en) Method and device for updating information
CN109857838B (en) Method and apparatus for generating information
CN114170451A (en) Text recognition method and device
CN110796137A (en) Method and device for identifying image
CN113742321A (en) Data updating method and device
CN112579080A (en) Method and device for generating user interface code

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220926

Address after: 25 Financial Street, Xicheng District, Beijing 100033

Applicant after: CHINA CONSTRUCTION BANK Corp.

Address before: 25 Financial Street, Xicheng District, Beijing 100033

Applicant before: CHINA CONSTRUCTION BANK Corp.

Applicant before: Jianxin Financial Science and Technology Co.,Ltd.