CN111104479A - Data labeling method and device - Google Patents

Data labeling method and device

Info

Publication number
CN111104479A
CN111104479A (application CN201911106007.0A)
Authority
CN
China
Prior art keywords
sample data
data
data set
labeling
threshold
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911106007.0A
Other languages
Chinese (zh)
Inventor
郭泽颖
林廷懋
钟伊妮
柯颖
陈铭新
李晓敦
赵世辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Construction Bank Corp
Original Assignee
China Construction Bank Corp
CCB Finetech Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Construction Bank Corp, CCB Finetech Co Ltd filed Critical China Construction Bank Corp
Priority to CN201911106007.0A
Publication of CN111104479A
Legal status: Pending


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data labeling method and device, relating to the field of computer technology. One embodiment of the method comprises: training an annotation model on a first data set in which each first sample datum carries one or more labels; acquiring a second data set to be labeled and, using the annotation model, determining the label of each second sample datum in the second data set and the position of that label within the sample, then marking the position to obtain a labeled third data set; checking the third sample data in the third data set and determining the matching degree of their labels against the truth labels; and correcting the third sample data whose matching degree is below a first threshold to obtain the labeling result corresponding to the second data set. This implementation saves the time and labor required for data labeling and improves labeling efficiency.

Description

Data labeling method and device
Technical Field
The invention relates to the technical field of computers, in particular to a data annotation method and device.
Background
The foundation of natural language processing is the labeled data set, i.e., a collection of data that has been labeled and classified according to the knowledge it contains.
Currently, data is usually labeled manually; for example, an annotator labels a document to be labeled according to an existing list of data labels.
In the process of implementing the invention, the inventor finds that at least the following problems exist in the prior art:
with the manual labeling approach, an annotator must first understand the meaning of the document to be labeled, then select suitable labels from the large number of labels in the data label list and use them to label the document. This process is time-consuming and labor-intensive.
Disclosure of Invention
In view of this, embodiments of the present invention provide a data annotation method and apparatus, which can save time and labor for data annotation and improve efficiency of data annotation.
To achieve the above object, according to an aspect of an embodiment of the present invention, a method for data annotation is provided.
The data labeling method of the embodiment of the invention comprises the following steps:
training to generate an annotation model according to a first data set, wherein the first sample data in the first data set has one or more labels;
acquiring a second data set to be labeled, labeling the second data set by using the labeling model, determining a label of second sample data in the second data set and a position of the label in the second sample data by using the labeling model, and labeling the position to obtain a labeled third data set;
checking third sample data in the third data set, and determining the matching degree of the label of the third sample data in the third data set compared with the true value label;
and correcting the third sample data with the matching degree smaller than a first threshold value to obtain a labeling result corresponding to the second data set.
Optionally,
the correcting the third sample data of which the matching degree is smaller than a first threshold value includes:
according to the truth label, correcting third sample data of which the matching degree is smaller than a first threshold and larger than a second threshold; wherein the second threshold is less than the first threshold.
Optionally, the method further comprises:
and updating the annotation model according to the corrected third sample data and the third sample data of which the matching degree is not less than the first threshold in the third data set.
Optionally,
the correcting the third sample data of which the matching degree is smaller than a first threshold value includes:
and correcting the third sample data with the matching degree smaller than the second threshold value by using the updated model.
Optionally,
the correcting the third sample data of which the matching degree is smaller than a first threshold value includes:
and when a plurality of third sample data with the matching degree smaller than a first threshold value exist, correcting the third sample data in sequence according to the similarity among the plurality of third sample data.
According to a second aspect of the embodiments of the present invention, there is provided an apparatus for data annotation, including: a model training module, a labeling module, and a correction module; wherein:
the model training module is used for generating a labeling model according to training of a first data set, wherein first sample data in the first data set is provided with one or more labels;
the labeling module is used for acquiring a second data set to be labeled, labeling the second data set by using the labeling model, determining a label of second sample data in the second data set and a position of the label in the second sample data by using the labeling model, and labeling the position to obtain a labeled third data set;
the correction module is configured to check third sample data in the third data set, determine a matching degree of a label of the third sample data in the third data set compared with a true label, and correct the third sample data whose matching degree is smaller than a first threshold, so as to obtain a labeling result corresponding to the second data set.
Optionally,
the correction module is configured to correct, according to the truth label, third sample data of which the matching degree is smaller than a first threshold and larger than a second threshold; wherein the second threshold is less than the first threshold.
Optionally,
the model training module is further configured to update the annotation model according to the corrected third sample data and the third sample data in the third data set, where the matching degree is not less than the first threshold.
Optionally,
the labeling module is configured to correct the third sample data whose matching degree is smaller than the second threshold by using the updated model.
Optionally,
the correction module is configured to, when there are a plurality of third sample data whose matching degree is smaller than the first threshold, correct the third sample data in sequence according to the similarity among them.
To achieve the above object, according to another aspect of the embodiments of the present invention, an electronic device for data annotation is provided.
An electronic device for data annotation according to an embodiment of the present invention includes: one or more processors; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the data annotation method of an embodiment of the present invention.
To achieve the above object, according to still another aspect of embodiments of the present invention, there is provided a computer-readable storage medium.
A computer-readable storage medium of an embodiment of the present invention stores thereon a computer program, which when executed by a processor implements a method of data annotation of an embodiment of the present invention.
One embodiment of the above invention has the following advantages or benefits: an annotation model is trained on a first data set composed of first sample data carrying one or more labels; the annotation model then labels the second sample data to be labeled, determining each sample's labels and the positions of those labels within the sample, and marking both. As a result, when the labeled third sample data are corrected, the marked positions can be inspected directly to check the matching degree against the truth labels, and the third sample data whose matching degree is below a first threshold can then be corrected. Because the annotation model marks the position corresponding to each label when labeling the second sample data, the marked positions can be checked directly during correction, which saves the labor and time consumed by correction and improves labeling efficiency. Furthermore, the annotation model can accurately mark all or some of the labels of the second sample data, so the user need not mark every label of every second sample datum one by one, further saving labeling labor and time and further improving labeling efficiency.
Further effects of the above optional implementations will be described below in connection with the specific embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIG. 1 is a schematic diagram of the main steps of a method for data annotation according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the main steps of another method for data annotation according to the embodiment of the present invention;
FIG. 3 is a schematic diagram of the main modules of a data annotation device according to an embodiment of the invention;
FIG. 4 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;
fig. 5 is a schematic block diagram of a computer system suitable for use in implementing a terminal device or server of an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
It should be noted that the embodiments of the present invention and the technical features of the embodiments may be combined with each other without conflict.
As shown in fig. 1, an embodiment of the present invention provides a method for data annotation, which may include the following steps:
step S101: an annotation model is generated by training according to a first data set, wherein the first sample data in the first data set has one or more labels.
In this step, an annotation model is trained on a first data set composed of first sample data that already carry one or more labels. For convenience of description, the first data set is denoted T0 and the trained annotation model M0. The labels of the first sample data may be truth labels marked manually.
Step S102: and acquiring a second data set to be labeled, labeling the second data set by using the labeling model, determining a label of second sample data in the second data set and a position of the label in the second sample data by using the labeling model, and labeling the position to obtain a labeled third data set.
The annotation model M0 labels the second sample data to be labeled; the third data set obtained after labeling is denoted T. It can be understood that the third sample data in the third data set T are the labeled second sample data. During labeling, the annotation model automatically determines one or more labels for each second sample datum and the positions of those labels within the sample, and marks the positions, so that when the third sample data are corrected later, whether the label at each marked position is correct can be checked directly, reducing the correction workload.
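A minimal sketch of labeling with positions, assuming a keyword lookup as a stand-in for the trained annotation model M0 (the label names, keywords, and sample text below are invented for the example, not taken from the patent):

```python
import re

def annotate(text, label_keywords):
    """Label a sample and record the position of each label in the text.

    label_keywords maps a label name to keywords that trigger it; this
    keyword lookup is only a stand-in for the trained annotation model M0.
    Returns a list of (label, start, end) triples, end exclusive.
    """
    annotations = []
    for label, keywords in label_keywords.items():
        for kw in keywords:
            for m in re.finditer(re.escape(kw), text):
                annotations.append((label, m.start(), m.end()))
    return annotations

sample = "The loan contract was signed at the Beijing branch."
keywords = {"FINANCE": ["loan contract"], "LOCATION": ["Beijing"]}
print(annotate(sample, keywords))  # each label comes with its span in the text
```

Recording the span alongside the label is what lets a corrector jump straight to the marked position instead of re-reading the whole sample.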
Step S103: and checking third sample data in the third data set, and determining the matching degree of the label of the third sample data in the third data set compared with the true value label.
Since the third sample data automatically labeled by the annotation model M0 may not be fully accurate, the third sample data may be further corrected according to the matching degree of their labels against the truth labels, improving labeling accuracy. A third sample datum's labels are those marked automatically by the annotation model M0, while the truth labels are marked manually by an annotator. The matching degree of a third sample datum against its truth labels is the proportion of the truth labels that were marked correctly: for example, if third sample datum A has 100 truth labels and the annotation model marked 70 of them correctly, its matching degree is 70%.
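The matching-degree computation described here can be expressed as a short function; representing each sample's labels as Python sets is an assumption made for the sketch:

```python
def matching_degree(predicted_labels, truth_labels):
    """Fraction of the truth labels that the model marked correctly.

    Both arguments are sets of labels for one sample. Following the
    example in the text: 100 truth labels, 70 marked correctly -> 0.70.
    """
    if not truth_labels:
        return 1.0  # nothing to match against
    correct = len(set(predicted_labels) & set(truth_labels))
    return correct / len(truth_labels)

truth = {f"label_{i}" for i in range(100)}
predicted = {f"label_{i}" for i in range(70)}  # 70 of the 100 are correct
print(matching_degree(predicted, truth))       # 0.7
```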
Step S104: and correcting the third sample data with the matching degree smaller than a first threshold value to obtain a labeling result corresponding to the second data set.
To improve the accuracy of data labeling, third sample data with a matching degree below 100% are corrected; that is, the first threshold may take the value 100%. Of course, the first threshold may also be adjusted according to actual requirements, so as to adjust the data labeling workload.
In order to improve the correction efficiency, in an embodiment of the present invention, the third sample data with the matching degree smaller than the first threshold and larger than the second threshold may be corrected according to the truth label; wherein the second threshold is less than the first threshold.
For example, when the first threshold value is 100% and the second threshold value is 80%, the third sample data in the third data set may be classified into three classes according to the matching degree of each third sample data in the third data set compared to the truth label, where one class is a set T1 of the third sample data with a matching degree of 100%, one class is a set T2 of the third sample data with a matching degree of less than 100% and not less than 80%, and the other class is a set T3 of the third sample data with a matching degree of less than 80%.
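The three-way split just described can be sketched as a small function; the function name and the (sample_id, matching_degree) pair representation are assumptions made for the example, not the patent's data structures:

```python
def partition_by_matching_degree(samples, first_threshold=1.0, second_threshold=0.8):
    """Split labeled samples into the sets T1/T2/T3 described above.

    samples is a list of (sample_id, matching_degree) pairs.
    T1: degree >= first_threshold  (accepted as-is)
    T2: second_threshold <= degree < first_threshold  (manual correction)
    T3: degree < second_threshold  (re-labeled by the updated model)
    """
    t1, t2, t3 = [], [], []
    for sample_id, degree in samples:
        if degree >= first_threshold:
            t1.append(sample_id)
        elif degree >= second_threshold:
            t2.append(sample_id)
        else:
            t3.append(sample_id)
    return t1, t2, t3

samples = [("a", 1.0), ("b", 0.9), ("c", 0.5), ("d", 0.8)]
print(partition_by_matching_degree(samples))  # (['a'], ['b', 'd'], ['c'])
```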
To improve the efficiency of data correction, the third sample data in the set T2 may be corrected manually: the correction process modifies inaccurate labels and adds missing labels so that they match the truth labels, i.e., the matching degree of the third sample data in T2 is corrected to 100%, after which the corrected T2 may be merged into the set T1.
In order to facilitate understanding of each third sample data by a annotating person in a manual correction process, so as to improve annotation efficiency, in an embodiment of the present invention, when a plurality of third sample data with matching degrees smaller than a first threshold are provided, the third sample data is corrected in sequence according to similarity between the plurality of third sample data.
For example, for each third sample datum Di in the set T2, its SimHash value Hi may be computed, and the data sorted by these values so that the datum following Di in the ordering is the Dk whose Hk differs least from Hi. The third sample data in T2 are thus presented to the annotator arranged in order of similarity, and the annotator corrects them in that order, which makes the text samples easier to understand and further improves labeling efficiency.
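One plausible reading of this ordering step is a SimHash fingerprint per sample plus a greedy nearest-neighbor chain. The patent does not specify the hash function or the exact sorting rule, so the md5-based SimHash over whitespace tokens and the greedy chaining below are illustrative choices only:

```python
import hashlib

def simhash(text, bits=64):
    """Simple SimHash fingerprint of a text over whitespace tokens."""
    v = [0] * bits
    for token in text.split():
        # 64-bit hash of the token via the first 8 bytes of its md5 digest
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

def order_by_similarity(texts):
    """Greedy chain: start from the first text, repeatedly append the
    remaining text whose SimHash is closest to the previous one."""
    remaining = [(t, simhash(t)) for t in texts]
    ordered = [remaining.pop(0)]
    while remaining:
        prev_hash = ordered[-1][1]
        nearest = min(range(len(remaining)),
                      key=lambda i: hamming(prev_hash, remaining[i][1]))
        ordered.append(remaining.pop(nearest))
    return [t for t, _ in ordered]

docs = ["loan contract for housing", "loan contract for a car",
        "weather report for Beijing", "loan agreement for housing"]
print(order_by_similarity(docs))
```

With this ordering, near-duplicate samples tend to land next to each other, so the annotator's understanding of one sample carries over to the next.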
In order to continuously optimize the annotation model, the annotation model may be continuously optimized according to the corrected data, that is, the annotation model may be updated according to the corrected third sample data and the third sample data in the third data set, where the matching degree is not less than the first threshold. For example, a new annotation model M1 may be generated by training based on a new training data set consisting of the first data set T0, the set T1, and the corrected T2.
Then, the updated model M1 may be used to correct the third sample data whose matching degree is smaller than the second threshold; that is, the third sample data in the set T3 (matching degree below 80%) are re-labeled, thereby correcting them and forming a new training data set.
It can be understood that after the set T3 is re-labeled with the new annotation model, the sample data in the resulting data set can again be divided into three classes: sample data with a matching degree not less than the first threshold (matching degree of 100%), sample data with a matching degree smaller than the first threshold but not smaller than the second threshold (matching degree below 100% and not below 80%), and sample data with a matching degree smaller than the second threshold (matching degree below 80%). The sample data with a matching degree below 100% and not below 80% can then again be corrected manually as described above, and the annotation model M1 further updated according to the corrected sample data and the sample data with a matching degree of 100%. In this loop, the sample data are continuously corrected and the annotation model is updated to improve labeling accuracy, until all the third sample data in T3 have been corrected or new sample data to be labeled are received.
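The overall correct-and-retrain loop can be sketched as control flow. The `run_labeling_loop` name, the `label_round` callback, and the dict-based truth lookup are assumptions for illustration; in the toy run below, the "model" simply labels more of each sample correctly in each round, standing in for retraining on the corrected data:

```python
def run_labeling_loop(sample_ids, label_round, truth,
                      first_threshold=1.0, second_threshold=0.8,
                      max_rounds=5):
    """Control-flow sketch of the correct-and-retrain loop above.

    label_round(sample_ids, round_no) stands in for labeling with the
    current model and returns {sample_id: predicted_label_set}; truth
    maps each sample to its truth label set. Both are assumed interfaces.
    """
    accepted, pending = {}, list(sample_ids)
    for round_no in range(max_rounds):
        if not pending:
            break
        predictions = label_round(pending, round_no)
        next_pending = []
        for sid in pending:
            pred, gold = predictions[sid], truth[sid]
            degree = len(pred & gold) / len(gold)
            if degree >= first_threshold:
                accepted[sid] = pred          # accepted as-is (like T1)
            elif degree >= second_threshold:
                accepted[sid] = gold          # manually corrected (like T2)
            else:
                next_pending.append(sid)      # re-label next round (like T3)
        pending = next_pending                # model would be retrained here
    return accepted, pending

# Toy run: the "model" labels more of each sample correctly every round.
truth = {"s1": {"a", "b"}, "s2": {"a", "b", "c", "d", "e"}}
def label_round(ids, round_no):
    return {sid: set(sorted(truth[sid])[: 2 + 3 * round_no]) for sid in ids}

accepted, pending = run_labeling_loop(["s1", "s2"], label_round, truth)
print(accepted, pending)
```

Each round shrinks the pending set: high-scoring samples are accepted, mid-scoring ones are fixed against the truth labels, and only the low scorers go back through the (notionally retrained) model.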
In summary, an embodiment of the present invention provides a method for data annotation, where the method may include the steps shown in fig. 2:
step S201: an annotation model is generated by training according to a first data set, wherein the first sample data in the first data set has one or more labels.
Step S202: and acquiring a second data set to be labeled, labeling the second data set by using the labeling model, determining a label of second sample data in the second data set and a position of the label in the second sample data by using the labeling model, and labeling the position to obtain a labeled third data set.
Step S203: and checking third sample data in the third data set, and determining the matching degree of the label of the third sample data in the third data set compared with the true value label.
Step S204: according to the truth label, correcting third sample data of which the matching degree is smaller than a first threshold and larger than a second threshold; wherein the second threshold is less than the first threshold.
Step S205: and updating the annotation model according to the corrected third sample data and the third sample data of which the matching degree is not less than the first threshold in the third data set.
Step S206: and correcting the third sample data with the matching degree smaller than the second threshold value by using the updated model.
As shown in fig. 3, an embodiment of the present invention provides an apparatus 300 for data annotation, including: a model training module 301, a labeling module 302, and a correction module 303; wherein:
the model training module 301 is configured to generate a labeling model according to training of a first data set, where the first sample data in the first data set has one or more labels;
the labeling module 302 is configured to obtain a second data set to be labeled, label the second data set by using the labeling model, determine a label of second sample data in the second data set and a position of the label in the second sample data by using the labeling model, and mark the position to obtain a labeled third data set;
the correcting module 303 is configured to verify third sample data in the third data set, determine a matching degree of a label of the third sample data in the third data set compared with a true label, and correct the third sample data whose matching degree is smaller than a first threshold, so as to obtain a labeling result corresponding to the second data set.
In an embodiment of the present invention, the correcting module 303 is configured to correct, according to the truth label, third sample data of which the matching degree is smaller than a first threshold and larger than a second threshold; wherein the second threshold is less than the first threshold.
In an embodiment of the present invention, the model training module 301 is further configured to update the annotation model according to the corrected third sample data and the third sample data in the third data set, where the matching degree is not smaller than the first threshold.
In an embodiment of the present invention, the labeling module 302 is configured to correct the third sample data with the matching degree smaller than the second threshold value by using the updated model.
In an embodiment of the present invention, the correcting module 303 is configured to, when a plurality of third sample data with the matching degree smaller than the first threshold are provided, correct the third sample data in sequence according to the similarity between the plurality of third sample data.
An embodiment of the present invention further provides an electronic device, including: one or more processors; storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to carry out a method according to any one of the preceding embodiments.
An embodiment of the present invention further provides a computer-readable medium, on which a computer program is stored, where the computer program is configured to, when executed by a processor, implement the method according to any one of the above embodiments.
FIG. 4 illustrates an exemplary system architecture 400 of a data annotation apparatus or method to which embodiments of the invention may be applied.
As shown in fig. 4, the system architecture 400 may include terminal devices 401, 402, 403, a network 404, and a server 405. The network 404 serves as a medium providing communication links between the terminal devices 401, 402, 403 and the server 405. Network 404 may include various types of connections, such as wired or wireless communication links or fiber-optic cables.
A user may use terminal devices 401, 402, 403 to interact with a server 405 over a network 404 to receive or send messages or the like. The terminal devices 401, 402, 403 may have various communication client applications installed thereon, such as shopping applications, web browser applications, search applications, instant messaging tools, mailbox clients, social platform software, and the like.
The terminal devices 401, 402, 403 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 405 may be a server that provides various services, such as a background management server that supports shopping websites browsed by users using the terminal devices 401, 402, and 403. The background management server may analyze and perform other processing on the received data such as the product information query request, and feed back a processing result (e.g., target push information and product information) to the terminal device.
It should be noted that the method for data annotation provided by the embodiment of the present invention is generally executed by the server 405, and accordingly, the apparatus for data annotation is generally disposed in the server 405.
It should be understood that the number of terminal devices, networks, and servers in fig. 4 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 5, shown is a block diagram of a computer system 500 suitable for use with a terminal device implementing an embodiment of the present invention. The terminal device shown in fig. 5 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 5, the computer system 500 includes a Central Processing Unit (CPU)501 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)502 or a program loaded from a storage section 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data necessary for the operation of the system 500 are also stored. The CPU 501, ROM 502, and RAM 503 are connected to each other via a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
The following components are connected to the I/O interface 505: an input portion 506 including a keyboard, a mouse, and the like; an output portion 507 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage portion 508 including a hard disk and the like; and a communication section 509 including a network interface card such as a LAN card, a modem, or the like. The communication section 509 performs communication processing via a network such as the internet. The driver 510 is also connected to the I/O interface 505 as necessary. A removable medium 511 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 510 as necessary, so that a computer program read out therefrom is mounted into the storage section 508 as necessary.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 509, and/or installed from the removable medium 511. The computer program performs the above-described functions defined in the system of the present invention when executed by the Central Processing Unit (CPU) 501.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present invention may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described as: a processor comprising a model training module, a labeling module, and a correction module. The names of these modules do not, in some cases, limit the modules themselves; for example, the model training module may also be described as "a module that trains and generates an annotation model".
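The three-module decomposition described above can be sketched as follows. This is a minimal illustration only: all class and method names are assumptions made for the example, the patent does not specify an API, and the "model" here is a toy lookup table standing in for a trained annotation model.

```python
# Hypothetical sketch of the processor's three modules; names are illustrative.

class ModelTrainingModule:
    def train(self, first_data_set):
        """Build a toy annotation model from (sample, labels) pairs."""
        return {sample: labels for sample, labels in first_data_set}

class LabelingModule:
    def label(self, model, second_data_set):
        """Label each second sample with the model, yielding the third data set."""
        return [(sample, model.get(sample, [])) for sample in second_data_set]

class CorrectionModule:
    def correct(self, third_data_set, truth_labels, first_threshold=1.0):
        """Replace labels whose match against the truth falls below the threshold."""
        corrected = []
        for sample, labels in third_data_set:
            truth = truth_labels.get(sample, labels)
            match = 1.0 if labels == truth else 0.0  # crude matching degree
            corrected.append((sample, truth if match < first_threshold else labels))
        return corrected
```

In this sketch the correction module compares each labeled sample directly against its truth labels, mirroring the checking step of the method, while training and labeling remain separate responsibilities.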
As another aspect, the present invention also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments, or may be separate and not incorporated into the apparatus. The computer readable medium carries one or more programs which, when executed by a device, cause the device to: train an annotation model on a first data set, wherein first sample data in the first data set has one or more labels; acquire a second data set to be labeled and label the second data set with the annotation model, determining a label of second sample data in the second data set and the position of the label in the second sample data, and marking that position, to obtain a labeled third data set; check third sample data in the third data set and determine the matching degree of the labels of the third sample data compared with the truth labels; and correct third sample data whose matching degree is smaller than a first threshold, to obtain a labeling result corresponding to the second data set.
According to the technical scheme of the embodiments of the present invention, a labeling model is generated by training on a first data set consisting of first sample data with one or more labels. The labeling model then labels the second sample data to be labeled: during labeling, both the labels of the second sample data and the positions of those labels within the second sample data are determined, and both the labels and their positions are marked. As a result, when the labeled third sample data is corrected, the marked positions can be checked directly to determine the matching degree against the truth labels, and third sample data whose matching degree is smaller than a first threshold can then be corrected. Because the labeling model marks the position corresponding to each label, the marked position can be inspected directly during correction, which saves the labor and time consumed by correction and improves labeling efficiency. Furthermore, the labeling model can accurately mark all or part of the labels of the second sample data, so a user does not need to mark each label of every second sample data one by one, further saving labeling labor and time and further improving labeling efficiency.
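The correction flow summarized above, refined by claims 2 and 4 into a two-threshold triage, can be sketched as follows. The concrete threshold values (0.9 and 0.5) and the overlap-based matching degree are assumptions made for the example; the patent does not fix a particular metric.

```python
# Illustrative two-threshold triage of labeled samples; thresholds are assumed.

FIRST_THRESHOLD = 0.9   # below this, a labeled sample needs correction
SECOND_THRESHOLD = 0.5  # below this, correction is deferred to the updated model

def matching_degree(predicted, truth):
    """Fraction of truth (label, position) pairs reproduced by the prediction."""
    if not truth:
        return 1.0
    return sum(1 for pair in truth if pair in predicted) / len(truth)

def triage(third_data_set):
    """Split labeled samples by matching degree, as in claims 1, 2 and 4."""
    accepted, correct_by_truth, correct_by_updated_model = [], [], []
    for predicted, truth in third_data_set:
        degree = matching_degree(predicted, truth)
        if degree >= FIRST_THRESHOLD:
            accepted.append(predicted)               # matching degree high enough
        elif degree >= SECOND_THRESHOLD:
            correct_by_truth.append(truth)           # corrected against truth labels
        else:
            correct_by_updated_model.append(predicted)  # re-labeled by updated model
    return accepted, correct_by_truth, correct_by_updated_model
```

Samples in the middle band are corrected against the truth labels, while samples below the second threshold are set aside to be re-labeled once the model has been updated with the corrected data.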
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method of data annotation, comprising:
training to generate an annotation model according to a first data set, wherein the first sample data in the first data set has one or more labels;
acquiring a second data set to be labeled, labeling the second data set by using the labeling model, determining a label of second sample data in the second data set and a position of the label in the second sample data by using the labeling model, and labeling the position to obtain a labeled third data set;
checking third sample data in the third data set, and determining the matching degree of the label of the third sample data in the third data set compared with the true value label;
and correcting the third sample data with the matching degree smaller than a first threshold value to obtain a labeling result corresponding to the second data set.
2. The method according to claim 1, wherein the correcting of third sample data whose matching degree is smaller than a first threshold comprises:
according to the truth label, correcting third sample data of which the matching degree is smaller than a first threshold and larger than a second threshold; wherein the second threshold is less than the first threshold.
3. The method of claim 2, further comprising:
and updating the annotation model according to the corrected third sample data and the third sample data of which the matching degree is not less than the first threshold in the third data set.
4. The method according to claim 3, wherein the correcting of third sample data whose matching degree is smaller than a first threshold comprises:
correcting third sample data with the matching degree smaller than the second threshold value by using the updated model;
and/or,
the correcting of the third sample data whose matching degree is smaller than a first threshold comprises:
and when a plurality of third sample data with the matching degree smaller than a first threshold value exist, correcting the third sample data in sequence according to the similarity among the plurality of third sample data.
5. An apparatus for annotating data, comprising: a model training module, a labeling module, and a correction module; wherein,
the model training module is used for training on a first data set to generate a labeling model, wherein first sample data in the first data set has one or more labels;
the labeling module is used for acquiring a second data set to be labeled, labeling the second data set by using the labeling model, determining a label of second sample data in the second data set and a position of the label in the second sample data by using the labeling model, and labeling the position to obtain a labeled third data set;
the correction module is configured to check third sample data in the third data set, determine a matching degree of a label of the third sample data in the third data set compared with a true label, and correct the third sample data whose matching degree is smaller than a first threshold, so as to obtain a labeling result corresponding to the second data set.
6. The apparatus of claim 5,
the correction module is configured to correct, according to the truth label, third sample data of which the matching degree is smaller than a first threshold and larger than a second threshold; wherein the second threshold is less than the first threshold.
7. The apparatus of claim 6,
the model training module is further configured to update the annotation model according to the corrected third sample data and the third sample data in the third data set, where the matching degree is not less than the first threshold.
8. The apparatus of claim 7,
the labeling module is configured to correct, by using the updated model, the third sample data of which the matching degree is smaller than the second threshold;
and/or,
the correction module is configured to, when there are a plurality of third sample data whose matching degree is smaller than the first threshold, correct the third sample data in sequence according to the similarity among the plurality of third sample data.
9. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-4.
10. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-4.
CN201911106007.0A 2019-11-13 2019-11-13 Data labeling method and device Pending CN111104479A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911106007.0A CN111104479A (en) 2019-11-13 2019-11-13 Data labeling method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911106007.0A CN111104479A (en) 2019-11-13 2019-11-13 Data labeling method and device

Publications (1)

Publication Number Publication Date
CN111104479A true CN111104479A (en) 2020-05-05

Family

ID=70420468

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911106007.0A Pending CN111104479A (en) 2019-11-13 2019-11-13 Data labeling method and device

Country Status (1)

Country Link
CN (1) CN111104479A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112163424A (en) * 2020-09-17 2021-01-01 中国建设银行股份有限公司 Data labeling method, device, equipment and medium
CN112183321A (en) * 2020-09-27 2021-01-05 深圳奇迹智慧网络有限公司 Method and device for optimizing machine learning model, computer equipment and storage medium
CN112445831A (en) * 2021-02-01 2021-03-05 南京爱奇艺智能科技有限公司 Data labeling method and device
CN112836732A (en) * 2021-01-25 2021-05-25 深圳市声扬科技有限公司 Data annotation verification method and device, electronic equipment and storage medium
CN112926621A (en) * 2021-01-21 2021-06-08 百度在线网络技术(北京)有限公司 Data labeling method and device, electronic equipment and storage medium
CN114548192A (en) * 2020-11-23 2022-05-27 千寻位置网络有限公司 Sample data processing method and device, electronic equipment and medium
CN114612699A (en) * 2022-03-10 2022-06-10 京东科技信息技术有限公司 Image data processing method and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190034823A1 (en) * 2017-07-27 2019-01-31 Getgo, Inc. Real time learning of text classification models for fast and efficient labeling of training data and customization
CN109582793A (en) * 2018-11-23 2019-04-05 深圳前海微众银行股份有限公司 Model training method, customer service system and data labeling system, readable storage medium storing program for executing
CN109886211A (en) * 2019-02-25 2019-06-14 北京达佳互联信息技术有限公司 Data mask method, device, electronic equipment and storage medium
CN109934227A (en) * 2019-03-12 2019-06-25 上海兑观信息科技技术有限公司 System for recognizing characters from image and method
CN110110811A (en) * 2019-05-17 2019-08-09 北京字节跳动网络技术有限公司 Method and apparatus for training pattern, the method and apparatus for predictive information
CN110378396A (en) * 2019-06-26 2019-10-25 北京百度网讯科技有限公司 Sample data mask method, device, computer equipment and storage medium

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112163424A (en) * 2020-09-17 2021-01-01 中国建设银行股份有限公司 Data labeling method, device, equipment and medium
CN112183321A (en) * 2020-09-27 2021-01-05 深圳奇迹智慧网络有限公司 Method and device for optimizing machine learning model, computer equipment and storage medium
CN114548192A (en) * 2020-11-23 2022-05-27 千寻位置网络有限公司 Sample data processing method and device, electronic equipment and medium
CN112926621A (en) * 2021-01-21 2021-06-08 百度在线网络技术(北京)有限公司 Data labeling method and device, electronic equipment and storage medium
CN112926621B (en) * 2021-01-21 2024-05-10 百度在线网络技术(北京)有限公司 Data labeling method, device, electronic equipment and storage medium
CN112836732A (en) * 2021-01-25 2021-05-25 深圳市声扬科技有限公司 Data annotation verification method and device, electronic equipment and storage medium
CN112836732B (en) * 2021-01-25 2024-04-19 深圳市声扬科技有限公司 Verification method and device for data annotation, electronic equipment and storage medium
CN112445831A (en) * 2021-02-01 2021-03-05 南京爱奇艺智能科技有限公司 Data labeling method and device
CN114612699A (en) * 2022-03-10 2022-06-10 京东科技信息技术有限公司 Image data processing method and device

Similar Documents

Publication Publication Date Title
CN111104479A (en) Data labeling method and device
CN107832045B (en) Method and apparatus for cross programming language interface conversion
CN108628830B (en) Semantic recognition method and device
CN109871311B (en) Method and device for recommending test cases
CN106919711B (en) Method and device for labeling information based on artificial intelligence
CN109359194B (en) Method and apparatus for predicting information categories
US9588952B2 (en) Collaboratively reconstituting tables
CN113377653B (en) Method and device for generating test cases
CN110705271B (en) System and method for providing natural language processing service
CN113626223A (en) Interface calling method and device
CN112988583A (en) Method and device for testing syntax compatibility of database
CN113760276A (en) Method and device for generating page code
CN113448869B (en) Method and device for generating test case, electronic equipment and computer readable medium
CN112818026A (en) Data integration method and device
CN112486482A (en) Page display method and device
CN113590756A (en) Information sequence generation method and device, terminal equipment and computer readable medium
CN109710634B (en) Method and device for generating information
CN107423271B (en) Document generation method and device
CN109871856B (en) Method and device for optimizing training sample
CN110647623B (en) Method and device for updating information
CN109857838B (en) Method and apparatus for generating information
CN114170451A (en) Text recognition method and device
CN110796137A (en) Method and device for identifying image
CN113742321A (en) Data updating method and device
CN112579080A (en) Method and device for generating user interface code

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220926

Address after: 25 Financial Street, Xicheng District, Beijing 100033

Applicant after: CHINA CONSTRUCTION BANK Corp.

Address before: 25 Financial Street, Xicheng District, Beijing 100033

Applicant before: CHINA CONSTRUCTION BANK Corp.

Applicant before: Jianxin Financial Science and Technology Co.,Ltd.