CN115146716A - Labeling method, device, equipment, storage medium and program product - Google Patents


Info

Publication number: CN115146716A
Authority: CN (China)
Prior art keywords: data, target, data set, annotation, labeling
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number: CN202210713931.0A
Other languages: Chinese (zh)
Inventors: 袁松岭, 王子璇, 文心杰, 王晓利, 郭伟东, 刘雅良, 孟祥磊
Current Assignee: Tencent Technology Shenzhen Co Ltd (the listed assignees may be inaccurate)
Original Assignee: Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202210713931.0A
Publication of CN115146716A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a labeling method, device, equipment, storage medium and program product, which belong to the technical field of artificial intelligence and comprise the following steps: acquiring a first raw data set, wherein the raw data are unlabeled data; determining a plurality of first target raw data and a plurality of second target raw data in the first raw data set, wherein the first target raw data and the second target raw data are data whose degree of containing target information meets a preset requirement; acquiring truth data corresponding to each of the first target raw data, wherein the plurality of first target raw data comprise a data set for learning annotation and a data set for verifying learning effect; generating a plurality of annotation cases according to the first target raw data and the corresponding truth data; and acquiring an annotation result for at least one second target raw datum based on the plurality of annotation cases. The embodiment of the application enables fast labeling while ensuring labeling accuracy.

Description

Labeling method, device, equipment, storage medium and program product
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a labeling method, apparatus, device, storage medium, and program product.
Background
The acquisition of labeled data is a prerequisite for advancing artificial intelligence technology and realizing machine learning; the training of various intelligent models cannot proceed without the support of labeled data. However, acquiring labeled data depends to a great extent on manual labeling by annotators, and annotators usually rely on written labeling rules. The expressive capability of text labeling rules is low, and problems such as incomplete rule coverage and inaccurate rule expression may also exist. Moreover, learning the text rules is time-consuming and labor-intensive for annotators, who also need to communicate extensively with the personnel who drafted the rules. As a result, the acquisition of labeled data takes a long time and is inefficient.
Disclosure of Invention
The embodiments of the application provide a labeling method, device, equipment, storage medium, and program product, which can improve labeling accuracy and reduce the time consumed by labeling.
According to an aspect of the embodiments of the present application, there is provided an annotation method, including:
acquiring a first raw data set, wherein the raw data in the first raw data set are unlabeled data;
determining a plurality of first target raw data and a plurality of second target raw data in the first raw data set, wherein the first target raw data and the second target raw data are both data whose degree of containing target information meets a preset requirement, and the target information represents information contained in the first raw data set;
acquiring truth data corresponding to each of the plurality of first target raw data, wherein the plurality of first target raw data comprise a data set for learning annotation and a data set for verifying learning effect;
generating a plurality of annotation cases according to each of the first target raw data and the corresponding truth data, specifically including according to the data set for learning annotation and its corresponding truth data, and according to the data set for verifying learning effect and its corresponding truth data;
and acquiring an annotation result for at least one second target raw datum based on the plurality of annotation cases.
According to an aspect of an embodiment of the present application, there is provided an annotation apparatus, including:
a first raw data set acquisition module, configured to acquire a first raw data set, wherein the raw data in the first raw data set are unlabeled data;
a data screening module, configured to determine a plurality of first target raw data and a plurality of second target raw data in the first raw data set, wherein the first target raw data and the second target raw data are both data whose degree of containing target information meets a preset requirement, and the target information represents information contained in the first raw data set;
a truth acquisition module, configured to acquire truth data corresponding to each of the plurality of first target raw data, wherein the plurality of first target raw data comprise a data set for learning annotation and a data set for verifying learning effect;
a case generation module, configured to generate a plurality of annotation cases according to each of the first target raw data and the corresponding truth data, specifically including according to the data set for learning annotation and its corresponding truth data, and according to the data set for verifying learning effect and its corresponding truth data;
and an annotation module, configured to acquire an annotation result for at least one second target raw datum based on the plurality of annotation cases.
According to an aspect of embodiments of the present application, there is provided a computer device, the computer device comprising a processor and a memory, the memory having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by the processor to implement the above-mentioned labeling method.
According to an aspect of embodiments of the present application, there is provided a computer-readable storage medium having at least one instruction, at least one program, a set of codes, or a set of instructions stored therein, which is loaded and executed by a processor to implement the above-mentioned labeling method.
According to an aspect of embodiments herein, there is provided a computer program product comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the labeling method.
The technical scheme provided by the embodiment of the application can bring the following beneficial effects:
The embodiment of the application provides a labeling method that determines a number of highly typical, highly representative raw data from a large amount of raw data, obtains annotation cases by labeling these raw data, and provides the annotation cases for annotators to learn on their own. Annotators can then label other raw data on the basis of the learned cases. The information expression capability of annotation cases is far higher than that of text rules, and learning from annotation cases makes full use of the multi-dimensional learning capability of the annotator's brain, improving labeling speed and labeling accuracy, saving the communication and learning time of text rules, and significantly improving labeling efficiency.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application, and those skilled in the art can obtain other drawings based on these drawings without creative effort.
FIG. 1 is a schematic diagram of an application execution environment provided by one embodiment of the present application;
FIG. 2 is a flow chart of a labeling method provided in an embodiment of the present application;
FIG. 3 illustrates a raw data screening diagram;
FIG. 4 illustrates a core set screening diagram;
FIG. 5 is a schematic flow chart of labeling provided in an embodiment of the present application;
FIG. 6 is a schematic diagram of a visualization result of an annotation platform provided in an embodiment of the present application;
FIG. 7 illustrates a diagram of a practice problem;
FIG. 8 is a diagram illustrating examination questions;
FIG. 9 is a schematic diagram of an annotation case;
FIG. 10 illustrates a block diagram of an annotation device;
fig. 11 is a block diagram of a computer device according to an embodiment of the present application.
Detailed Description
Before describing the method embodiments provided in the present application, relevant terms that may appear in the method embodiments are briefly explained so that they are easily understood by those skilled in the art.
BERT (Bidirectional Encoder Representations from Transformers) is a large-scale text pre-training model; BERT uses a 12-layer Transformer encoder to improve the baseline performance of natural language processing tasks. Compared with word2vec (word vectors), BERT, pre-trained on massive text, can introduce more transferable knowledge into a classification algorithm and provide more accurate text features.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the implementation method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, involving both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big-data processing technology, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technology mainly comprises computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
Machine Learning (ML) is a multi-disciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It specializes in studying how computers can simulate or realize human learning behavior to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and formula learning.
Deep learning: the concept of deep learning stems from the study of artificial neural networks. A multilayer perceptron with multiple hidden layers is a deep-learning structure. Deep learning combines low-level features to form more abstract high-level representations (attribute categories or features) so as to discover distributed feature representations of data.
Computer Vision (CV) technology: computer vision is a science that studies how to make machines "see"; more specifically, it refers to using cameras and computers instead of human eyes to recognize, track, and measure targets, and further performing graphics processing so that the result becomes an image more suitable for human observation or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technology generally includes image processing, image recognition, image semantic understanding, image retrieval, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric recognition technologies such as face recognition and fingerprint recognition.
The key technologies of Speech Technology are Automatic Speech Recognition (ASR), speech synthesis (Text To Speech, TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the development direction of future human-computer interaction, and voice is expected to become one of the preferred human-computer interaction modes in the future.
Natural language processing is an important direction in the fields of computer science and artificial intelligence. It studies the theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science integrating linguistics, computer science, and mathematics. Research in this field involves natural language, i.e., the language people use daily, so it is closely related to the study of linguistics. Natural language processing technology typically includes text processing, semantic understanding, machine translation, robot question answering, knowledge graphs, and other technologies.
CNN (Convolutional Neural Network): a convolutional neural network is a feed-forward neural network that involves convolution computation and has a deep structure. It is one of the representative algorithms of deep learning; it has representation-learning capability and can perform shift-invariant classification of input information according to its hierarchical structure.
Raw data: original data that has not yet been labeled with a result.
Contrastive Learning: contrastive learning is a machine-learning method that teaches a model which things are similar and which are dissimilar. With this approach, a machine-learning model can be trained to distinguish between similar and dissimilar data samples.
Active Learning: active learning can select, through a certain algorithm, the most representative, most confusing, and most informative samples in the raw data.
Annotator: a worker who provides annotation results on the annotation platform.
Demander: a person who issues annotation tasks on the annotation platform and needs the annotation results.
Case annotation rules: typical samples are selected by an algorithm, the demander labels the samples with answers and annotations, and annotators learn the annotation rules through the example data carrying answers and annotations.
Cloud technology refers to a hosting technology that unifies a series of resources, such as hardware, software, and networks, in a wide area network or a local area network to realize computation, storage, processing, and sharing of data.
Cloud technology is a general term for the network technology, information technology, integration technology, management platform technology, application technology, and the like that are applied based on the cloud computing business model; it can form a resource pool that is used on demand, flexibly and conveniently. Cloud computing technology will become an important support. Background services of technical network systems, such as video websites, picture websites, and more web portals, require a large amount of computing and storage resources. With the rapid development and application of the internet industry, each article may come to have its own identification mark that needs to be transmitted to a background system for logical processing; data at different levels are processed separately, and all kinds of industrial data need strong system background support, which can only be realized through cloud computing.
In the related art, in fields such as artificial intelligence, machine learning, and neural networks, labeled data must first be acquired in order to train a model. In the labeling process, the demander generally first provides the data to be labeled, lists detailed rules for labeling the data, and carries out a trial-labeling process. For example, when a labeled data set needs to be acquired, corresponding sub-task descriptions need to be generated according to the labeling rules for each type of data in the data set, and annotators label according to these rules. The demander and the annotators then need to run in a shared understanding of the rules, which may take 2-9 days. In most cases, running in the rules requires more time than the annotators spend actually labeling the data; otherwise the delivered data cannot pass acceptance.
For the labeling of small-sample data in particular, formulating the labeling rules is complex and tedious, and when the labeling quantity is small, rule formulation takes up a longer share of the whole labeling process. As a result, labeling efficiency is low: the preparatory work before labeling takes longer, which in turn affects the overall labeling efficiency.
In the actual labeling process, annotators usually encounter data samples that no rule covers or whose boundaries are ambiguous, which adds to the communication needed during labeling. Depending on labeling difficulty, this process usually requires multiple rounds of communication; for some complicated labeling tasks, differences in background knowledge and modes of expression between annotators and the demander lead to deviations in understanding the written rules, which harms the efficiency of labeling communication and prolongs the overall labeling period. According to statistics from some labeling platforms, the rule running-in time is basically 2-9 days, while most data labeling can be completed within 7 days once the rules have been run in successfully. In addition, some demanders who formulate rules can give a small number of labeled cases, but manually selecting typical cases is time-consuming and insufficiently comprehensive, and the time cost of communication is still difficult to avoid completely.
In view of this, the embodiment of the present application provides a labeling method in which a number of highly typical, highly representative raw data are determined from a large amount of raw data, annotation cases are obtained by labeling these raw data, and the annotation cases are provided for annotators to learn on their own. Annotators can then label other raw data on the basis of the learned cases. The information expression capability of annotation cases is far higher than that of text rules, and learning from annotation cases makes full use of the multi-dimensional learning capability of the annotator's brain, improving labeling speed and labeling accuracy, saving the communication and learning time of text rules, and significantly improving labeling efficiency.
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Referring to fig. 1, a schematic diagram of an application execution environment according to an embodiment of the present application is shown. The application execution environment may include: a terminal 10 and a server 20.
The terminal 10 includes, but is not limited to, a mobile phone, a computer, an intelligent voice interaction device, an intelligent appliance, a vehicle-mounted terminal, a game console, an electronic book reader, a multimedia playing device, a wearable device, and other electronic devices. A client of the application may be installed in the terminal 10.
In the embodiment of the present application, the application program may be any application capable of providing an annotation service. Optionally, a client of the above application program runs in the terminal 10. The server 20 is used to provide background services for clients of the application in the terminal 10. For example, the server 20 may be a backend server of the above application. The server 20 may be an independent physical server, a server cluster or distributed system formed by multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network), and big-data and artificial intelligence platforms. Optionally, the server 20 provides background services for the applications in multiple terminals 10 simultaneously.
Alternatively, the terminal 10 and the server 20 may communicate with each other through the network 30. The terminal 10 and the server 20 may be directly or indirectly connected through wired or wireless communication, and the present application is not limited thereto.
Referring to fig. 2, a flowchart of a labeling method provided in an embodiment of the present application is shown. The method can be applied to a computer device, which refers to an electronic device with data computation and processing capabilities; for example, the execution subject of each step can be the server 20 in the application execution environment shown in fig. 1. The method may comprise the following steps:
s101, a first life data set is obtained, and the life data in the first life data set are unlabeled data.
The first raw data set is not limited in the embodiment of the present application, and may be considered as an original set formed by raw data for labeling, for example, if a puppy in a picture needs to be labeled, the raw data is various pictures that may contain the puppy. Of course, the first raw data set may be formed by screening an original set of raw data for labeling and screening several pieces of representative raw data.
Specifically, a second raw data set may be obtained, where the raw data in the second raw data set is original data that is not labeled. And performing feature extraction on each raw data in the second raw data set to obtain corresponding feature information. The first raw data set is determined from the second raw data set based on each of the feature information. The second raw data set may be considered as an original set formed by raw data for labeling.
The embodiment of the present application does not limit the specific method for performing feature extraction or for determining the first raw data set within the second raw data set. For example, the first raw data set can be formed by combining Contrastive Learning and Active Learning to select representative, typical raw data from the second raw data set. Specifically, a first model for performing feature extraction on each piece of raw data in the second raw data set may be obtained through contrastive-learning training, and a second model for determining the first raw data set within the second raw data set based on each piece of feature information may be obtained through active-learning training.
In order to obtain a better semantic representation (feature extraction) of the unlabeled raw data, the embodiment of the application can introduce contrastive learning. The core of contrastive learning is to construct positive and negative sample sets; for example, the image field generally applies data augmentation operations such as rotation and cropping, while the text field often applies methods such as translation and character insertion or deletion. By pulling similar samples close and pushing dissimilar samples apart, a good semantic expression space is learned from the samples, thereby improving the accuracy of feature extraction.
To obtain representative, typical data, embodiments of the present application may also incorporate active learning. Active learning recognizes that not all data are equally valuable, and can discover which data in the second raw data set are more valuable and carry more information, thereby screening out the first raw data set. Labeling is then performed on the typical data in the first raw data set to obtain a better labeling result.
Active learning is mostly applied when a small amount of labeled data already exists, in which case more valuable data are selected by a trained active-learning model. However, in a cold start without any labeled data, it is difficult for an active-learning algorithm to achieve good performance. The embodiment of the application can therefore combine contrastive learning and active learning: a good semantic representation (feature extraction) is obtained through unsupervised contrastive-learning training, and representative data are then selected through active learning, thereby solving the cold-start problem.
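As a minimal sketch of this combination (an illustration under assumed inputs, not the patent's reference implementation), the following Python fragment pairs a stand-in contrastive encoder with a simple farthest-point selection rule; encode() and all names are hypothetical placeholders:

import numpy as np

def encode(raw_items):
    # Placeholder for a contrastively fine-tuned encoder (e.g. BERT/ResNet):
    # returns one embedding vector per unlabeled raw item.
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(raw_items), 768))

def select_representative(emb, budget):
    # Greedy farthest-point traversal: each new pick is the point
    # farthest from everything selected so far.
    picked = [0]
    dist = np.linalg.norm(emb - emb[0], axis=1)
    while len(picked) < budget:
        nxt = int(np.argmax(dist))
        picked.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(emb - emb[nxt], axis=1))
    return picked

raw_items = ["unlabeled sample %d" % i for i in range(1000)]
first_raw_data_set = select_representative(encode(raw_items), budget=50)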
In one embodiment, please refer to fig. 3, which shows a schematic diagram of raw data screening. For unlabeled raw data, a BERT model or a ResNet model can first be fine-tuned using contrastive learning; vector representations of the text or pictures are then obtained from the trained model, realizing feature extraction, and typical raw data are then selected from the feature-extraction results by an active-learning method. Contrastive learning and active learning are detailed separately below:
(1) Contrastive learning
For text-type raw data, Unsupervised SimCSE can be used to train the BERT model and obtain text vector representations. Unsupervised SimCSE is an unsupervised contrastive-learning model whose main idea is to use dropout noise as data augmentation for unsupervised contrastive training. The mathematical representation is

$h_i^{z} = f_\theta(x_i, z)$ and $h_i^{z'} = f_\theta(x_i, z')$,

where $f_\theta$ denotes the encoder mapping, and $h_i^{z}$ and $h_i^{z'}$ denote the two different vector representations of the same input $x_i$ under different dropout masks $z$ and $z'$. When the encoder is trained batch by batch, the training objective function is:

$\ell_i = -\log \dfrac{e^{\mathrm{sim}(h_i^{z_i}, h_i^{z'_i})/\tau}}{\sum_{j=1}^{N} e^{\mathrm{sim}(h_i^{z_i}, h_j^{z'_j})/\tau}}$

where $\tau$ is a temperature hyperparameter (for example, 0.05), $e$ denotes the natural exponential, $\mathrm{sim}(\cdot,\cdot)$ denotes cosine similarity, and $N$ denotes the amount of training data in a batch.
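A minimal PyTorch sketch of this objective follows (an illustration under assumptions, not the patent's code): the same batch is passed through the encoder twice, yielding two views that differ only by dropout noise, and the diagonal of the similarity matrix holds the positive pairs:

import torch
import torch.nn.functional as F

def simcse_loss(h1, h2, tau=0.05):
    # h1, h2: (N, d) embeddings of the same N sentences under two
    # different dropout masks (two forward passes of the encoder).
    h1 = F.normalize(h1, dim=-1)
    h2 = F.normalize(h2, dim=-1)
    sim = h1 @ h2.T / tau                  # (N, N) cosine similarities
    labels = torch.arange(h1.size(0))      # positives sit on the diagonal
    # row-wise cross-entropy reproduces the objective above
    return F.cross_entropy(sim, labels)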
In some embodiments, the training process of the contrastive-learning model can also be used to obtain representations that are as similar as possible for similar text content and as dissimilar as possible for dissimilar content. A better semantic representation of unlabeled, sparse raw data can be obtained by the contrastive-learning model produced by fine-tuning BERT with the Unsupervised SimCSE algorithm, laying a foundation for actively learning and selecting typical raw data.
For image-type raw data, a typical neural network ResNet model can be trained with the SimCLR algorithm, a contrastive-learning framework for visual representations, and image vector representations are obtained from the trained model to complete feature extraction. Suppose the input picture is $x$; two rounds of image augmentation by an image-processing tool yield pictures $x_i$ and $x_j$, respectively. The vector representations of the pictures are $h_i = f(x_i) = \mathrm{ResNet}(x_i)$ and $z_i = g(h_i)$, where $g(\cdot)$ is a single-hidden-layer multilayer perceptron. When training batch by batch, the training objective function is:

$\ell_{i,j} = -\log \dfrac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\mathrm{sim}(z_i, z_k)/\tau)}$

where $\mathbb{1}_{[k \neq i]} \in \{0,1\}$ is an indicator function whose value is 1 if $k \neq i$, $\tau$ is a temperature hyperparameter, and $N$ denotes the amount of training data in the batch.
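The SimCLR (NT-Xent) objective above can be sketched in the same spirit; this PyTorch rendering is a plausible reading of the formula, not the patent's implementation:

import torch
import torch.nn.functional as F

def nt_xent_loss(z_i, z_j, tau=0.5):
    # z_i, z_j: (N, d) projections of two augmentations of the same N images.
    n = z_i.size(0)
    z = F.normalize(torch.cat([z_i, z_j], dim=0), dim=-1)   # (2N, d)
    sim = z @ z.T / tau
    sim.masked_fill_(torch.eye(2 * n, dtype=torch.bool), float("-inf"))  # the 1[k != i] term
    # the positive of each row is its counterpart from the other view
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)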
(2) Active learning
After a better semantic representation is obtained through contrastive learning, the method integrates multiple active-learning methods, such as clustering and the kernel-set method CoreSet, to complete the screening of raw data. The clustering algorithm and the CoreSet algorithm are briefly introduced as examples:
the clustering algorithm is multiple, can be selected at will, takes a K-means algorithm as an example, is a clustering analysis algorithm for iterative solution, and divides data into K groups. The method comprises the steps of randomly selecting K objects as initial clustering centers, then calculating the distance between each object and each seed clustering center, and allocating each object to the nearest clustering center. The cluster centers and the objects assigned to them represent a cluster. The cluster center of a cluster is recalculated for each sample assigned, based on the existing objects in the cluster. This process will be repeated until some termination condition is met. The termination condition may be that no (or minimum number) objects are reassigned to different clusters, no (or minimum number) cluster centers are changed again, and the sum of squared errors is locally minimal. And finally, the data corresponding to the K clustering centers is obtained and is used as the selected typical data.
The CoreSet algorithm performs data screening by constructing a core set, with the following goal: the selected subset contains the least redundant information while covering, to the greatest extent, the information of the unselected set. That is, the selected core-set data represent as much of the full data as possible. As shown in FIG. 4, each center point of the core set represents the data information within a radius $\delta_s$ around it, and the distance between points represents similarity. Selecting the core-set data can therefore be cast as choosing $b$ center points that minimize the maximum distance between any data point and its nearest center. The mathematical expression is:

$\min_{s^{1} : |s^{1}| \le b} \; \max_{i} \; \min_{j \in s^{1} \cup s^{0}} \Delta(x_i, x_j)$

where $s^{0}$ and $s^{1}$ are the already-selected and newly selected data sets, respectively. The solution process of the algorithm can be summarized as follows: compute, for every sample in the unselected data set U, the minimum distance to the selected data set L, and merge into L the samples with the maximum such distance, iterating in this way. The pseudo-code of this algorithm is as follows:

Input: selected data set $L_0$, unselected data set U, and the total number $b$ of data to be newly selected;
Initialize $L = L_0$;
Loop:
    $u = \mathrm{argmax}_{i \in U} \min_{j \in L} \Delta(x_i, x_j)$, where $\Delta(\cdot,\cdot)$ denotes distance;
    $L = L \cup \{u\}$;
Until: $|L| = b + |L_0|$;
Return: the selected data set $L$ minus the data of the initial selected data set $L_0$.
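A direct NumPy rendering of this pseudo-code follows (a sketch; variable names are illustrative):

import numpy as np

def coreset_select(X, l0_idx, b):
    # X: (n, d) feature matrix; l0_idx: indices of the initial set L0;
    # b: number of data to be newly selected.
    min_d = np.min(
        np.linalg.norm(X[:, None, :] - X[list(l0_idx)][None, :, :], axis=-1),
        axis=1,
    )  # distance of every point to its nearest already-selected point
    new_picks = []
    for _ in range(b):
        u = int(np.argmax(min_d))          # the point farthest from L
        new_picks.append(u)
        min_d = np.minimum(min_d, np.linalg.norm(X - X[u], axis=1))
    return new_picks                       # equals L minus the initial L0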
A subset is thus chosen from the full data by the CoreSet algorithm, which removes redundant data so that the subset stays as close to, and as representative of, the entire data set as possible. Voting scoring (ensembling) is then performed on the data selected by the different active-learning algorithms (such as clustering and CoreSet). The embodiment of the present application does not limit the voting method; the final, most representative and typical data can be selected using related techniques.
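One simple voting scheme, offered only as an assumption since the patent leaves the method open, counts how many strategies nominated each sample:

from collections import Counter

def vote_select(selections, top_n):
    # selections: one index list per strategy (clustering, CoreSet, ...).
    votes = Counter(i for sel in selections for i in sel)
    return [idx for idx, _ in votes.most_common(top_n)]

# e.g. vote_select([[1, 5, 9], [5, 9, 12]], top_n=2) -> [5, 9]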
S102: a plurality of first target raw data and a plurality of second target raw data are determined in the first raw data set, wherein the first target raw data and the second target raw data are both data whose degree of containing target information meets a preset requirement, and the target information represents information contained in the first raw data set.
In this embodiment, stating that the first target raw data and the second target raw data both contain target information to a degree that satisfies a preset requirement, where the target information represents information contained in the first raw data set, means that the first target raw data and the second target raw data are both representative data selected from the first raw data set. The method for selecting the first target raw data and the second target raw data from the first raw data set may be based on the same inventive concept as the method for selecting the first raw data set from the second raw data set, and is not repeated here. For example, the foregoing method may be used to select a plurality of raw data from the first raw data set, which are then divided into two classes by manual division or random allocation, one class constituting the first target raw data and the other the second target raw data. Of course, in some embodiments, it may be required that the degree of target information contained in any one of the first target raw data is higher than the degree contained in any one of the second target raw data.
In the present application, the first raw data set is screened from the second raw data set, and the first target raw data and the second target raw data are screened from the first raw data set; both screenings can be realized based on contrastive learning and active learning. To verify the effectiveness of this screening idea, a large number of comparative experiments were performed on different business data sets and public data sets, which proved that the screened data bring an obvious improvement in data quality compared with randomly selected data, as shown in Table-1.
TABLE-1 (the table is presented as an image in the original publication and is not reproduced here)
This table is used to verify that the algorithm used to select the data is correct and effective: the performance of the data selected by the algorithm is improved compared with randomly selected data. With 10% of the data added each time, the screening method of the embodiment of the present application performs better than random selection.
S103: truth data corresponding to each of the first target raw data are acquired, wherein the plurality of first target raw data comprise a data set for learning annotation and a data set for verifying learning effect.
The truth data can be given by the party raising the labeling requirement; that is, the demander provides truth data according to its own needs. For example, if the demander needs the bolts, nuts, and the like in images to be annotated, the demander can provide labeling information for the bolts, nuts, and the like on the image-type first target raw data, and this labeling information is the truth data.
S104: a plurality of annotation cases are generated according to each of the first target raw data and the corresponding truth data, specifically including according to the data set for learning annotation and its corresponding truth data, and according to the data set for verifying learning effect and its corresponding truth data.
The annotation cases are used for the annotators to learn from, so that the annotators can fully understand the annotation requirements of the demander and can then annotate other raw data, such as the second target raw data.
In an embodiment, generating a plurality of annotation cases according to each of the first target raw data and the corresponding truth data (specifically, according to the data set for learning annotation and its corresponding truth data, and according to the data set for verifying learning effect and its corresponding truth data) includes: classifying the plurality of first target raw data to obtain a first-class first raw data set and a second-class first raw data set, wherein the first-class first raw data set is the data set for learning annotation, and the second-class first raw data set is the data set for verifying learning effect.
The annotation cases generated from the first target raw data in the first-class first raw data set and their corresponding truth data are classified into a first-class annotation case set; the annotation cases generated from the first target raw data in the second-class first raw data set and their corresponding truth data are classified into a second-class annotation case set.
S105: an annotation result for at least one second target raw datum is acquired based on the plurality of annotation cases.
Acquiring the annotation result for at least one second target raw datum based on the plurality of annotation cases includes: acquiring the annotation result for the at least one second target raw datum based on the first-class annotation case set and the second-class annotation case set.
For example, 10 cases can be formed from the truth data given by the demander, of which five belong to the first-class annotation case set and the other five to the second-class annotation case set. The annotator learns from the first-class annotation case set; the raw data in the second-class annotation case set are then shown to the annotator for labeling, and the labeling results are compared with the truth data in the second-class annotation case set. In this way the annotator's learning result can be judged, and if the learning result is good, the annotator is allowed to label the relevant raw data.
In an embodiment, the at least one second target raw datum may be displayed when a target message is obtained, wherein the target message indicates that the learning of the annotation cases has been completed; and the annotation result for the at least one second target raw datum is acquired in response to detecting an annotation operation on the at least one second target raw datum.
In an embodiment, before displaying the at least one second target raw datum when the target message is acquired, the method includes: displaying the first-class annotation case set; in response to acquiring a first message, displaying at least one first target raw datum in the second-class first raw data set; in response to acquiring a to-be-verified annotation result for the at least one first target raw datum in the second-class first raw data set, verifying the to-be-verified annotation result against the second-class annotation cases; and acquiring the target message when the verification passes.
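The verify-then-unlock flow just described can be sketched as follows; the 90% pass threshold and all names are assumptions for illustration, not values fixed by the patent:

def verify_annotator(answers, exam_truth, pass_rate=0.9):
    # answers: {case_id: submitted label}; exam_truth: {case_id: truth label}.
    correct = sum(1 for cid, truth in exam_truth.items()
                  if answers.get(cid) == truth)
    return correct / len(exam_truth) >= pass_rate

# Illustrative usage: the target message is issued, and the second target
# raw data are displayed for formal labeling, only when verification passes.
# if verify_annotator(submitted_answers, exam_truth):
#     issue_target_message()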
The embodiment of the application aims to solve the pain points of rule formulation and rule running-in during the data labeling process; to this end, the embodiment of the application generates case annotation rules to replace traditional text annotation rules. Annotators annotate after learning the case annotation rules, which avoids the process of the demander formulating complex annotation rules and holding multiple rounds of communication, and improves overall efficiency.
Specifically, the embodiment of the application can combine contrastive learning and active learning to sort out typical, representative raw data from the unlabeled raw data (the first raw data set), obtaining the first target raw data and the second target raw data. First, the demander labels the selected typical data (the first target raw data) and gives the labeling basis; annotation cases are then generated from these data. The annotation cases can be divided into two classes: first-class annotation cases, used as the annotators' practice questions and corresponding answers, and second-class annotation cases, used as examination questions and corresponding answers. The annotators complete the practice questions and learn to fit the demander's labeling thinking from the annotation cases. Finally, the annotators who meet the requirements are screened out through the examination questions to participate in formal labeling. On the one hand, representative data are screened by machine-learning methods and submitted to the demander for trial labeling to obtain annotation cases, which directly spares the demander from formulating complex annotation rules and from the rule-communication step, greatly improving overall labeling efficiency; the improvement is especially obvious for small samples. On the other hand, the annotation cases can serve as examination questions to screen annotators, selecting those who understand the labeling requirements to participate in the labeling process, which improves the final delivery quality to a certain extent.
In one embodiment, please refer to fig. 5, which shows the complete labeling flow, specifically including the following steps:
Step 1: submit data. The demander first submits unlabeled raw data on a visualization page; this data forms the pool from which the first raw data set can be screened.
Step 2: select typical data. After successful submission, typical raw data can be selected through the contrastive-learning and active-learning algorithms. The screened raw data are divided into first target raw data and second target raw data, and part of the first target raw data can be used to form practice questions and examination questions and to generate a question-answering page.
Step 3: the demander answers the questions, giving annotation answers and annotations, from which the annotation cases are generated.
Step 4: create the annotation task. The annotation platform provides the annotation cases, practice questions, and examination questions; annotators learn on the annotation platform and, after finishing learning, annotate the second target raw data displayed by the platform.
Step 5: acceptance. After annotation is finished, acceptance is performed manually or automatically.
Referring to fig. 6, which shows a visualization result of the annotation platform. Taking a picture-annotation task as an example: given a label for a picture, the annotator needs to determine whether the given label is present in the picture. The example diagram on the left is an annotation case automatically generated by the system after the demander completed typical practice questions and gave standard answers and annotations; on the right are the data to be annotated, which the annotator labels with reference to the example annotation rules on the left.
In order to verify the effectiveness of the embodiments of the present application, two sets of comparative experimental data are provided; the detailed experimental data are as follows:
The following table shows the labeled-data acquisition results for specific scenarios: a "war guidance" demand scenario and a "click guidance data card" demand scenario. (The table is presented as images in the original publication and is not reproduced here.)
As shown in the table, based on a comparison between difficult and simple historical requirements, the embodiment of the application omits the previous rule-formulation and communication steps by generating practice questions, examination questions, and annotation cases. At the same acceptance pass rate, the efficiency of small-sample labeling is obviously improved, and the unit labeling time (per 1000 data items) saves 80%-90% of the time consumed. Through case-rule labeling, overall AI model R&D efficiency is improved and R&D cost is reduced.
Take a text-comprehension labeling requirement as an example: it must be judged whether a comment is a low-quality "water" (filler) comment. The detailed process is as follows:
the model selects 50 practice questions and examination questions;
the demander gives the answers and labeling basis for the practice questions and the examination questions (the labeling basis for the examination questions is optional); the practice questions and examination questions are illustrated in fig. 7 and fig. 8, respectively;
5-10 example annotation rules, i.e., annotation cases, are generated from the answers and basis of the practice and examination questions; an annotation case is illustrated in fig. 9;
annotators complete the exercises with reference to the annotation cases; those who finish the practice questions then take the examination, in which 15-20 questions are randomly drawn from the examination questions, and those who pass may participate in formal task labeling.
In this example, the acceptance accuracy of the completed filler-comment labeling requirement is 94% (the acceptance criterion is 90%), and the total critical time of the whole process is about 4 hours, of which model-side data selection takes 12 minutes in total and labeling takes 120 minutes; the time per demand unit (labeling 1000 data items) is about 18 hours. Compared with the current AI data-labeling tasks of the "data kitchen" platform, counted at P90, i.e., the 90th-quantile historical requirements (without case labeling), whose time per demand unit (labeling 1000 data items) is about 30 hours, the improvement brought by the case-labeling scheme is very obvious. Details are given in the following table:
(The table is presented as images in the original publication and is not reproduced here.)
the following are embodiments of the apparatus of the present application that may be used to perform embodiments of the method of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the embodiments of the method of the present application.
Referring to fig. 10, a block diagram of a labeling apparatus provided in an embodiment of the present application is shown. The apparatus has the function of implementing the labeling method described above; the function can be implemented by hardware or by hardware executing corresponding software. The apparatus can be a computer device, or it can be arranged in a computer device. The apparatus may include:
a first raw data set acquisition module 101, configured to acquire a first raw data set, wherein the raw data in the first raw data set are unlabeled data;
a data screening module 102, configured to determine, in the first raw data set, a plurality of first target raw data and a plurality of second target raw data, wherein the first target raw data and the second target raw data are both data whose degree of containing target information meets a preset requirement, and the target information represents information contained in the first raw data set;
a truth acquisition module 103, configured to acquire truth data corresponding to each of the first target raw data, wherein the first target raw data comprise a data set for learning annotation and a data set for verifying learning effect;
a case generation module 104, configured to generate a plurality of annotation cases according to each of the first target raw data and the corresponding truth data, specifically including according to the data set for learning annotation and its corresponding truth data, and according to the data set for verifying learning effect and its corresponding truth data;
and an annotation module 105, configured to acquire an annotation result for at least one second target raw datum based on the plurality of annotation cases.
In an exemplary embodiment, the annotation module 105 is configured to display the at least one second target raw datum when a target message is obtained, wherein the target message indicates that the learning of the annotation cases has been completed;
and to acquire the annotation result for the at least one second target raw datum in response to detecting an annotation operation on the at least one second target raw datum.
In an exemplary embodiment, the case generation module 104 is configured to:
classify the plurality of first target raw data to obtain a first-class first raw data set and a second-class first raw data set, wherein the first-class first raw data set is a data set for learning annotation, and the second-class first raw data set is a data set for verifying learning effect;
classify the annotation cases generated from the first target raw data in the first-class first raw data set and their corresponding truth data into a first-class annotation case set;
classify the annotation cases generated from the first target raw data in the second-class first raw data set and their corresponding truth data into a second-class annotation case set;
and acquire an annotation result for at least one second target raw datum based on the first-class annotation case set and the second-class annotation case set.
In an exemplary embodiment, the annotation module 105 is configured to:
display the first-class annotation case set;
in response to acquiring a first message, display at least one first target raw datum in the second-class first raw data set;
in response to acquiring a to-be-verified annotation result for the at least one first target raw datum in the second-class first raw data set, verify the to-be-verified annotation result against the second-class annotation cases;
and acquire the target message when the verification passes.
In an exemplary embodiment, the degree to which any one of the first target raw data contains the target information is higher than the degree to which any one of the second target raw data contains the target information.
In an exemplary embodiment, the first raw data set acquisition module 101 is configured to:
acquire a second raw data set, wherein the raw data in the second raw data set are original unlabeled data;
perform feature extraction on each piece of raw data in the second raw data set to obtain corresponding feature information;
and determine the first raw data set from the second raw data set based on each piece of feature information.
In an exemplary embodiment, the first raw data set acquisition module 101 is configured to:
obtain a first model through contrastive-learning training and a second model through active-learning training, wherein the first model is used to perform feature extraction on each piece of raw data in the second raw data set, and the second model is used to determine the first raw data set within the second raw data set according to each piece of feature information.
It should be noted that, when the apparatus provided in the foregoing embodiment implements the functions thereof, only the division of the functional modules is illustrated, and in practical applications, the functions may be distributed by different functional modules according to needs, that is, the internal structure of the apparatus may be divided into different functional modules to implement all or part of the functions described above. In addition, the apparatus and method embodiments provided by the above embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments for details, which are not described herein again.
Referring to fig. 11, a block diagram of a computer device according to an embodiment of the present application is shown. The computer device may be a server for performing the above-mentioned annotation method. Specifically, the method comprises the following steps:
the computer device 1600 includes a Central Processing Unit (CPU) 1601, a system Memory 1604 including a Random Access Memory (RAM) 1602 and a Read Only Memory (ROM) 1603, and a system bus 1605 connecting the system Memory 1604 and the CPU 1601. Computer device 1600 also includes a basic Input/Output system (I/O) 1606, which facilitates transfer of information between devices within the computer, and a mass storage device 1607 for storing an operating system 1613, application programs 1614, and other program modules 1615.
The basic input/output system 1606 includes a display 1608 for displaying information and an input device 1609, such as a mouse or keyboard, for the user to input information. The display 1608 and the input device 1609 are both connected to the central processing unit 1601 through an input/output controller 1610 that is connected to the system bus 1605. The basic input/output system 1606 may also include the input/output controller 1610 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, the input/output controller 1610 may also provide output to a display screen, a printer, or another type of output device.
The mass storage device 1607 is connected to the central processing unit 1601 by a mass storage controller (not shown) connected to the system bus 1605. Mass storage device 1607 and its associated computer-readable media provide non-volatile storage for computer device 1600. That is, the mass storage device 1607 may include a computer-readable medium (not shown) such as a hard disk or CD-ROM (Compact disk Read-Only Memory) drive.
Without loss of generality, computer readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, EPROM (Erasable Programmable Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash Memory or other solid state Memory technology, CD-ROM, DVD (Digital Video Disc) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will appreciate that computer storage media is not limited to the foregoing. The system memory 1604 and mass storage device 1607 described above may be collectively referred to as memory.
According to various embodiments of the application, the computer device 1600 may also operate by connecting, through a network such as the Internet, to a remote computer on the network. That is, the computer device 1600 may be connected to the network 1612 through the network interface unit 1611 coupled to the system bus 1605, or the network interface unit 1611 may be used to connect to other types of networks or remote computer systems (not shown).
The memory also stores a computer program configured to be executed by one or more processors to implement the above-mentioned labeling method.
In an exemplary embodiment, a computer-readable storage medium is also provided, in which at least one instruction, at least one program, a set of codes, or a set of instructions is stored, which, when executed by a processor, implements the above-mentioned labeling method.
Optionally, the computer-readable storage medium may include: a ROM (Read-Only Memory), a RAM (Random Access Memory), an SSD (Solid State Drive), or an optical disc. The Random Access Memory may include a ReRAM (Resistive Random Access Memory) and a DRAM (Dynamic Random Access Memory).
In an exemplary embodiment, a computer program product or computer program is also provided; the computer program product or computer program comprises computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the labeling method.
The labeling method at least comprises the following steps:
acquiring a first raw data set, wherein the raw data in the first raw data set are unlabeled data;
determining a plurality of first target raw data and a plurality of second target raw data in the first raw data set, wherein both the first target raw data and the second target raw data contain target information to a degree that meets a preset requirement, and the target information represents information in the first raw data set;
acquiring truth-value data corresponding to each of the first target raw data, wherein the plurality of first target raw data comprise a data set for learning annotation and a data set for verifying the learning effect;
generating a plurality of annotation cases according to each first target raw data and the corresponding truth-value data, that is, according to the data set for learning annotation and its truth-value data, and according to the data set for verifying the learning effect and its truth-value data;
and acquiring an annotation result for at least one second target raw data based on the plurality of annotation cases.
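By way of illustration only, this flow can be sketched in a few lines of Python. Everything in the sketch — the Sample class, the numeric "degree" score, and the thresholds — is a hypothetical stand-in chosen for readability; the method itself does not prescribe any concrete representation.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Sample:
    sample_id: int
    score: float                 # degree to which the target information is contained
    truth: Optional[str] = None  # truth-value data, known only for first target raw data

def split_targets(raw_data, hi_threshold=0.9, lo_threshold=0.6):
    # First target raw data contain the target information to a higher
    # degree than second target raw data (see the embodiment on degrees below).
    first = [s for s in raw_data if s.score >= hi_threshold]
    second = [s for s in raw_data if lo_threshold <= s.score < hi_threshold]
    return first, second

def make_annotation_cases(first_targets):
    # An annotation case pairs a first target raw datum with its truth-value data.
    return [(s, s.truth) for s in first_targets]

raw = [Sample(i, i / 10, truth="cat" if i >= 9 else None) for i in range(11)]
first, second = split_targets(raw)
cases = make_annotation_cases(first)
print(len(cases), "annotation cases;", len(second), "samples left to annotate")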
In an embodiment, the acquiring an annotation result for at least one second target raw data based on the plurality of annotation cases includes:
displaying the at least one second target raw data when a target message is acquired, wherein the target message indicates that learning of the annotation cases is finished;
and acquiring the annotation result for the at least one second target raw data in response to detecting an annotation operation on the at least one second target raw data.
In an embodiment, before the generating a plurality of annotation cases according to each first target raw data and the corresponding truth-value data, the method includes:
classifying the plurality of first target raw data to obtain a first-type first raw data set and a second-type first raw data set, wherein the first-type first raw data set is the data set for learning annotation, and the second-type first raw data set is the data set for verifying the learning effect;
after the generating a plurality of annotation cases according to each first target raw data and the corresponding truth-value data, the method further includes:
classifying the annotation cases generated from the first target raw data in the first-type first raw data set and the corresponding truth-value data into a first-type annotation case set;
and classifying the annotation cases generated from the first target raw data in the second-type first raw data set and the corresponding truth-value data into a second-type annotation case set;
the acquiring an annotation result for at least one second target raw data based on the plurality of annotation cases then includes:
acquiring the annotation result for the at least one second target raw data based on the first-type annotation case set and the second-type annotation case set.
In an embodiment, before the displaying the at least one second target raw data when the target message is acquired, the method includes:
displaying the first-type annotation case set;
displaying at least one first target raw data in the second-type first raw data set in response to acquiring a first message;
verifying a to-be-verified annotation result against the second-type annotation case set in response to acquiring the to-be-verified annotation result for the at least one first target raw data in the second-type first raw data set;
and acquiring the target message when the verification passes.
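Continuing the sketch above, this learn-then-verify flow might look as follows; the pass ratio and the annotate_fn callback (standing in for the human annotator's input) are assumptions made purely for illustration.

def verify_learning(learning_cases, verification_cases, annotate_fn, pass_ratio=0.8):
    # Stage 1: display the first-type annotation case set so the annotator
    # can learn from data paired with its truth-value data.
    for sample, truth in learning_cases:
        print(f"learning case: sample {sample.sample_id} -> {truth}")
    # Stage 2: once the first message arrives, the annotator labels the
    # verification samples; the to-be-verified results are compared with
    # the second-type annotation case set.
    correct = sum(1 for sample, truth in verification_cases
                  if annotate_fn(sample) == truth)
    # The target message is issued only when the verification passes.
    return correct >= pass_ratio * len(verification_cases)

learn_set, verify_set = cases[:1], cases[1:]
print("target message:", verify_learning(learn_set, verify_set, lambda s: "cat"))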
In one embodiment, the degree to which any first target raw data contains the target information is higher than the degree to which any second target raw data contains the target information.
In an embodiment, the acquiring the first raw data set includes:
acquiring a second raw data set, wherein the raw data in the second raw data set are original unlabeled data;
performing feature extraction on each raw data in the second raw data set to obtain corresponding feature information;
and determining the first raw data set in the second raw data set according to each piece of feature information.
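A toy version of this two-step selection follows, assuming numeric samples, a random-projection feature extractor in place of the trained model, and a distance-from-mean selection rule — all three are assumptions of the sketch, not of the embodiment.

import numpy as np

def extract_features(second_raw_data_set, dim=8, seed=0):
    # Stand-in feature extractor; in the embodiment this role is played
    # by a trained model.
    rng = np.random.default_rng(seed)
    projection = rng.normal(size=(second_raw_data_set.shape[1], dim))
    return second_raw_data_set @ projection

def determine_first_raw_data_set(features, k=10):
    # Stand-in selection rule: keep the k samples farthest from the mean
    # embedding, a crude proxy for containing the target information to a
    # degree that meets the preset requirement.
    distances = np.linalg.norm(features - features.mean(axis=0), axis=1)
    return np.argsort(distances)[-k:]

second_set = np.random.rand(100, 32)   # original unlabeled second raw data set
chosen = determine_first_raw_data_set(extract_features(second_set))
print("indices selected for the first raw data set:", chosen)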
In one embodiment, the method further comprises:
obtaining a first model through contrastive learning training and a second model through active learning training, wherein the first model is used for performing feature extraction on each raw data in the second raw data set, and the second model is used for determining the first raw data set in the second raw data set according to each piece of feature information.
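The embodiment does not fix a particular loss or selection criterion. As one common pairing, the first model could be trained with an InfoNCE-style contrastive objective and the second model could select samples by least confidence; the sketch below shows both, again with NumPy and purely illustrative shapes.

import numpy as np

def info_nce_loss(anchors, positives, temperature=0.1):
    # A widely used contrastive objective: matching pairs sit on the
    # diagonal of the similarity matrix and act as the positive class.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (a @ p.T) / temperature
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

def least_confidence_selection(class_probs, k=5):
    # A standard active-learning criterion: pick the k samples whose top
    # predicted probability is lowest, i.e. where the model is least sure.
    confidence = class_probs.max(axis=1)
    return np.argsort(confidence)[:k]

embeddings = np.random.rand(16, 8)
print("contrastive loss:", info_nce_loss(embeddings, embeddings + 0.01))
probs = np.random.dirichlet(np.ones(3), size=20)
print("least-confident samples:", least_confidence_selection(probs))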
It should be understood that reference herein to "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. In addition, the step numbers described herein merely show, by way of example, one possible execution order of the steps; in some other embodiments, the steps may also be executed out of the numbered order, for example, two steps with different numbers may be executed simultaneously, or in an order opposite to that shown in the figures, which is not limited by the embodiments of the present application.
In addition, the embodiments of the present application involve data related to content consumption object information and the like. When the above embodiments are applied to specific products or technologies, the permission or consent of the content consumption object needs to be obtained, and the collection, use, and processing of the related data need to comply with the relevant laws, regulations, and standards of the relevant countries and regions.
The above description is only exemplary of the present application and is not intended to limit it; any modifications, equivalent replacements, improvements, and the like made within the spirit and principles of the present application shall fall within the protection scope of the present application.

Claims (11)

1. A method of labeling, the method comprising:
acquiring a first raw data set, wherein the raw data in the first raw data set are unlabeled data;
determining a plurality of first target raw data and a plurality of second target raw data in the first raw data set, wherein both the first target raw data and the second target raw data contain target information to a degree that meets a preset requirement, and the target information represents information in the first raw data set;
acquiring truth-value data corresponding to each of the plurality of first target raw data, wherein the plurality of first target raw data comprise a data set for learning annotation and a data set for verifying the learning effect;
generating a plurality of annotation cases according to each first target raw data and the corresponding truth-value data, that is, according to the data set for learning annotation and its truth-value data, and according to the data set for verifying the learning effect and its truth-value data;
and acquiring an annotation result for at least one second target raw data based on the plurality of annotation cases.
2. The method of claim 1, wherein the acquiring an annotation result for at least one second target raw data based on the plurality of annotation cases comprises:
displaying the at least one second target raw data when a target message is acquired, wherein the target message indicates that learning of the annotation cases is finished;
and acquiring the annotation result for the at least one second target raw data in response to detecting an annotation operation on the at least one second target raw data.
3. The method of claim 2, wherein before the generating a plurality of annotation cases according to each first target raw data and the corresponding truth-value data, the method comprises:
classifying the plurality of first target raw data to obtain a first-type first raw data set and a second-type first raw data set, wherein the first-type first raw data set is the data set for learning annotation, and the second-type first raw data set is the data set for verifying the learning effect;
classifying the annotation cases generated from the first target raw data in the first-type first raw data set and the corresponding truth-value data into a first-type annotation case set;
and classifying the annotation cases generated from the first target raw data in the second-type first raw data set and the corresponding truth-value data into a second-type annotation case set;
and wherein the acquiring an annotation result for at least one second target raw data based on the plurality of annotation cases comprises:
acquiring the annotation result for the at least one second target raw data based on the first-type annotation case set and the second-type annotation case set.
4. The method according to claim 3, wherein before the displaying the at least one second target raw data when the target message is acquired, the method comprises:
displaying the first-type annotation case set;
displaying at least one first target raw data in the second-type first raw data set in response to acquiring a first message;
verifying a to-be-verified annotation result against the second-type annotation case set in response to acquiring the to-be-verified annotation result for the at least one first target raw data in the second-type first raw data set;
and acquiring the target message when the verification passes.
5. The method of claim 1, wherein:
the degree to which any first target raw data contains the target information is higher than the degree to which any second target raw data contains the target information.
6. The method of any of claims 1 to 5, wherein the acquiring the first raw data set comprises:
acquiring a second raw data set, wherein the raw data in the second raw data set are original unlabeled data;
performing feature extraction on each raw data in the second raw data set to obtain corresponding feature information;
and determining the first raw data set in the second raw data set according to each piece of feature information.
7. The method of claim 6, further comprising:
obtaining a first model through contrastive learning training, and obtaining a second model through active learning training, wherein the first model is used for performing feature extraction on each raw data in the second raw data set, and the second model is used for determining the first raw data set in the second raw data set according to each piece of feature information.
8. A labeling device, the device comprising:
a first raw data set acquisition module, configured to acquire a first raw data set, wherein the raw data in the first raw data set are unlabeled data;
a data screening module, configured to determine a plurality of first target raw data and a plurality of second target raw data in the first raw data set, wherein both the first target raw data and the second target raw data contain target information to a degree that meets a preset requirement, and the target information represents information in the first raw data set;
a truth-value acquisition module, configured to acquire truth-value data corresponding to each of the plurality of first target raw data, wherein the plurality of first target raw data comprise a data set for learning annotation and a data set for verifying the learning effect;
a case generation module, configured to generate a plurality of annotation cases according to each first target raw data and the corresponding truth-value data, that is, according to the data set for learning annotation and its truth-value data, and according to the data set for verifying the learning effect and its truth-value data;
and an annotation module, configured to acquire an annotation result for at least one second target raw data based on the plurality of annotation cases.
9. A computer device comprising a processor and a memory, wherein the memory stores at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by the processor to implement the labeling method of any one of claims 1 to 7.
10. A computer-readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by a processor to implement the labeling method of any one of claims 1 to 7.
11. A computer program product comprising computer instructions stored in a computer-readable storage medium, wherein a processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the labeling method of any one of claims 1 to 7.
CN202210713931.0A 2022-06-22 2022-06-22 Labeling method, device, equipment, storage medium and program product Pending CN115146716A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210713931.0A CN115146716A (en) 2022-06-22 2022-06-22 Labeling method, device, equipment, storage medium and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210713931.0A CN115146716A (en) 2022-06-22 2022-06-22 Labeling method, device, equipment, storage medium and program product

Publications (1)

Publication Number Publication Date
CN115146716A true CN115146716A (en) 2022-10-04

Family

ID=83409249

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210713931.0A Pending CN115146716A (en) 2022-06-22 2022-06-22 Labeling method, device, equipment, storage medium and program product

Country Status (1)

Country Link
CN (1) CN115146716A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220121884A1 (en) * 2011-09-24 2022-04-21 Z Advanced Computing, Inc. System and Method for Extremely Efficient Image and Pattern Recognition and Artificial Intelligence Platform
US20210089896A1 (en) * 2019-08-19 2021-03-25 Savitude, Inc. Automated Image Processing System for Garment Targeting and Generation
JP6832410B1 (en) * 2019-11-11 2021-02-24 株式会社Z会 Learning effect estimation device, learning effect estimation method, program
CN112530409A (en) * 2020-12-01 2021-03-19 平安科技(深圳)有限公司 Voice sample screening method and device based on geometry and computer equipment
CN112633002A (en) * 2020-12-29 2021-04-09 上海明略人工智能(集团)有限公司 Sample labeling method, model training method, named entity recognition method and device
CN112817544A (en) * 2021-03-05 2021-05-18 北京星网锐捷网络技术有限公司 Data processing method, storage system and storage device
CN113314205A (en) * 2021-05-28 2021-08-27 北京航空航天大学 Efficient medical image labeling and learning system
CN113469291A (en) * 2021-09-01 2021-10-01 平安科技(深圳)有限公司 Data processing method and device, electronic equipment and storage medium
CN113835639A (en) * 2021-09-26 2021-12-24 深圳大普微电子科技有限公司 I/O request processing method, device, equipment and readable storage medium

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
"An active learning framework for content-based information retrieval", 《IEEE TRANSACTIONS ON MULTIMEDIA》, vol. 4, 30 June 2002 (2002-06-30), pages 260 - 268 *
SIQI ZHAO ET AL.: "A Semi-supervised Deep Learning Method for Cervical Cell Classification", 《ANALYTICAL CELLULAR PATHOLOGY》, 27 February 2022 (2022-02-27) *
刘志秀: "基于密度聚类的主动学习方法与应用研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》, 15 March 2022 (2022-03-15) *
聂嘉贺: "基于主动学习的文本分类系统设计与实现", 《中国优秀硕士学位论文全文数据库 信息科技辑》, 15 August 2019 (2019-08-15) *

Similar Documents

Publication Publication Date Title
CN112949786B (en) Data classification identification method, device, equipment and readable storage medium
CN112131350B (en) Text label determining method, device, terminal and readable storage medium
WO2022105115A1 (en) Question and answer pair matching method and apparatus, electronic device and storage medium
CN112015859A (en) Text knowledge hierarchy extraction method and device, computer equipment and readable medium
CN111738001B (en) Training method of synonym recognition model, synonym determination method and equipment
CN111026861B (en) Text abstract generation method, training device, training equipment and medium
CN116935169B (en) Training method for draft graph model and draft graph method
CN113139628B (en) Sample image identification method, device and equipment and readable storage medium
WO2021218028A1 (en) Artificial intelligence-based interview content refining method, apparatus and device, and medium
US20230058194A1 (en) Text classification method and apparatus, device, and computer-readable storage medium
Huang et al. Expert as a service: Software expert recommendation via knowledge domain embeddings in stack overflow
CN113821668A (en) Data classification identification method, device, equipment and readable storage medium
CN113111159A (en) Question and answer record generation method and device, electronic equipment and storage medium
CN110457523B (en) Cover picture selection method, model training method, device and medium
CN114519397A (en) Entity link model training method, device and equipment based on comparative learning
CN114416929A (en) Sample generation method, device, equipment and storage medium of entity recall model
CN116402166B (en) Training method and device of prediction model, electronic equipment and storage medium
CN113505786A (en) Test question photographing and judging method and device and electronic equipment
Shen et al. A Multimodal Approach to Assessing Document Quality.
Tannert et al. FlowchartQA: the first large-scale benchmark for reasoning over flowcharts
CN115146716A (en) Labeling method, device, equipment, storage medium and program product
CN114625960A (en) On-line evaluation method and device, electronic equipment and storage medium
CN114612246A (en) Object set identification method and device, computer equipment and storage medium
Farrelly et al. Current Topological and Machine Learning Applications for Bias Detection in Text
CN111582404A (en) Content classification method and device and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40077110

Country of ref document: HK