CN111340054A - Data labeling method and device and data processing equipment


Info

Publication number
CN111340054A
CN111340054A
Authority
CN
China
Prior art keywords
data
labeled
classification model
result
classification
Prior art date
Legal status
Pending
Application number
CN201811549912.9A
Other languages
Chinese (zh)
Inventor
冯浩
徐江
王鹏
Current Assignee
Beijing Didi Infinity Technology and Development Co Ltd
Original Assignee
Beijing Didi Infinity Technology and Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Didi Infinity Technology and Development Co Ltd filed Critical Beijing Didi Infinity Technology and Development Co Ltd
Priority to CN201811549912.9A priority Critical patent/CN111340054A/en
Publication of CN111340054A publication Critical patent/CN111340054A/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques

Abstract

The application provides a data labeling method, a data labeling device, and a data processing apparatus. The method includes: performing at least one round of iterative processing on a classification model until the accuracy of the classification model meets a preset condition; and processing at least part of the data to be labeled with the resulting classification model to obtain an automatic labeling result. Each iteration includes: inputting the data to be labeled, other than that already in a target data set, into the classification model to obtain classification results; selecting from these data at least some items whose classification confidence falls within a preset range and adding them to the target data set; and training the classification model according to the manual labeling results of the data to be labeled in the target data set. In this way, automatic labeling of batch data can be achieved while improving the quality of data labeling.

Description

Data labeling method and device and data processing equipment
Technical Field
The present application relates to the field of data processing technologies, and in particular to a data annotation method, a data annotation apparatus, and a data processing device.
Background
With the development of computer technology, machine learning algorithms have become widely used, and supervised learning is among the most common approaches. A supervised learning algorithm usually requires a large amount of labeled data to train a pre-established recognition model, and the quantity and accuracy of that labeled data directly affect the accuracy of the trained recognition model.
At present, labeled data is obtained mainly by adding labels manually, which is inefficient and error-prone, so the accuracy of the finally trained model is low.
Disclosure of Invention
In view of this, an object of the embodiments of the present application is to provide a data annotation method, a data annotation apparatus, and a data processing device that can automatically annotate batch data while improving annotation accuracy.
According to an aspect of the present application, there is provided a data annotation method, the method comprising:
performing at least one round of iterative processing on a preset classification model until the accuracy of the classification model meets a preset condition, so as to obtain a trained classification model;
processing at least one part of the multiple pieces of data to be labeled by adopting the trained classification model to obtain an automatic labeling result;
wherein each iteration process comprises:
inputting other data to be labeled except for a target data set in the multiple data to be labeled into the classification model respectively to obtain respective classification results of the other data to be labeled; according to the confidence degree of the classification result, at least part of data to be labeled with the confidence degree within a preset range is selected from the other data to be labeled and added into the target data set; and training the classification model according to the manual labeling result of the data to be labeled in the target data set.
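The iteration described above is a standard active-learning loop with uncertainty sampling. A minimal, self-contained Python sketch follows; the `ToyModel` class and all names are hypothetical stand-ins for illustration, not part of the patent:

```python
class ToyModel:
    """Hypothetical stand-in for the classification model: each sample's
    confidence is simply a stored score, and fit() records the training set."""
    def __init__(self, scores):
        self.scores = scores        # sample -> confidence of the predicted label
        self.trained_on = []
    def predict(self, sample):
        return "some_label", self.scores[sample]
    def fit(self, samples):
        self.trained_on = list(samples)   # stands in for real training

def iterate_once(model, data, target_set, conf_range, k):
    """One round: classify the data outside the target set, move up to k
    samples whose confidence lies in conf_range (least confident first)
    into the target set, then retrain on the manually labeled target set."""
    low, high = conf_range
    remaining = [d for d in data if d not in target_set]
    eligible = sorted(
        (model.predict(d)[1], d) for d in remaining
        if low <= model.predict(d)[1] <= high
    )
    target_set.extend(d for _, d in eligible[:k])
    model.fit(target_set)   # in practice: train on the human labels for target_set
    return target_set
```

Repeating `iterate_once` until a held-out accuracy threshold is reached corresponds to the "preset condition" on model accuracy in the text.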
In a possible embodiment, the classification result includes a category label and a confidence of the category label;
selecting at least part of data to be labeled with the confidence degree within a preset range from the other data to be labeled according to the confidence degree of the classification result, and adding the data to be labeled into the target data set, wherein the method comprises the following steps:
and selecting data to be labeled with the confidence level lower than a preset threshold value from the other data to be labeled, and adding at least part of the selected data to be labeled into the target data set.
In a possible implementation, adding at least part of the selected data to be labeled to the target dataset includes:
sorting the selected data to be labeled according to the confidence degree of the classification result;
and according to the sorting result, sequentially selecting a preset number of pieces of data to be marked from the selected data to be marked according to the sequence of the confidence degrees from small to large, and adding the data to be marked into the target data set.
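This lowest-confidence-first selection can be sketched as follows (all names are illustrative, not from the patent):

```python
def pick_least_confident(candidates, n):
    """candidates: list of (sample, confidence) pairs below the threshold.
    Returns n samples ordered from smallest confidence upward, so the items
    the model is least sure about are sent for manual labeling first."""
    ranked = sorted(candidates, key=lambda pair: pair[1])
    return [sample for sample, _ in ranked[:n]]
```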
In a possible embodiment, the classification result includes a plurality of class labels and the confidence of each class label, and the sum of the confidence of the class labels is 1;
selecting at least part of data to be labeled with the confidence degree within a preset range from the other data to be labeled according to the confidence degree of the classification result, and adding the data to be labeled into the target data set, wherein the method comprises the following steps:
selecting data to be labeled with a preset classification result from the other data to be labeled, and adding at least part of the selected data to be labeled to the target data set, wherein the preset classification result is a classification result in which the confidence of at least one class label is between 40% and 60%.
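For a multi-class result whose per-class confidences sum to 1, the 40%-60% band singles out samples the model finds ambiguous. A hedged sketch (function name and defaults are illustrative):

```python
def is_ambiguous(confidences, low=0.40, high=0.60):
    """True when at least one class probability falls in [low, high].
    With probabilities summing to 1, such samples sit near a decision
    boundary and are good candidates for manual labeling."""
    assert abs(sum(confidences) - 1.0) < 1e-9, "probabilities must sum to 1"
    return any(low <= c <= high for c in confidences)
```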
In a possible implementation, adding at least part of the selected data to be labeled to the target dataset includes:
randomly selecting a preset number of pieces of data to be marked from the selected data to be marked, and adding the data to be marked into the target data set.
In a possible implementation, each iteration process further includes:
and acquiring the artificial labeling result of at least part of the data to be labeled before training the classification model according to the artificial labeling result of the data to be labeled in the target data set.
In a possible implementation manner, the obtaining of the manual annotation result of the at least part of the data to be annotated includes:
aiming at each data to be marked in at least part of the data to be marked, acquiring a plurality of labels of the data to be marked, which are input by different users;
and selecting the label with the largest occurrence frequency from the plurality of labels, and adding the label to the data to be labeled to obtain the artificial labeling result of the data to be labeled.
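Taking the most frequent label across annotators is a simple majority vote; a sketch (ties resolve to the label seen first, which is how `Counter.most_common` orders equal counts):

```python
from collections import Counter

def majority_label(labels):
    """Return the label entered most often by the different annotators."""
    return Counter(labels).most_common(1)[0][0]
```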
In one possible embodiment, the method further comprises:
storing the manual labeling result of at least part of the data to be labeled into a search engine supporting a visualization tool;
and saving the automatic labeling result into the search engine.
In a possible implementation, before performing the iterative processing on the pre-established classification model for the first time, the method further includes:
determining an empty set as the target data set; or,
and selecting a part of the plurality of data to be labeled as a target data set, and training a pre-established classification model according to an artificial labeling result of the data to be labeled in the target data set to obtain the preset classification model.
In a possible implementation, each iteration process further includes:
after training the classification model according to the manual labeling result of the data to be labeled in the target data set, testing the classification model through a preset test set to obtain test accuracy;
and if the test accuracy meets the preset condition, taking the classification model as the trained classification model.
In a possible implementation manner, processing at least a part of the plurality of pieces of data to be labeled by using the trained classification model includes:
processing each piece of data to be labeled by adopting the trained classification model; or,
and processing other data to be labeled except the target data set in the multiple pieces of data to be labeled by adopting the trained classification model.
According to another aspect of the present application, there is provided a data annotation apparatus, the apparatus comprising:
the training module is used for performing at least one round of iterative processing on a preset classification model until the accuracy of the classification model meets a preset condition, to obtain a trained classification model;
the automatic labeling module is used for processing at least one part of the data to be labeled by adopting the trained classification model to obtain an automatic labeling result;
wherein each iteration process comprises:
inputting other data to be labeled except for a target data set in a plurality of data to be labeled into the classification model respectively to obtain respective classification results of the other data to be labeled; according to the confidence degree of the classification result, at least part of data to be labeled with the confidence degree within a preset range is selected from the other data to be labeled and added into the target data set; and training the classification model according to the manual labeling result of the data to be labeled in the target data set.
In a possible embodiment, the classification result includes a category label and a confidence of the category label;
the training module selects at least part of data to be labeled with the confidence degree within a preset range from the other data to be labeled according to the confidence degree of the classification result, and adds the selected data to the target data set in a mode that:
and selecting data to be labeled with the confidence level lower than a preset threshold value from the other data to be labeled, and adding at least part of the selected data to be labeled into the target data set.
In one possible embodiment, the training module adds at least part of the selected data to be labeled to the target dataset by:
sorting the selected data to be labeled according to the confidence degree of the classification result;
and according to the sorting result, sequentially selecting a preset number of pieces of data to be marked from the selected data to be marked according to the sequence of the confidence degrees from small to large, and adding the data to be marked into the target data set.
In a possible embodiment, the classification result includes a plurality of class labels and the confidence of each class label, and the sum of the confidence of the class labels is 1;
the training module selects at least part of data to be labeled with the confidence degree within a preset range from the other data to be labeled according to the confidence degree of the classification result, and adds the selected data to the target data set in a mode that:
selecting data to be labeled with a preset classification result from the other data to be labeled, and adding at least part of the selected data to be labeled to the target data set, wherein the preset classification result is a classification result in which the confidence of at least one class label is between 40% and 60%.
In one possible embodiment, the training module adds at least part of the selected data to be labeled to the target dataset by:
randomly selecting a preset number of pieces of data to be marked from the selected data to be marked, and adding the data to be marked into the target data set.
In a possible implementation manner, the training module is further configured to, during each iteration, obtain a manual labeling result of at least part of the data to be labeled before training the classification model according to the manual labeling result of the data to be labeled in the target data set.
In a possible implementation manner, the training module obtains the manual annotation result of the at least part of the data to be annotated by:
aiming at each data to be marked in at least part of the data to be marked, acquiring a plurality of labels of the data to be marked, which are input by different users;
and selecting the label with the largest occurrence frequency from the plurality of labels, and adding the label to the data to be labeled to obtain the artificial labeling result of the data to be labeled.
In a possible embodiment, the apparatus further comprises:
the data storage module is used for storing the manual labeling result of at least part of the data to be labeled into a search engine supporting a visualization tool; and saving the automatic labeling result to the search engine.
In a possible embodiment, the apparatus further comprises:
a pre-training module for determining an empty set as the target data set before running the training module; or selecting a part of the plurality of pieces of data to be labeled as a target data set, and training a pre-established classification model according to an artificial labeling result of the data to be labeled in the target data set to obtain the preset classification model.
In a possible implementation manner, the training module is further configured to, during each iteration process, train the classification model according to an artificial labeling result of data to be labeled in the target data set, and then test the classification model through a preset test set to obtain a test accuracy; and if the test accuracy meets the preset condition, taking the classification model as the trained classification model.
In a possible implementation manner, the automatic labeling module is specifically configured to process each piece of data to be labeled by using the trained classification model; or processing other data to be labeled except the target data set in the multiple pieces of data to be labeled by adopting the trained classification model.
According to another aspect of the present application, there is provided a data processing device comprising a processor, a storage medium, and a bus. The storage medium stores machine-readable instructions executable by the processor. When the data processing device runs, the processor communicates with the storage medium through the bus, and the processor executes the machine-readable instructions to perform the steps of the data labeling method described above.
According to another aspect of the present application, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the above-described data annotation method.
Based on any one of the above aspects, the data labeling method, the data labeling device, and the data processing device provided in the embodiments of the present application perform the following steps at least once on the classification model, so that its accuracy meets the preset condition: inputting the data to be labeled other than that in the target data set into the classification model to obtain classification results; selecting, from those data, at least part of the data to be labeled whose classification confidence falls within a preset range and adding it to the target data set; and training the classification model according to the manual labeling results of the data to be labeled in the target data set. The trained classification model then processes at least part of the data to be labeled to obtain an automatic labeling result. Through this design, automatic labeling of batch data can be achieved while improving the quality of data labeling.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and therefore should not be considered limiting of the scope; those of ordinary skill in the art can obtain other related drawings from these drawings without inventive effort.
Fig. 1 is a schematic view of an application scenario of a data processing device according to an embodiment of the present application;
fig. 2 is a schematic hardware structure diagram of a data processing device according to an embodiment of the present application;
fig. 3 is a schematic flowchart of a data annotation method according to an embodiment of the present application;
FIG. 4 is a flow chart of an iterative process provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of a pre-training step provided in an embodiment of the present application;
FIG. 6 is a schematic diagram illustrating an effect of active learning and random labeling according to an embodiment of the present application;
fig. 7 is a schematic flowchart of a process for obtaining a manual annotation result according to an embodiment of the present application;
FIG. 8 is a table of comparison accuracy between active learning and random labeling provided in an embodiment of the present application;
fig. 9 is a block diagram of a data annotation device according to an embodiment of the present application.
Reference numerals: 100 - data processing device; 110 - data annotation apparatus; 111 - training module; 112 - automatic labeling module; 113 - data storage module; 114 - pre-training module; 120 - storage medium; 130 - processor; 140 - system bus; 150 - network port; 160 - I/O interface; 200 - data providing device; 300 - data storage device; 310 - database; 400 - network.
Detailed Description
In order to make the purpose, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it should be understood that the drawings in the present application are for illustrative and descriptive purposes only and are not used to limit the scope of protection of the application. Also, it should be noted that the schematic drawings are not drawn to scale. The flowcharts used in this application illustrate the operations implemented by some embodiments of the present application. It should be understood that the operations of the flow diagrams may be performed out of order, and steps without logical context may be performed in reverse order or simultaneously. One skilled in the art, under the guidance of this application, may add one or more other operations to, or remove one or more operations from, the flowchart.
In addition, the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.
The solutions provided in the present application are set forth in connection with a specific application scenario, an "intelligent customer service system", in order to enable those skilled in the art to use the present application. It should be understood that the intelligent customer service system described herein may be the customer service system of any platform, such as a ride-hailing platform, a courier platform, an online transportation platform, or a service platform for buyer-seller transactions; the present embodiment is not limited in this respect. It will be apparent to those skilled in the art that the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the application. Although the embodiments of the present application are described primarily in the context of an intelligent customer service system, it should be understood that this is merely one exemplary embodiment. The application can be applied to any other scenario that requires a supervised learning algorithm, such as face recognition systems or information recommendation systems.
In the intelligent customer service system, users raise various kinds of questions, such as service inquiries, complaints, and order information changes. Only if the kind of question posed by the user is accurately recognized can an answer satisfying the user's needs be given. Currently, a supervised learning algorithm is usually adopted to train a recognition model to recognize user questions. This requires retrieving a large number of user questions from the intelligent customer service system and adding an accurate category label (i.e., the category to which the question belongs) to each question, thereby obtaining training data for the recognition model. A large amount of training data is required to obtain a high-precision recognition model.
In some embodiments, the above-described operation of adding category labels is typically performed manually. When the data volume to be labeled is large, a large amount of manpower and material resources are consumed, and the accuracy of the labeling result is difficult to ensure. Therefore, the present embodiment provides a data annotation method and device based on active learning, and the scheme provided by the present embodiment will be described in detail below.
Referring to fig. 1, in an application scenario of the present embodiment, a data processing apparatus 100 is provided, where the data processing apparatus 100 may communicate with a data providing apparatus 200 and a data storage apparatus 300 through a network 400 to obtain data to be annotated from the data providing apparatus 200, and store an annotation result of the data to be annotated in the data storage apparatus 300. The data providing device 200 may be any server device providing intelligent customer service, and is capable of providing data to be annotated, such as user question information.
The data storage device 300 may be any electronic device having a storage function. In one example, the data storage device 300 may be a server running a database 310. In another example, the database 310 running on the data storage device 300 may be replaced with a search engine supporting a visualization tool, which may be, for example, an ElasticSearch. The ElasticSearch is a lightweight search engine, can quickly search out required data by customizing search rules, and can visually display the searched data. Based on this, the user can search for a specific tagged result by configuring the search condition of the ElasticSearch. For example, in some application scenarios, when data annotation is performed, all possible category labels cannot be provided, so that there is a deviation in the annotation result, and a part of the annotated data may need to be re-annotated in a subsequent process. In the related art, since the specific data with the deviation cannot be determined, all the labeled data are usually re-labeled, which is costly. By the data storage device 300, the newly added tag, the tag associated with the newly added tag, or the keyword associated with the newly added tag can be used as a search condition to search out the labeled data that needs to be re-labeled.
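A re-labeling search of the kind described could be expressed in the Elasticsearch query DSL, sent to the `_search` endpoint. The index and field names below (`annotations`, `label`, `text`) and the search terms are hypothetical, invented for illustration:

```json
{
  "query": {
    "bool": {
      "should": [
        { "match": { "label": "new_tag" } },
        { "match": { "text": "some keyword tied to the new tag" } }
      ],
      "minimum_should_match": 1
    }
  }
}
```

Such a query returns only the annotated records matching the newly added tag or its associated keywords, so only those records need to be re-labeled rather than the whole data set.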
Alternatively, the data storage device 300 may be a single storage device or a storage cluster (distributed or centralized). The data storage device 300 may include storage media such as mass memory, removable memory, volatile read-write memory, or read-only memory (ROM), or any combination thereof. By way of example, mass storage may include magnetic disks, optical discs, solid-state drives, and the like; removable memory may include flash drives, floppy disks, optical discs, memory cards, zip disks, tapes, and the like; volatile read-write memory may include random access memory (RAM), such as dynamic RAM (DRAM), double data rate synchronous dynamic RAM (DDR SDRAM), static RAM (SRAM), thyristor-based random access memory (T-RAM), zero-capacitor RAM (Z-RAM), and the like; and ROM may include mask ROM (MROM), programmable ROM (PROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), compact disc ROM (CD-ROM), digital versatile disc ROM (DVD-ROM), and the like.
When the data storage device 300 is a storage cluster formed by a plurality of storage devices, the storage medium may be deployed on the plurality of storage devices in a distributed manner.
Optionally, in this embodiment, the data processing device 100, the data providing device 200, and the data storage device 300 may be the same device, or may be different devices, for example, all of which are server devices providing intelligent customer service, and this embodiment is not limited thereto.
Network 400 may be used for the exchange of information and/or data. Network 400 may include a wired network, a wireless network, a fiber-optic network, a telecommunications network, an intranet, the Internet, a local area network (LAN), a wide area network (WAN), a wireless local area network (WLAN), a metropolitan area network (MAN), a public switched telephone network (PSTN), a Bluetooth network, a ZigBee network, or a near-field communication (NFC) network, among others, or any combination thereof. In some examples, network 400 may include one or more network access points. For example, network 400 may include wired or wireless network access points, such as base stations and/or network switching nodes.
Fig. 2 illustrates a schematic diagram of exemplary hardware and software components of a data processing device 100 that may implement the concepts of the present application, according to some embodiments of the present application. For example, the processor 130 may be used on the data processing device 100 and used to perform the functions in the embodiments of the present application.
Alternatively, the data processing apparatus 100 may be a single electronic device, for example, a server, a personal computer, or other special devices, and the data processing apparatus 100 may also be a cluster formed by a plurality of electronic devices, for example, a server cluster formed by a plurality of servers, and the electronic devices in the cluster may implement the functions described in this embodiment in a distributed manner.
For example, the data processing device 100 may include one or more processors 130 for executing computer programs, a system bus 140, a network port 150 connected to a network, and a storage medium 120 of a different form, such as a disk, ROM, RAM, or any combination thereof. Illustratively, the data processing device 100 may also include computer programs stored in ROM, RAM, or other types of non-transitory storage media, or any combination thereof. According to these computer programs, the methods provided by the embodiments of the present application can be implemented. The data processing device 100 may also include Input/Output (I/O) interfaces 160 with other Input/Output devices (e.g., keyboard, display screen).
In some examples, processor 130 may include one or more processing cores (e.g., a single-core processor or a multi-core processor). Merely as examples, the processor 130 may include a central processing unit (CPU), an application-specific integrated circuit (ASIC), an application-specific instruction-set processor (ASIP), a graphics processing unit (GPU), a physics processing unit (PPU), a digital signal processor (DSP), a field-programmable gate array (FPGA), a programmable logic device (PLD), a controller, a microcontroller unit, a reduced instruction set computer (RISC), a microprocessor, or the like, or any combination thereof.
For convenience of illustration, only one processor is described in the data processing apparatus 100, however, it should be noted that the data processing apparatus 100 of the present application may also include a plurality of processors, and thus the steps performed by one processor described in the present application may also be performed by a plurality of processors in combination or individually. For example, if the processor of the data processing device 100 executes steps a and B, it should be understood that steps a and B may also be executed by two different processors together, or executed separately on one processor. For example, a first processor performs step a and a second processor performs step B, or the first processor and the second processor perform steps a and B together.
It should be understood that the configuration shown in Fig. 2 is merely an example, and the data processing device 100 may also include more components than those shown in Fig. 2.
Referring to fig. 3, fig. 3 is a diagram illustrating a data annotation method provided in this embodiment, where the method can be applied to a data processing apparatus 100. The individual steps involved in the method are described in detail below.
Step S110, carrying out at least one iteration process on a preset classification model to ensure that the accuracy of the classification model meets a preset condition, and obtaining the trained classification model.
In this embodiment, the plurality of pieces of data to be labeled may be all pieces of data that currently need to be labeled. The data to be labeled can be user question information acquired from the intelligent customer service system, where one piece of data to be labeled can be at least one statement unit in one piece of user question information; for example, the data to be labeled can be a complete piece of user question information or one statement in a piece of user question information. Of course, the data to be labeled may also be data to be identified obtained from other systems, such as face images in the face recognition system described above.
The classification model may be any Machine learning classification model, such as a Random Forest (Random Forest) model, a FastText (fast text) model, a Support Vector Machine (Support Vector Machine), and so forth. In a possible implementation, the classification model may employ a text classification model, such as a FastText model, considering that the user question information is usually text information, i.e. the data to be labeled is usually text information. In other possible implementations, the classification model may also be a non-text classification model, such as an image classification model, in which case the data to be annotated may be converted into image information that can be recognized by the image classification model.
In the present embodiment, each iteration process includes the steps shown in fig. 4.
Step S41, respectively inputting other data to be labeled, except for a target data set, in the multiple pieces of data to be labeled into the classification model, and obtaining respective classification results of the other data to be labeled.
The other data to be labeled refers to data to be labeled in the plurality of data to be labeled, except the data to be labeled in the target data set.
In a possible implementation, before the first iteration, the classification model may be an untrained model, and correspondingly, the target data set may be an empty set. In this case, in the first iteration, the other data to be labeled is the plurality of pieces of data to be labeled.
In yet another possible implementation, the classification model may be initially trained before the first iteration is performed. In this case, the method may further include the steps as shown in fig. 5 before the step S41 is performed for the first time.
Step S51, selecting a part of the plurality of pieces of data to be labeled as the target data set.
In an optional manner, 2%-5% of the plurality of pieces of data to be annotated may be selected as the target data set. In other optional manners, a greater or lesser proportion of the data to be annotated may also be selected as the target data set.
Step S52, training a pre-established classification model according to the artificial labeling result of the data to be labeled in the target data set to obtain the preset classification model.
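As an illustration only, the random selection in step S51 can be sketched as follows; `seed_target_set`, the fixed random seed, and the default 3% fraction are hypothetical choices for this sketch, not part of the embodiment:

```python
import random

def seed_target_set(samples, fraction=0.03, rng=None):
    """Randomly draw roughly 2%-5% of the pool as the initial target data set."""
    rng = rng or random.Random(0)               # fixed seed for reproducibility
    k = max(1, round(len(samples) * fraction))  # at least one sample
    return rng.sample(samples, k)

pool = [f"q{i}" for i in range(100)]  # 100 pieces of data to be labeled
seed = seed_target_set(pool)          # 3 pieces at the default 3% fraction
```

The seed set is then manually labeled (step S52) before the first iteration begins.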
And step S42, selecting at least part of data to be labeled with the confidence coefficient within a preset range from the other data to be labeled according to the confidence coefficient of the classification result, and adding the selected data to be labeled into the target data set.
In this embodiment, the classification result includes at least one category label and a confidence level of each category label, where the confidence level of any category label represents a probability that the input data to be labeled belongs to a category indicated by the category label. For example, after a certain data x to be labeled is input into the classification model, the confidence of the output class label a is 50%, which indicates that: the probability that the data x to be labeled belongs to the category indicated by the category label a is 50%. Correspondingly, it also means that the classification model has difficulty distinguishing the category of the data x to be labeled.
Alternatively, the preset range may be a range indicating that the classification model has difficulty distinguishing the data to be labeled. For example, when a category label with a confidence of 40%-60% exists in the classification result of any data to be labeled, the classification model may be considered to have difficulty distinguishing that data. For another example, in an application scenario with a high accuracy requirement, the classification model may be considered to have difficulty distinguishing the data to be labeled when a category label with a confidence of 30%-70% exists in its classification result.
In this embodiment, when the classification model is a binary classification model, each classification result usually includes a class label and a confidence of the class label. In this case, step S42 may include the steps of:
selecting data to be annotated with confidence coefficient lower than a preset threshold value from the other data to be annotated;
and adding at least part of the selected data to be marked into the target data set.
The preset threshold may be set according to the requirement of the application scenario on classification accuracy, and may be any value in the range of 55%-70%, for example, 60%.
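A minimal sketch of this screening step for the binary case, assuming the trained model exposes a `(label, confidence)` prediction per sample; all names here are illustrative:

```python
def select_uncertain(samples, classify, threshold=0.6):
    """Return the samples whose predicted confidence falls below `threshold`,
    i.e. the samples the binary classifier finds hard to distinguish."""
    return [s for s in samples if classify(s)[1] < threshold]

# Toy stand-in for the trained model: a fixed (label, confidence) per sample.
predictions = {"q1": ("A", 0.95), "q2": ("B", 0.55),
               "q3": ("A", 0.58), "q4": ("B", 0.80)}
picked = select_uncertain(["q1", "q2", "q3", "q4"], predictions.get)
# picked holds q2 and q3, the low-confidence samples sent for manual labeling
```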
Through the process, the data to be labeled, which are difficult to distinguish by the classification model, can be screened out, the manual labeling is carried out, and then the classification model is trained according to the manual labeling result. This approach is called active learning, and by active learning, the accuracy of the classification model can be improved.
For example, fig. 6(a-c) shows an effect diagram of active learning, where fig. 6(a) shows a plurality of pieces of data to be labeled in an example, the plurality of pieces of data to be labeled include two types of data, and the two types of data are respectively represented by circles and triangles.
In each iteration process, if a random labeling manner is adopted, the part of the data to be labeled shown in fig. 6(a) is manually labeled, a classification model is trained according to an obtained manual labeling result, and then the classification model obtained through training is adopted to classify the data to be labeled in fig. 6(a), so that a classification plane S1 shown in fig. 6(b) can be obtained. Wherein circles and triangles with shaded portions represent artificial labeling results.
In each iteration process, if an active learning manner is adopted, that is, part of the data to be labeled is selected for manual labeling in the manner described in the above step S42, a classification model is trained according to the obtained manual labeling result, and then the obtained classification model is used for classifying the data to be labeled in fig. 6(a), so as to obtain a classification plane S2 shown in fig. 6 (c). Wherein circles and triangles with shaded portions represent artificial labeling results. Combining fig. 6(b) and fig. 6(c), it can be seen that the classification model obtained by the active learning method has higher classification accuracy.
In order to reduce the workload of manual annotation, a part of the data to be annotated, the confidence of which is lower than a preset threshold value, can be selected for manual annotation. Correspondingly, the adding of at least part of the selected data to be labeled to the target dataset can be realized by the following steps:
sorting the selected data to be labeled according to the confidence degree of the classification result;
and according to the sorting result, sequentially selecting a preset number of pieces of data to be marked from the selected data to be marked according to the sequence of the confidence degrees from small to large, and adding the data to be marked into the target data set.
In this embodiment, the preset number may be determined according to the total amount of the data that needs to be labeled, that is, according to the total amount of the plurality of pieces of data to be labeled. For example, the preset number may be 2%-5% of the total amount of the plurality of pieces of data to be labeled, or a greater or lesser proportion.
In an example, if there are 5 pieces of the selected data to be labeled, the confidence degrees of the classification results are 50%, 52%, 59%, 60%, and 54%, respectively, and assuming that the preset number is 3, three pieces of data to be labeled corresponding to three classification results with confidence degrees of 50%, 52%, and 54% may be selected and added to the target data set.
In yet another example, if there are 6 pieces of selected data to be labeled, with classification-result confidences of 50%, 50%, 52%, 58%, 59%, and 60%, respectively, and the preset number is 3, the two pieces of data to be labeled whose classification results have a confidence of 50% and the one piece whose classification result has a confidence of 52% may be selected and added to the target data set.
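The sorting steps above, applied to the first numeric example, can be sketched as follows (function and variable names are illustrative):

```python
def pick_lowest(candidates, preset_number):
    """Sort (sample, confidence) pairs by confidence ascending and take the
    first `preset_number` samples for manual annotation."""
    ranked = sorted(candidates, key=lambda pair: pair[1])
    return [sample for sample, _ in ranked[:preset_number]]

# First example from the text: 5 candidates, preset number 3.
cands = [("a", 0.50), ("b", 0.52), ("c", 0.59), ("d", 0.60), ("e", 0.54)]
chosen = pick_lowest(cands, 3)  # the samples with confidences 50%, 52%, 54%
```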
In this embodiment, when the classification model is a multi-classification model, each classification result usually includes a plurality of class labels and the confidence of each class label, where the sum of the confidences of the plurality of class labels is 1. It should be appreciated that in some cases the confidences output by the multi-classification model may not sum exactly to 1, for example, due to rounding when the model divides three classes; when the deviation between the sum of the confidences and 1 is within a certain range (e.g., 0-0.2), the sum can still be regarded as 1.
In the above case, step S42 may include the steps of:
and selecting the data to be labeled with a preset classification result from the other data to be labeled, and adding at least part of the selected data to be labeled into the target data set.
Wherein the preset classification result is a classification result in which the confidence of at least one classification label is 40% -60%. It should be understood that 40% -60% may be replaced by other ranges, such as 30% -70%.
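A small sketch of this selection rule for a multi-classification model; `has_ambiguous_label` is a hypothetical helper, and the 40%-60% bounds can be swapped for 30%-70% as noted above:

```python
def has_ambiguous_label(confidences, low=0.40, high=0.60):
    """True when at least one class-label confidence falls within [low, high],
    i.e. the classification result matches the preset classification result."""
    return any(low <= c <= high for c in confidences)

# Three-class toy results (confidences sum to roughly 1).
r1 = [0.85, 0.10, 0.05]  # clearly class 0 -> not selected for manual labeling
r2 = [0.45, 0.35, 0.20]  # an ambiguous label exists -> selected
```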
Optionally, in this embodiment, in order to avoid an excessively large number of manual annotations, at least part of the selected data to be annotated is added to the target dataset, and the following steps may be implemented:
randomly selecting a preset number of pieces of data to be marked from the selected data to be marked, and adding the data to be marked into the target data set.
The preset number may be 2% to 5% of the plurality of pieces of data to be labeled, and may also be a greater or lesser number.
And step S43, training the classification model according to the artificial marking result of the data to be marked in the target data set.
In each iteration process, after the at least part of data to be labeled is added to the target data set, a manual labeling result of the at least part of data to be labeled can be obtained. In practical application, because different people have different understandings of the same data to be labeled or the same category label, the manual labeling result of the data to be labeled may have a deviation from the category to which the data to be labeled actually belongs. In this embodiment, multiple users are adopted to label the same data to be labeled, and then a manual labeling result of the data to be labeled is determined based on a voting method (voting), so as to at least partially improve the above problem.
Correspondingly, the manual annotation result of the at least part of the data to be annotated can be obtained through the steps shown in fig. 7.
Step S71, for each to-be-labeled data in the at least part of to-be-labeled data, obtaining a plurality of category labels of the to-be-labeled data input by different users.
And step S72, selecting the category label with the largest occurrence frequency from the plurality of labels, and adding the category label to the data to be labeled to obtain the manual labeling result of the data to be labeled.
In implementation, for each piece of data to be labeled, a plurality of users may each add a category label to it, so that the data to be labeled has a plurality of category labels. As described above, these category labels may differ because different people understand the data differently. Thus, the category label with the largest number of occurrences (i.e., the highest occurrence frequency) among the plurality of category labels may be added to the data to be labeled. The manual labeling result of the data to be labeled comprises the data to be labeled and the category label added to it.
In one example, if 3 users add category label a, category label b, and category label a, respectively, to a certain piece of data x to be labeled, the category label a with the largest number of occurrences (2 times) may be added to the data x to be labeled.
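The voting of steps S71-S72 amounts to a majority vote over the user-supplied labels, which can be sketched with the standard-library `Counter` (a real implementation would also need a tie-breaking rule, which the embodiment does not specify):

```python
from collections import Counter

def vote_label(labels):
    """Majority vote: return the category label that occurs most often
    among the labels entered by different users."""
    return Counter(labels).most_common(1)[0][0]

# Example from the text: annotators give labels a, b, a for data x.
result = vote_label(["a", "b", "a"])  # label a occurs most often (2 times)
```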
Optionally, the number of the plurality of users may be 3 to 5, and certainly may be more, which is not limited in this embodiment.
It should be understood that the manual labeling result mentioned in the above step S52 can also be obtained by using the steps shown in fig. 7.
The same data to be labeled is labeled by a plurality of users, and the manual labeling result of the data to be labeled is determined based on a voting method, so that the classification precision and the generalization capability of the classification model can be significantly improved in subsequent training. Compared with the random labeling mode, a classification model with the same classification precision can be obtained with less training data (that is, fewer manual labeling results).
Optionally, after each training of the classification model, that is, after each execution of step S43, it may be determined whether the accuracy of the classification model meets a preset condition, and if the accuracy meets the preset condition, the current classification model is directly used as the trained classification model. If the preset condition is not met, the next iteration process is continued, that is, the process returns to step S41.
Optionally, in this embodiment, whether the accuracy of the classification model meets a preset condition may be determined by:
testing the classification model through a preset test set to obtain a test accuracy;
and if the test accuracy meets the preset condition, taking the classification model as a trained classification model.
In this embodiment, the test set includes a certain number of manual annotation results of the data to be annotated. Optionally, the data to be labeled in the test set may be selected from the multiple pieces of data to be labeled, or may be data to be labeled, which is obtained from an intelligent customer service system (or another system) and is different from the multiple pieces of data to be labeled.
Optionally, the manual annotation result of the data to be annotated in the test set can be obtained through the steps shown in fig. 7.
The test accuracy may refer to the proportion of the data to be labeled in the test set whose classification result output by the classification model is consistent with the manual labeling result, relative to the total number of pieces of data to be labeled included in the test set. The preset condition may be that the test accuracy reaches a preset value, and the preset value may be set according to the required classification accuracy; it may be any value between 80% and 100%, for example, 90%.
It should be understood that, in the present embodiment, after step S52 is executed, before the first iteration process is performed on the classification model, whether the accuracy of the classification model meets the preset condition may also be determined in the above manner. In practical applications, in order to meet the user requirement, the preset value in the preset condition is usually set to be higher, and therefore, the iterative process is usually performed at least once.
And step S120, processing at least one part of the multiple pieces of data to be labeled by adopting the trained classification model to obtain an automatic labeling result.
Optionally, in an optional implementation manner, the trained classification model may be used to re-label each piece of data to be labeled in the plurality of pieces of data to be labeled, so as to obtain an automatic labeling result.
In another optional implementation manner, in consideration that all the data to be labeled in the target data set have obtained a manual labeling result in a manual manner, the trained classification model may be used to process other data to be labeled except for the target data set, so as to obtain an automatic labeling result. Further, under the condition that the data in the test set is selected from the plurality of pieces of data to be labeled, the trained classification model can be adopted to process the data to be labeled except the target data set and the test set so as to obtain an automatic labeling result.
In this way, most of the plurality of pieces of data to be labeled can be labeled automatically, with higher accuracy than purely manual labeling, so that the accuracy of a machine learning model subsequently trained in a supervised manner based on these labeling results can also be improved to some extent.
The above method is further illustrated by a specific example below.
Taking the intelligent customer service system of an online ride-hailing platform as an example, if the scenario to which a piece of user question information belongs needs to be accurately identified, so that the user question information can be replied to based on the settings of that scenario, the user question information of each scenario needs to be labeled, and the labeled data of each scenario is then used to train a corresponding scenario recognition model. In the complaint progress query scenario, user question information is generally labeled into two categories: statements unrelated to the complaint progress, and statements expressing the user's intention to query the complaint progress.
When the method is implemented, the user data can be obtained from the intelligent customer service system, the reply data of the intelligent customer service system is deleted from the user data, and only the user question information is reserved. If 10000 pieces of user question information are finally obtained, each piece of user question information may be regarded as one piece of data to be labeled in this embodiment, and the 10000 pieces of data to be labeled are the multiple pieces of data to be labeled in this embodiment.
The data annotation method provided in this embodiment may include the following steps to implement annotation on the 10000 pieces of data to be annotated.
Firstly, 500 pieces of user question information are selected from 10000 pieces of user question information, at least 3 category labels which are set for the user question information by people are obtained aiming at each piece of user question information, one category label with the largest occurrence frequency is selected from the category labels as a manual labeling result of the user question information, and then 500 manual labeling results are obtained.
Secondly, the obtained 500 manual labeling results are used as a test set for testing the accuracy of the classification model.
Wherein the test set may be stored in Elasticsearch. Through the above steps, 9500 pieces of unlabeled user question information remain.
Thirdly, 200 pieces of user question information are selected from the remaining 9500 pieces of user question information and added to the target data set. At least 3 category labels which are set for the user question information by people are obtained according to the user question information in the 200 pieces of user question information, and one category label with the largest occurrence frequency is selected from the category labels as a manual labeling result of the user question information, so that 200 manual labeling results are obtained.
Wherein the target data set may be stored in the aforementioned ElasticSearch.
Fourthly, training a pre-established classification model (assumed to be a FastText model) by using 200 manual labeling results in the target data set.
And fifthly, for the current FastText model (namely, the FastText model obtained through the training in step four), respectively inputting the 500 pieces of user question information in the test set into the FastText model to obtain their classification results, determining the user question information whose classification result is consistent with the manual labeling result from the 500 pieces, and calculating the proportion of the determined user question information among the 500 pieces. If the proportion is smaller than the preset value (for example, 90%), the next step is executed. In an actual measurement, the proportion calculated in this step was 55%, which is the accuracy of the current FastText model.
Sixthly, predicting the rest 9300 pieces of user question information by adopting the classification model obtained by training to obtain the category label of each piece of user question information and the probability (namely, the confidence coefficient) that the user question information belongs to the category indicated by the category label.
Seventhly, determining user question information with the probability of the category label lower than 0.6 from 9300 pieces of user question information; and sequencing the determined user question information according to the probability of the category label, and selecting 200 pieces of user question information to be added into the target data set according to the sequencing result and the sequence of the probability of the category label from small to large.
And eighthly, aiming at the 200 pieces of user question information in the step seven, 200 manual labeling results are obtained according to the mode described in the step three.
And ninthly, training the FastText model by adopting the manual labeling results of the 400 pieces of user question information in the target data set.
And tenth, testing the current FastText model (namely the FastText model obtained through the training in the ninth step) by adopting the test set according to the mode in the fifth step. If the accuracy obtained by the test is lower than 90%, the sixth step to the tenth step can be repeatedly executed, and if the accuracy obtained by the test reaches 90%, the current FastText model can be determined as the FastText model after the training is finished.
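Steps six through ten form a loop: predict, pick the least-confident batch, label it manually, retrain, and stop once the test accuracy reaches the preset value. The heavily simplified sketch below replaces the FastText model with a toy 1-D threshold classifier and simulates manual annotation; every name and number is illustrative, not part of the embodiment:

```python
def active_learning(pool, test_set, train, predict, annotate,
                    batch=2, target_acc=0.9, max_rounds=10):
    """Repeat: train on the labeled set, test, and while accuracy is below
    target, move the `batch` least-confident pool samples to the labeled
    set via (simulated) manual annotation."""
    labeled = annotate(pool[:batch])       # initial seed (cf. step three)
    pool = pool[batch:]
    model, acc = None, 0.0
    for _ in range(max_rounds):
        model = train(labeled)             # cf. steps four and nine
        acc = sum(predict(model, x)[0] == y
                  for x, y in test_set) / len(test_set)
        if acc >= target_acc or not pool:  # cf. steps five and ten
            break
        pool.sort(key=lambda x: predict(model, x)[1])  # least confident first
        labeled += annotate(pool[:batch])  # cf. steps seven and eight
        pool = pool[batch:]
    return model, acc

# Toy instantiation: 1-D samples, true class = (x >= 0.5).
def annotate(xs): return [(x, x >= 0.5) for x in xs]        # "manual" labels
def train(labeled):                                         # threshold "model"
    pos = [x for x, y in labeled if y]
    neg = [x for x, y in labeled if not y]
    return ((min(pos) if pos else 1.0) + (max(neg) if neg else 0.0)) / 2
def predict(thr, x): return (x >= thr, 0.5 + abs(x - thr))  # label, confidence

pool = [0.1, 0.2, 0.45, 0.55, 0.9, 0.8, 0.48, 0.52]
tests = annotate([0.05, 0.3, 0.47, 0.53, 0.7, 0.95])
model, acc = active_learning(pool, tests, train, predict, annotate)
```

In practice `train` and `predict` would wrap the FastText training and inference, and `annotate` would collect the voted manual labels described in steps three and eight.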
Referring to fig. 8, fig. 8 is a table of accuracy comparison obtained by the above procedure and based on random labeling. When the user question information in the target data set reaches 2000 pieces, the FastText model is trained through the manual labeling result of the 2000 pieces of user question information, the accuracy of the obtained FastText model can reach 91%, and the FastText model meets preset conditions. Training is carried out in a random labeling mode, 9500 manual labeling results are required for training, and the accuracy can reach 91%. Therefore, by the data annotation method provided by the embodiment of the application, the manual annotation can be reduced by 80% under the condition of improving the accuracy, and the development cycle of the service is shortened.
It should be understood that, while the above example is given for the complaint progress query scenario, the principles and procedures described above also apply to other scenarios, such as service scoring and first-complaint scenarios.
Referring to fig. 9, fig. 9 is a block diagram of a data annotation device 110 provided in an embodiment of the present application, where functions implemented by the data annotation device 110 correspond to steps of the data annotation method. The data annotation device 110 may be understood as the data processing apparatus 100, or the processor 130 of the data processing apparatus 100, or may be understood as a component which is independent from the data processing apparatus 100 and implements the functions of the embodiments of the present application under the control of the data processing apparatus 100. As shown in FIG. 9, the data annotation device 110 can include a training module 111 and an automatic annotation module 112.
The training module 111 is configured to perform at least one iteration on a preset classification model, so that the accuracy of the classification model meets a preset condition, and a trained classification model is obtained.
Wherein each iteration process comprises:
inputting other data to be labeled except for a target data set in a plurality of data to be labeled into the classification model respectively to obtain respective classification results of the other data to be labeled; according to the confidence degree of the classification result, at least part of data to be labeled with the confidence degree within a preset range is selected from the other data to be labeled and added into the target data set; and training the classification model according to the manual labeling result of the data to be labeled in the target data set.
The automatic labeling module 112 is configured to process at least a part of the data to be labeled by using the trained classification model to obtain an automatic labeling result.
Optionally, in this embodiment, the classification result may include a category label and a confidence of the category label. In this case, the training module 111 may select, according to the confidence of the classification result, at least a part of the data to be labeled, which has a confidence in a preset range, from the other data to be labeled, and add the selected part of the data to be labeled to the target data set by:
and selecting data to be labeled with the confidence level lower than a preset threshold value from the other data to be labeled, and adding at least part of the selected data to be labeled into the target data set.
Further, the training module 111 adds at least part of the selected data to be labeled to the target dataset by:
sorting the selected data to be labeled according to the confidence degree of the classification result;
and according to the sorting result, sequentially selecting a preset number of pieces of data to be marked from the selected data to be marked according to the sequence of the confidence degrees from small to large, and adding the data to be marked into the target data set.
Optionally, the classification result may include a plurality of class labels and the confidence of each class label, where the sum of the confidences of the class labels is 1. In this case, the training module 111 may select, according to the confidence of the classification result, at least part of the data to be labeled with the confidence within a preset range from the other data to be labeled, and add the selected data to the target data set, in the following manner:
and selecting the data to be labeled with a preset classification result from the other data to be labeled, and adding at least part of the selected data to be labeled into the target data set.
Wherein the preset classification result is a classification result in which the confidence of at least one class label is 40% -60%.
Further, the training module 111 may add at least part of the selected data to be labeled to the target data set by:
randomly selecting a preset number of pieces of data to be marked from the selected data to be marked, and adding the data to be marked into the target data set.
Optionally, the training module 111 may be further configured to, during each iteration, obtain a manual labeling result of at least part of the data to be labeled before training the classification model according to the manual labeling result of the data to be labeled in the target data set.
Optionally, in this embodiment, the training module 111 may obtain the manual labeling result of at least part of the data to be labeled by:
aiming at each data to be labeled in at least part of the data to be labeled, acquiring a plurality of category labels of the data to be labeled, which are input by different users;
and selecting the category label with the largest occurrence frequency from the plurality of labels, and adding the category label to the data to be labeled to obtain the manual labeling result of the data to be labeled.
Optionally, in this embodiment, the data annotation device 110 may further include a data storage module 113.
The data storage module 113 is configured to store the manual annotation result of the at least part of the data to be annotated in a search engine supporting a visualization tool; and saving the automatic labeling result to the search engine.
Optionally, in this embodiment, the data annotation device 110 may further include a pre-training module 114.
The pre-training module 114 is configured to determine an empty set as the target data set before the training module 111 is executed; or selecting a part of the plurality of pieces of data to be labeled as a target data set, and training a pre-established classification model according to an artificial labeling result of the data to be labeled in the target data set to obtain the preset classification model.
Optionally, in this embodiment, the training module 111 may be further configured to, during each iteration, train the classification model according to a manual labeling result of data to be labeled in the target data set, and then test the classification model through a preset test set to obtain a test accuracy; and if the test accuracy meets the preset condition, taking the classification model as the trained classification model.
Optionally, the automatic labeling module 112 may be specifically configured to process each piece of data to be labeled by using the trained classification model; or processing other data to be labeled except the target data set in the multiple pieces of data to be labeled by adopting the trained classification model.
The various modules described above may be connected or in communication with each other via a wired or wireless connection. The wired connection may include a metal cable, an optical cable, a hybrid cable, etc., or any combination thereof, among others. The wireless connection may comprise a connection in the form of a LAN, WAN, bluetooth, ZigBee, NFC, or the like, or any combination thereof. Two or more modules may be combined into a single module, and any one module may be divided into two or more units.
The embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the steps of the data annotation method are performed.
To sum up, the data labeling method, the data labeling device, and the data processing apparatus provided in the embodiments of the present application perform the following steps at least once on a classification model, so that the accuracy of the classification model meets a preset condition: inputting other data to be labeled, except the target data set, into the classification model respectively to obtain classification results; selecting at least part of the data to be labeled whose classification-result confidences are in a preset range from the other data to be labeled, and adding it to the target data set; and training the classification model according to the manual labeling result of the data to be labeled in the target data set. At least part of the data to be labeled is then processed by the classification model to obtain an automatic labeling result. Through this design, automatic labeling of batch data can be realized while improving data labeling quality.
It is clear to those skilled in the art that, for convenience and brevity of description, reference may be made to the corresponding process in the method embodiments for the specific working process of the apparatus described above, which is not repeated herein. In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative; for example, the division of the modules is merely a logical division, and another division may be used in an actual implementation; for example, a plurality of modules or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the shown or discussed mutual coupling, direct coupling, or communication connection may be an indirect coupling or communication connection between devices or modules through some communication interfaces, and may be in an electrical, mechanical, or other form.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as a stand-alone product, they may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present application, or the part thereof contributing over the prior art, may be embodied in the form of a software product. The software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.
The above description covers only specific embodiments of the present application, but the scope of the present application is not limited thereto. Any changes or substitutions that a person skilled in the art could readily conceive of within the technical scope disclosed in the present application shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (24)

1. A method for annotating data, the method comprising:
performing iterative processing at least once on a preset classification model so that the accuracy of the classification model meets a preset condition, to obtain a trained classification model;
processing at least a part of multiple pieces of data to be labeled by using the trained classification model to obtain an automatic labeling result;
wherein each iteration process comprises:
inputting the data to be labeled other than a target data set among the multiple pieces of data to be labeled into the classification model respectively, to obtain a classification result of each of the other data to be labeled; selecting, from the other data to be labeled according to the confidence of the classification results, at least part of the data to be labeled whose confidence is within a preset range, and adding it into the target data set; and training the classification model according to the manual labeling result of the data to be labeled in the target data set.
2. The method of claim 1, wherein the classification result comprises a class label and a confidence level of the class label;
selecting at least part of data to be labeled with the confidence degree within a preset range from the other data to be labeled according to the confidence degree of the classification result, and adding the data to be labeled into the target data set, wherein the method comprises the following steps:
and selecting data to be labeled with the confidence level lower than a preset threshold value from the other data to be labeled, and adding at least part of the selected data to be labeled into the target data set.
3. The method of claim 2, wherein adding at least some of the selected data to be labeled to the target dataset comprises:
sorting the selected data to be labeled according to the confidence degree of the classification result;
and according to the sorting result, sequentially selecting a preset number of pieces of data to be labeled from the selected data to be labeled in ascending order of confidence, and adding them into the target data set.
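The selection rule of claims 2 and 3 (keep only results whose confidence is below a preset threshold, then take a preset number of them in ascending order of confidence) can be sketched as follows; the function and parameter names are hypothetical illustrations only:

```python
def select_for_manual_labeling(classified, threshold, k):
    """Pick the k least confident samples among those whose classification
    confidence is below the preset threshold.

    `classified` is a list of (sample, confidence) pairs."""
    # Keep only samples whose confidence is below the preset threshold.
    candidates = [(s, c) for s, c in classified if c < threshold]
    # Sort by confidence, ascending, and take the first k.
    candidates.sort(key=lambda pair: pair[1])
    return [s for s, _ in candidates[:k]]
```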
4. The method of claim 1, wherein the classification result comprises a plurality of class labels and a confidence level of each class label, and the sum of the confidence levels of the class labels is 1;
selecting at least part of data to be labeled with the confidence degree within a preset range from the other data to be labeled according to the confidence degree of the classification result, and adding the data to be labeled into the target data set, wherein the method comprises the following steps:
selecting data to be labeled having a preset classification result from the other data to be labeled, and adding at least part of the selected data to be labeled into the target data set; wherein the preset classification result is a classification result in which the confidence of at least one class label is between 40% and 60%.
5. The method of claim 4, wherein adding at least some of the selected data to be labeled to the target dataset comprises:
randomly selecting a preset number of pieces of data to be labeled from the selected data to be labeled, and adding them into the target data set.
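The variant of claims 4 and 5 works on multi-label results whose confidences sum to 1: a sample qualifies when at least one class label's confidence falls in the preset 40%-60% band, and a preset number of qualifying samples are then chosen at random. A hypothetical sketch (names are illustrative, not from the application):

```python
import random

def select_ambiguous(results, k, low=0.40, high=0.60):
    """`results` maps each sample to its per-label confidences (summing
    to 1).  A sample qualifies when at least one label's confidence lies
    in the preset [low, high] band; k qualifying samples are then picked
    at random for manual labeling."""
    candidates = [s for s, probs in results.items()
                  if any(low <= p <= high for p in probs)]
    # random.sample draws without replacement; cap k at the pool size.
    return random.sample(candidates, min(k, len(candidates)))
```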
6. The method according to any one of claims 1-5, wherein each iteration process further comprises:
and acquiring the manual labeling result of the at least part of the data to be labeled before training the classification model according to the manual labeling result of the data to be labeled in the target data set.
7. The method according to claim 6, wherein obtaining the manual annotation result of the at least part of the data to be annotated comprises:
for each piece of data to be labeled in the at least part of the data to be labeled, acquiring a plurality of category labels input by different users for the data to be labeled;
and selecting, from the plurality of category labels, the category label that occurs most frequently, and adding it to the data to be labeled to obtain the manual labeling result of the data to be labeled.
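The majority-vote rule of claim 7 can be sketched in a few lines; the function name is a hypothetical illustration:

```python
from collections import Counter

def manual_labeling_result(user_labels):
    """Return the class label entered most often by different users for
    one piece of data to be labeled."""
    # most_common(1) yields the (label, count) pair with the highest count.
    return Counter(user_labels).most_common(1)[0][0]
```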
8. The method of claim 6, further comprising:
storing the manual labeling result of at least part of the data to be labeled into a search engine supporting a visualization tool;
and saving the automatic labeling result into the search engine.
9. The method according to any of claims 1-5, wherein prior to performing the first iterative processing on the preset classification model, the method further comprises:
determining an empty set as the target data set; or
selecting a part of the multiple pieces of data to be labeled as the target data set, and training a pre-established classification model according to the manual labeling result of the data to be labeled in the target data set to obtain the preset classification model.
10. The method according to any one of claims 1-5, wherein each iteration process further comprises:
after training the classification model according to the manual labeling result of the data to be labeled in the target data set, testing the classification model through a preset test set to obtain test accuracy;
and if the test accuracy meets the preset condition, taking the classification model as the trained classification model.
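The stopping check of claim 10 (test the retrained model on a preset test set and accept it once the test accuracy meets the preset condition) can be sketched as follows, assuming a model with a `predict` method; all names are hypothetical:

```python
def meets_preset_condition(model, test_set, required_accuracy):
    """Score the model on a preset test set of (sample, true_label)
    pairs and report whether the test accuracy meets the preset
    condition."""
    correct = sum(1 for x, y in test_set if model.predict(x) == y)
    return correct / len(test_set) >= required_accuracy
```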
11. The method according to any one of claims 1 to 5, wherein processing at least a part of the plurality of pieces of data to be labeled using the trained classification model comprises:
processing each piece of data to be labeled by using the trained classification model; or
processing, by using the trained classification model, the data to be labeled other than the target data set among the multiple pieces of data to be labeled.
12. A data annotation device, said device comprising:
the training module is used for performing iterative processing at least once on a preset classification model so that the accuracy of the classification model meets a preset condition, to obtain a trained classification model;
the automatic labeling module is used for processing at least one part of the data to be labeled by adopting the trained classification model to obtain an automatic labeling result;
wherein each iteration process comprises:
inputting the data to be labeled other than a target data set among multiple pieces of data to be labeled into the classification model respectively, to obtain a classification result of each of the other data to be labeled; selecting, from the other data to be labeled according to the confidence of the classification results, at least part of the data to be labeled whose confidence is within a preset range, and adding it into the target data set; and training the classification model according to the manual labeling result of the data to be labeled in the target data set.
13. The apparatus of claim 12, wherein the classification result comprises a class label and a confidence level of the class label;
the training module selects, from the other data to be labeled according to the confidence of the classification results, at least part of the data to be labeled whose confidence is within a preset range and adds it into the target data set in the following manner:
and selecting data to be labeled with the confidence level lower than a preset threshold value from the other data to be labeled, and adding at least part of the selected data to be labeled into the target data set.
14. The apparatus of claim 13, wherein the training module adds at least some of the selected data to be labeled to the target dataset by:
sorting the selected data to be labeled according to the confidence degree of the classification result;
and according to the sorting result, sequentially selecting a preset number of pieces of data to be labeled from the selected data to be labeled in ascending order of confidence, and adding them into the target data set.
15. The apparatus of claim 12, wherein the classification result comprises a plurality of class labels and a confidence level of each class label, and a sum of the confidence levels of the class labels is 1;
the training module selects, from the other data to be labeled according to the confidence of the classification results, at least part of the data to be labeled whose confidence is within a preset range and adds it into the target data set in the following manner:
selecting data to be labeled having a preset classification result from the other data to be labeled, and adding at least part of the selected data to be labeled into the target data set; wherein the preset classification result is a classification result in which the confidence of at least one class label is between 40% and 60%.
16. The apparatus of claim 15, wherein the training module adds at least some of the selected data to be labeled to the target dataset by:
randomly selecting a preset number of pieces of data to be labeled from the selected data to be labeled, and adding them into the target data set.
17. The apparatus according to any one of claims 12 to 16, wherein the training module is further configured to, during each iterative process, obtain a manual labeling result of the at least part of the data to be labeled before training the classification model according to the manual labeling result of the data to be labeled in the target data set.
18. The apparatus of claim 17, wherein the training module obtains the manual labeling result of the at least part of the data to be labeled by:
for each piece of data to be labeled in the at least part of the data to be labeled, acquiring a plurality of category labels input by different users for the data to be labeled;
and selecting, from the plurality of category labels, the category label that occurs most frequently, and adding it to the data to be labeled to obtain the manual labeling result of the data to be labeled.
19. The apparatus of claim 17, further comprising:
the data storage module is used for storing the manual labeling result of at least part of the data to be labeled into a search engine supporting a visualization tool; and saving the automatic labeling result to the search engine.
20. The apparatus according to any one of claims 12-16, further comprising:
a pre-training module for determining an empty set as the target data set before the training module runs; or selecting a part of the multiple pieces of data to be labeled as the target data set, and training a pre-established classification model according to the manual labeling result of the data to be labeled in the target data set to obtain the preset classification model.
21. The apparatus according to any one of claims 12 to 16, wherein the training module is further configured to, during each iterative process, after training the classification model according to the manual labeling result of the data to be labeled in the target data set, test the classification model through a preset test set to obtain a test accuracy; and if the test accuracy meets the preset condition, taking the classification model as the trained classification model.
22. The apparatus according to any one of claims 12 to 16, wherein the automatic labeling module is specifically configured to process each piece of data to be labeled by using the trained classification model; or processing other data to be labeled except the target data set in the multiple pieces of data to be labeled by adopting the trained classification model.
23. A data processing apparatus, characterized by comprising a processor, a storage medium, and a bus, wherein the storage medium stores machine-readable instructions executable by the processor; when the data processing apparatus runs, the processor communicates with the storage medium via the bus, and the processor executes the machine-readable instructions to perform the steps of the data annotation method according to any one of claims 1 to 11.
24. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the data annotation method according to any one of claims 1 to 11.
CN201811549912.9A 2018-12-18 2018-12-18 Data labeling method and device and data processing equipment Pending CN111340054A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811549912.9A CN111340054A (en) 2018-12-18 2018-12-18 Data labeling method and device and data processing equipment

Publications (1)

Publication Number Publication Date
CN111340054A (en) 2020-06-26

Family

ID=71185064

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811549912.9A Pending CN111340054A (en) 2018-12-18 2018-12-18 Data labeling method and device and data processing equipment

Country Status (1)

Country Link
CN (1) CN111340054A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111859872A (en) * 2020-07-07 2020-10-30 中国建设银行股份有限公司 Text labeling method and device
CN112036166A (en) * 2020-07-22 2020-12-04 大箴(杭州)科技有限公司 Data labeling method and device, storage medium and computer equipment
CN112183321A (en) * 2020-09-27 2021-01-05 深圳奇迹智慧网络有限公司 Method and device for optimizing machine learning model, computer equipment and storage medium
CN112560988A (en) * 2020-12-25 2021-03-26 竹间智能科技(上海)有限公司 Model training method and device
CN113139072A (en) * 2021-04-20 2021-07-20 苏州挚途科技有限公司 Data labeling method and device and electronic equipment
CN113157170A (en) * 2021-03-25 2021-07-23 北京百度网讯科技有限公司 Data labeling method and device
CN113222547A (en) * 2021-05-17 2021-08-06 北京明略昭辉科技有限公司 Project follow-up method, system, electronic equipment and storage medium
CN113627568A (en) * 2021-08-27 2021-11-09 广州文远知行科技有限公司 Bidding supplementing method, device, equipment and readable storage medium
CN115248831A (en) * 2021-04-28 2022-10-28 马上消费金融股份有限公司 Labeling method, device, system, equipment and readable storage medium
CN115826806A (en) * 2023-02-09 2023-03-21 广东粤港澳大湾区硬科技创新研究院 Auxiliary labeling method, device and system for satellite telemetry data

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9536522B1 (en) * 2013-12-30 2017-01-03 Google Inc. Training a natural language processing model with information retrieval model annotations
CN108009589A (en) * 2017-12-12 2018-05-08 腾讯科技(深圳)有限公司 Sample data processing method, device and computer-readable recording medium
CN108875963A (en) * 2018-06-28 2018-11-23 北京字节跳动网络技术有限公司 Optimization method, device, terminal device and the storage medium of machine learning model
CN108875821A (en) * 2018-06-08 2018-11-23 Oppo广东移动通信有限公司 The training method and device of disaggregated model, mobile terminal, readable storage medium storing program for executing


Similar Documents

Publication Publication Date Title
CN111340054A (en) Data labeling method and device and data processing equipment
US20210342371A1 (en) Method and Apparatus for Processing Knowledge Graph
EP4040401A1 (en) Image processing method and apparatus, device and storage medium
CN112579727B (en) Document content extraction method and device, electronic equipment and storage medium
CN110674312B (en) Method, device and medium for constructing knowledge graph and electronic equipment
CN112163424A (en) Data labeling method, device, equipment and medium
US20200364216A1 (en) Method, apparatus and storage medium for updating model parameter
US20230045330A1 (en) Multi-term query subsumption for document classification
CN112667805A (en) Work order category determination method, device, equipment and medium
CN111144109B (en) Text similarity determination method and device
CN113780098A (en) Character recognition method, character recognition device, electronic equipment and storage medium
CN111369294B (en) Software cost estimation method and device
CN113361240B (en) Method, apparatus, device and readable storage medium for generating target article
CN117077679B (en) Named entity recognition method and device
CN110472063A (en) Social media data processing method, model training method and relevant apparatus
CN111597336B (en) Training text processing method and device, electronic equipment and readable storage medium
CN111709475A (en) Multi-label classification method and device based on N-grams
CN108830302B (en) Image classification method, training method, classification prediction method and related device
CN113591881B (en) Intention recognition method and device based on model fusion, electronic equipment and medium
CN113704519B (en) Data set determining method and device, computer equipment and storage medium
CN112925913B (en) Method, apparatus, device and computer readable storage medium for matching data
US10007593B2 (en) Injection of data into a software application
CN114443493A (en) Test case generation method and device, electronic equipment and storage medium
CN113032443A (en) Method, apparatus, device and computer-readable storage medium for processing data
RU2549118C2 (en) Iterative filling of electronic glossary

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination