CN111340054A - Data labeling method and device and data processing equipment


Info

Publication number
CN111340054A
CN111340054A
Authority
CN
China
Prior art keywords
data
labeled
classification model
result
classification
Prior art date
Legal status
Pending
Application number
CN201811549912.9A
Other languages
Chinese (zh)
Inventor
冯浩
徐江
王鹏
Current Assignee
Beijing Didi Infinity Technology and Development Co Ltd
Original Assignee
Beijing Didi Infinity Technology and Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Didi Infinity Technology and Development Co Ltd filed Critical Beijing Didi Infinity Technology and Development Co Ltd
Priority to CN201811549912.9A priority Critical patent/CN111340054A/en
Publication of CN111340054A publication Critical patent/CN111340054A/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques

Abstract

The application provides a data labeling method, a data labeling device, and a data processing apparatus. The method includes: performing at least one round of iterative processing on a classification model until the accuracy of the classification model meets a preset condition; and processing at least part of the data to be labeled with the resulting classification model to obtain an automatic labeling result. Each iteration includes: inputting the data to be labeled, other than that already in a target data set, into the classification model to obtain classification results; selecting from these data at least some items whose classification confidence falls within a preset range and adding them to the target data set; and training the classification model according to the manual labeling results of the data to be labeled in the target data set. In this way, automatic labeling of batch data can be achieved while improving the quality of data labeling.

Description

Data labeling method and device and data processing equipment
Technical Field
The present application relates to the field of data processing technologies, and in particular to a data annotation method, a data annotation apparatus, and a data processing device.
Background
With the development of computer technology, machine learning algorithms have become widely used, and supervised learning is among the most common approaches. A supervised learning algorithm usually requires a large amount of labeled data to train a pre-established recognition model, and the quantity and accuracy of that labeled data directly affect the accuracy of the trained recognition model.
At present, labeled data is obtained mainly by adding labels manually, which is inefficient and error-prone, so the accuracy of the finally trained model is low.
Disclosure of Invention
In view of this, an object of the embodiments of the present application is to provide a data annotation method, a data annotation apparatus, and a data processing device that can automatically annotate batch data while improving annotation accuracy.
According to an aspect of the present application, there is provided a data annotation method, the method comprising:
performing at least one round of iterative processing on a preset classification model until the accuracy of the classification model meets a preset condition, so as to obtain a trained classification model;
processing at least one part of the multiple pieces of data to be labeled by adopting the trained classification model to obtain an automatic labeling result;
wherein each iteration process comprises:
inputting other data to be labeled except for a target data set in the multiple data to be labeled into the classification model respectively to obtain respective classification results of the other data to be labeled; according to the confidence degree of the classification result, at least part of data to be labeled with the confidence degree within a preset range is selected from the other data to be labeled and added into the target data set; and training the classification model according to the manual labeling result of the data to be labeled in the target data set.
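The iteration described above is a standard active-learning loop with uncertainty sampling. A minimal, self-contained Python sketch follows; the `ToyModel` class and all names are hypothetical stand-ins for illustration, not part of the patent:

```python
class ToyModel:
    """Hypothetical stand-in for the classification model: each sample's
    confidence is simply a stored score, and fit() records the training set."""
    def __init__(self, scores):
        self.scores = scores        # sample -> confidence of the predicted label
        self.trained_on = []
    def predict(self, sample):
        return "some_label", self.scores[sample]
    def fit(self, samples):
        self.trained_on = list(samples)   # stands in for real training

def iterate_once(model, data, target_set, conf_range, k):
    """One round: classify the data outside the target set, move up to k
    samples whose confidence lies in conf_range (least confident first)
    into the target set, then retrain on the manually labeled target set."""
    low, high = conf_range
    remaining = [d for d in data if d not in target_set]
    eligible = sorted(
        (model.predict(d)[1], d) for d in remaining
        if low <= model.predict(d)[1] <= high
    )
    target_set.extend(d for _, d in eligible[:k])
    model.fit(target_set)   # in practice: train on the human labels for target_set
    return target_set
```

Repeating `iterate_once` until a held-out accuracy threshold is reached corresponds to the "preset condition" on model accuracy in the text.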
In a possible embodiment, the classification result includes a category label and a confidence of the category label;
selecting at least part of data to be labeled with the confidence degree within a preset range from the other data to be labeled according to the confidence degree of the classification result, and adding the data to be labeled into the target data set, wherein the method comprises the following steps:
and selecting data to be labeled with the confidence level lower than a preset threshold value from the other data to be labeled, and adding at least part of the selected data to be labeled into the target data set.
In a possible implementation, adding at least part of the selected data to be labeled to the target dataset includes:
sorting the selected data to be labeled according to the confidence degree of the classification result;
and according to the sorting result, sequentially selecting a preset number of pieces of data to be marked from the selected data to be marked according to the sequence of the confidence degrees from small to large, and adding the data to be marked into the target data set.
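This lowest-confidence-first selection can be sketched as follows (all names are illustrative, not from the patent):

```python
def pick_least_confident(candidates, n):
    """candidates: list of (sample, confidence) pairs below the threshold.
    Returns n samples ordered from smallest confidence upward, so the items
    the model is least sure about are sent for manual labeling first."""
    ranked = sorted(candidates, key=lambda pair: pair[1])
    return [sample for sample, _ in ranked[:n]]
```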
In a possible embodiment, the classification result includes a plurality of class labels and the confidence of each class label, and the sum of the confidence of the class labels is 1;
selecting at least part of data to be labeled with the confidence degree within a preset range from the other data to be labeled according to the confidence degree of the classification result, and adding the data to be labeled into the target data set, wherein the method comprises the following steps:
selecting data to be labeled with a preset classification result from the other data to be labeled, and adding at least part of the selected data to be labeled to the target data set, wherein the preset classification result is a classification result in which the confidence of at least one class label is between 40% and 60%.
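For a multi-class result whose per-class confidences sum to 1, the 40%-60% band singles out samples the model finds ambiguous. A hedged sketch (function name and defaults are illustrative):

```python
def is_ambiguous(confidences, low=0.40, high=0.60):
    """True when at least one class probability falls in [low, high].
    With probabilities summing to 1, such samples sit near a decision
    boundary and are good candidates for manual labeling."""
    assert abs(sum(confidences) - 1.0) < 1e-9, "probabilities must sum to 1"
    return any(low <= c <= high for c in confidences)
```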
In a possible implementation, adding at least part of the selected data to be labeled to the target dataset includes:
randomly selecting a preset number of pieces of data to be marked from the selected data to be marked, and adding the data to be marked into the target data set.
In a possible implementation, each iteration process further includes:
and acquiring the artificial labeling result of at least part of the data to be labeled before training the classification model according to the artificial labeling result of the data to be labeled in the target data set.
In a possible implementation manner, the obtaining of the manual annotation result of the at least part of the data to be annotated includes:
aiming at each data to be marked in at least part of the data to be marked, acquiring a plurality of labels of the data to be marked, which are input by different users;
and selecting the label with the largest occurrence frequency from the plurality of labels, and adding the label to the data to be labeled to obtain the artificial labeling result of the data to be labeled.
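Taking the most frequent label across annotators is a simple majority vote; a sketch (ties resolve to the label seen first, which is how `Counter.most_common` orders equal counts):

```python
from collections import Counter

def majority_label(labels):
    """Return the label entered most often by the different annotators."""
    return Counter(labels).most_common(1)[0][0]
```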
In one possible embodiment, the method further comprises:
storing the manual labeling result of at least part of the data to be labeled into a search engine supporting a visualization tool;
and saving the automatic labeling result into the search engine.
In a possible implementation, before performing the iterative processing on the pre-established classification model for the first time, the method further includes:
determining an empty set as the target data set; or,
and selecting a part of the plurality of data to be labeled as a target data set, and training a pre-established classification model according to an artificial labeling result of the data to be labeled in the target data set to obtain the preset classification model.
In a possible implementation, each iteration process further includes:
after training the classification model according to the manual labeling result of the data to be labeled in the target data set, testing the classification model through a preset test set to obtain test accuracy;
and if the test accuracy meets the preset condition, taking the classification model as the trained classification model.
In a possible implementation manner, processing at least a part of the plurality of pieces of data to be labeled by using the trained classification model includes:
processing each piece of data to be labeled by adopting the trained classification model; or,
and processing other data to be labeled except the target data set in the multiple pieces of data to be labeled by adopting the trained classification model.
According to another aspect of the present application, there is provided a data annotation apparatus, the apparatus comprising:
the training module is used for performing at least one round of iterative processing on a preset classification model until the accuracy of the classification model meets a preset condition, to obtain a trained classification model;
the automatic labeling module is used for processing at least one part of the data to be labeled by adopting the trained classification model to obtain an automatic labeling result;
wherein each iteration process comprises:
inputting other data to be labeled except for a target data set in a plurality of data to be labeled into the classification model respectively to obtain respective classification results of the other data to be labeled; according to the confidence degree of the classification result, at least part of data to be labeled with the confidence degree within a preset range is selected from the other data to be labeled and added into the target data set; and training the classification model according to the manual labeling result of the data to be labeled in the target data set.
In a possible embodiment, the classification result includes a category label and a confidence of the category label;
the training module selects at least part of data to be labeled with the confidence degree within a preset range from the other data to be labeled according to the confidence degree of the classification result, and adds the selected data to the target data set in a mode that:
and selecting data to be labeled with the confidence level lower than a preset threshold value from the other data to be labeled, and adding at least part of the selected data to be labeled into the target data set.
In one possible embodiment, the training module adds at least part of the selected data to be labeled to the target dataset by:
sorting the selected data to be labeled according to the confidence degree of the classification result;
and according to the sorting result, sequentially selecting a preset number of pieces of data to be marked from the selected data to be marked according to the sequence of the confidence degrees from small to large, and adding the data to be marked into the target data set.
In a possible embodiment, the classification result includes a plurality of class labels and the confidence of each class label, and the sum of the confidence of the class labels is 1;
the training module selects at least part of data to be labeled with the confidence degree within a preset range from the other data to be labeled according to the confidence degree of the classification result, and adds the selected data to the target data set in a mode that:
selecting data to be labeled with a preset classification result from the other data to be labeled, and adding at least part of the selected data to be labeled to the target data set, wherein the preset classification result is a classification result in which the confidence of at least one class label is between 40% and 60%.
In one possible embodiment, the training module adds at least part of the selected data to be labeled to the target dataset by:
randomly selecting a preset number of pieces of data to be marked from the selected data to be marked, and adding the data to be marked into the target data set.
In a possible implementation manner, the training module is further configured to, during each iteration, obtain a manual labeling result of at least part of the data to be labeled before training the classification model according to the manual labeling result of the data to be labeled in the target data set.
In a possible implementation manner, the training module obtains the manual annotation result of the at least part of the data to be annotated by:
aiming at each data to be marked in at least part of the data to be marked, acquiring a plurality of labels of the data to be marked, which are input by different users;
and selecting the label with the largest occurrence frequency from the plurality of labels, and adding the label to the data to be labeled to obtain the artificial labeling result of the data to be labeled.
In a possible embodiment, the apparatus further comprises:
the data storage module is used for storing the manual labeling result of at least part of the data to be labeled into a search engine supporting a visualization tool; and saving the automatic labeling result to the search engine.
In a possible embodiment, the apparatus further comprises:
a pre-training module for determining an empty set as the target data set before running the training module; or selecting a part of the plurality of pieces of data to be labeled as a target data set, and training a pre-established classification model according to an artificial labeling result of the data to be labeled in the target data set to obtain the preset classification model.
In a possible implementation manner, the training module is further configured to, during each iteration process, train the classification model according to an artificial labeling result of data to be labeled in the target data set, and then test the classification model through a preset test set to obtain a test accuracy; and if the test accuracy meets the preset condition, taking the classification model as the trained classification model.
In a possible implementation manner, the automatic labeling module is specifically configured to process each piece of data to be labeled by using the trained classification model; or processing other data to be labeled except the target data set in the multiple pieces of data to be labeled by adopting the trained classification model.
According to another aspect of the present application, there is provided a data processing device comprising a processor, a storage medium, and a bus. The storage medium stores machine-readable instructions executable by the processor. When the data processing device runs, the processor communicates with the storage medium through the bus, and the processor executes the machine-readable instructions to perform the steps of the data labeling method described above.
According to another aspect of the present application, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the above-described data annotation method.
Based on any one of the above aspects, the data labeling method, the data labeling device, and the data processing device provided in the embodiments of the present application perform the following steps at least once on the classification model, so that its accuracy meets the preset condition: inputting the data to be labeled other than that in the target data set into the classification model to obtain classification results; selecting, from those data, at least part of the data to be labeled whose classification confidence falls within a preset range and adding it to the target data set; and training the classification model according to the manual labeling results of the data to be labeled in the target data set. The trained classification model then processes at least part of the data to be labeled to obtain an automatic labeling result. Through this design, automatic labeling of batch data can be achieved while improving the quality of data labeling.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and therefore should not be considered limiting of the scope; those of ordinary skill in the art can obtain other related drawings from these drawings without inventive effort.
Fig. 1 is a schematic view of an application scenario of a data processing device according to an embodiment of the present application;
fig. 2 is a schematic hardware structure diagram of a data processing device according to an embodiment of the present application;
fig. 3 is a schematic flowchart of a data annotation method according to an embodiment of the present application;
FIG. 4 is a flow chart of an iterative process provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of a pre-training step provided in an embodiment of the present application;
FIG. 6 is a schematic diagram illustrating an effect of active learning and random labeling according to an embodiment of the present application;
fig. 7 is a schematic flowchart of a process for obtaining a manual annotation result according to an embodiment of the present application;
FIG. 8 is a table of comparison accuracy between active learning and random labeling provided in an embodiment of the present application;
fig. 9 is a block diagram of a data annotation device according to an embodiment of the present application.
Reference numerals: 100 - data processing device; 110 - data annotation apparatus; 111 - training module; 112 - automatic labeling module; 113 - data storage module; 114 - pre-training module; 120 - storage medium; 130 - processor; 140 - system bus; 150 - network port; 160 - I/O interface; 200 - data providing device; 300 - data storage device; 310 - database; 400 - network.
Detailed Description
In order to make the purpose, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it should be understood that the drawings in the present application are for illustrative and descriptive purposes only and are not used to limit the scope of protection of the application. Also, it should be noted that the schematic drawings are not drawn to scale. The flowcharts used in this application illustrate the operations implemented by some embodiments of the present application. It should be understood that the operations of the flow diagrams may be performed out of order, and steps without logical context may be performed in reverse order or simultaneously. One skilled in the art, under the guidance of this application, may add one or more other operations to, or remove one or more operations from, the flowchart.
In addition, the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.
The solutions provided in the present application are set forth in connection with a specific application scenario, an "intelligent customer service system", in order to enable those skilled in the art to use the present application. It should be understood that the intelligent customer service system described herein may be the customer service system of any platform, such as a ride-hailing platform, a courier platform, an online transportation platform, or a service platform for buyer-seller transactions; the present embodiment is not limited in this respect. It will be apparent to those skilled in the art that the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the application. Although the embodiments of the present application are described primarily in the context of an intelligent customer service system, it should be understood that this is merely one exemplary embodiment. The application can be applied to any other scenario that requires a supervised learning algorithm, such as face recognition systems or information recommendation systems.
In the intelligent customer service system, users raise various kinds of questions, such as service inquiries, complaints, and order information changes. Only if the kind of question posed by the user is accurately recognized can an answer satisfying the user's needs be given. Currently, a supervised learning algorithm is usually adopted to train a recognition model to recognize user questions. This requires retrieving a large number of user questions from the intelligent customer service system and adding an accurate category label (i.e., the category to which the question belongs) to each question, thereby obtaining training data for the recognition model. A large amount of training data is required to obtain a high-precision recognition model.
In some embodiments, the above-described operation of adding category labels is typically performed manually. When the data volume to be labeled is large, a large amount of manpower and material resources are consumed, and the accuracy of the labeling result is difficult to ensure. Therefore, the present embodiment provides a data annotation method and device based on active learning, and the scheme provided by the present embodiment will be described in detail below.
Referring to fig. 1, in an application scenario of the present embodiment, a data processing apparatus 100 is provided, where the data processing apparatus 100 may communicate with a data providing apparatus 200 and a data storage apparatus 300 through a network 400 to obtain data to be annotated from the data providing apparatus 200, and store an annotation result of the data to be annotated in the data storage apparatus 300. The data providing device 200 may be any server device providing intelligent customer service, and is capable of providing data to be annotated, such as user question information.
The data storage device 300 may be any electronic device having a storage function. In one example, the data storage device 300 may be a server running a database 310. In another example, the database 310 running on the data storage device 300 may be replaced with a search engine supporting a visualization tool, which may be, for example, an ElasticSearch. The ElasticSearch is a lightweight search engine, can quickly search out required data by customizing search rules, and can visually display the searched data. Based on this, the user can search for a specific tagged result by configuring the search condition of the ElasticSearch. For example, in some application scenarios, when data annotation is performed, all possible category labels cannot be provided, so that there is a deviation in the annotation result, and a part of the annotated data may need to be re-annotated in a subsequent process. In the related art, since the specific data with the deviation cannot be determined, all the labeled data are usually re-labeled, which is costly. By the data storage device 300, the newly added tag, the tag associated with the newly added tag, or the keyword associated with the newly added tag can be used as a search condition to search out the labeled data that needs to be re-labeled.
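A re-labeling search of the kind described could be expressed in the Elasticsearch query DSL, sent to the `_search` endpoint. The index and field names below (`annotations`, `label`, `text`) and the search terms are hypothetical, invented for illustration:

```json
{
  "query": {
    "bool": {
      "should": [
        { "match": { "label": "new_tag" } },
        { "match": { "text": "some keyword tied to the new tag" } }
      ],
      "minimum_should_match": 1
    }
  }
}
```

Such a query returns only the annotated records matching the newly added tag or its associated keywords, so only those records need to be re-labeled rather than the whole data set.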
Alternatively, the data storage device 300 may be a single storage device or a storage cluster (distributed or centralized). The data storage device 300 may include storage media such as mass memory, removable memory, volatile read-write memory, or read-only memory (ROM), or any combination thereof. By way of example, mass storage may include magnetic disks, optical discs, solid-state drives, and the like; removable memory may include flash drives, floppy disks, optical discs, memory cards, zip disks, tapes, and the like; volatile read-write memory may include random access memory (RAM), such as dynamic RAM (DRAM), double data rate synchronous dynamic RAM (DDR SDRAM), static RAM (SRAM), thyristor-based random access memory (T-RAM), zero-capacitor RAM (Z-RAM), and the like; and ROM may include mask ROM (MROM), programmable ROM (PROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), compact disc ROM (CD-ROM), digital versatile disc ROM (DVD-ROM), and the like.
When the data storage device 300 is a storage cluster formed by a plurality of storage devices, the storage medium may be deployed on the plurality of storage devices in a distributed manner.
Optionally, in this embodiment, the data processing device 100, the data providing device 200, and the data storage device 300 may be the same device, or may be different devices, for example, all of which are server devices providing intelligent customer service, and this embodiment is not limited thereto.
Network 400 may be used for the exchange of information and/or data. Network 400 may include a wired network, a wireless network, a fiber-optic network, a telecommunications network, an intranet, the Internet, a local area network (LAN), a wide area network (WAN), a wireless local area network (WLAN), a metropolitan area network (MAN), a public switched telephone network (PSTN), a Bluetooth network, a ZigBee network, or a near-field communication (NFC) network, among others, or any combination thereof. In some examples, network 400 may include one or more network access points. For example, network 400 may include wired or wireless network access points, such as base stations and/or network switching nodes.
Fig. 2 illustrates a schematic diagram of exemplary hardware and software components of a data processing device 100 that may implement the concepts of the present application, according to some embodiments of the present application. For example, the processor 130 may be used on the data processing device 100 and used to perform the functions in the embodiments of the present application.
Alternatively, the data processing apparatus 100 may be a single electronic device, for example, a server, a personal computer, or other special devices, and the data processing apparatus 100 may also be a cluster formed by a plurality of electronic devices, for example, a server cluster formed by a plurality of servers, and the electronic devices in the cluster may implement the functions described in this embodiment in a distributed manner.
For example, the data processing device 100 may include one or more processors 130 for executing computer programs, a system bus 140, a network port 150 connected to a network, and a storage medium 120 of a different form, such as a disk, ROM, RAM, or any combination thereof. Illustratively, the data processing device 100 may also include computer programs stored in ROM, RAM, or other types of non-transitory storage media, or any combination thereof. According to these computer programs, the methods provided by the embodiments of the present application can be implemented. The data processing device 100 may also include Input/Output (I/O) interfaces 160 with other Input/Output devices (e.g., keyboard, display screen).
In some examples, processor 130 may include one or more processing cores (e.g., a single-core processor or a multi-core processor). Merely as examples, the processor 130 may include a central processing unit (CPU), an application-specific integrated circuit (ASIC), an application-specific instruction-set processor (ASIP), a graphics processing unit (GPU), a physics processing unit (PPU), a digital signal processor (DSP), a field-programmable gate array (FPGA), a programmable logic device (PLD), a controller, a microcontroller unit, a reduced instruction set computer (RISC), a microprocessor, or the like, or any combination thereof.
For convenience of illustration, only one processor is described in the data processing apparatus 100, however, it should be noted that the data processing apparatus 100 of the present application may also include a plurality of processors, and thus the steps performed by one processor described in the present application may also be performed by a plurality of processors in combination or individually. For example, if the processor of the data processing device 100 executes steps a and B, it should be understood that steps a and B may also be executed by two different processors together, or executed separately on one processor. For example, a first processor performs step a and a second processor performs step B, or the first processor and the second processor perform steps a and B together.
It should be understood that the configuration shown in Fig. 2 is merely an example, and the data processing device 100 may also include more components than those shown in Fig. 2.
Referring to fig. 3, fig. 3 is a diagram illustrating a data annotation method provided in this embodiment, where the method can be applied to a data processing apparatus 100. The individual steps involved in the method are described in detail below.
Step S110, carrying out at least one iteration process on a preset classification model to ensure that the accuracy of the classification model meets a preset condition, and obtaining the trained classification model.
In this embodiment, the plurality of pieces of data to be labeled may be all pieces of data that currently need to be labeled. The data to be labeled can be user question information acquired from the intelligent customer service system, where one piece of data to be labeled can be at least one statement unit in one piece of user question information; for example, the data to be labeled can be a complete piece of user question information or one statement in a piece of user question information. Of course, the data to be labeled may also be data to be identified obtained from other systems, such as face images in the face recognition system described above.
The classification model may be any Machine learning classification model, such as a Random Forest (Random Forest) model, a FastText (fast text) model, a Support Vector Machine (Support Vector Machine), and so forth. In a possible implementation, the classification model may employ a text classification model, such as a FastText model, considering that the user question information is usually text information, i.e. the data to be labeled is usually text information. In other possible implementations, the classification model may also be a non-text classification model, such as an image classification model, in which case the data to be annotated may be converted into image information that can be recognized by the image classification model.
In the present embodiment, each iteration process includes the steps shown in fig. 4.
Step S41, respectively inputting other data to be labeled, except for a target data set, in the multiple pieces of data to be labeled into the classification model, and obtaining respective classification results of the other data to be labeled.
The other data to be labeled refers to data to be labeled in the plurality of data to be labeled, except the data to be labeled in the target data set.
In a possible implementation, before the first iteration, the classification model may be an untrained model, and correspondingly, the target data set may be an empty set. In this case, in the first iteration, the other data to be labeled is the plurality of pieces of data to be labeled.
In yet another possible implementation, the classification model may be initially trained before the first iteration is performed. In this case, the method may further include the steps as shown in fig. 5 before the step S41 is performed for the first time.
Step S51, selecting a part of the plurality of pieces of data to be labeled as the target data set.
In an optional manner, 2%-5% of the plurality of pieces of data to be annotated may be selected as the target data set. In other optional manners, a greater or lesser proportion of the data to be annotated may also be selected as the target data set.
Step S52, training a pre-established classification model according to the artificial labeling result of the data to be labeled in the target data set to obtain the preset classification model.
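As an illustration only, the random selection in step S51 can be sketched as follows; `seed_target_set`, the fixed random seed, and the default 3% fraction are hypothetical choices for this sketch, not part of the embodiment:

```python
import random

def seed_target_set(samples, fraction=0.03, rng=None):
    """Randomly draw roughly 2%-5% of the pool as the initial target data set."""
    rng = rng or random.Random(0)               # fixed seed for reproducibility
    k = max(1, round(len(samples) * fraction))  # at least one sample
    return rng.sample(samples, k)

pool = [f"q{i}" for i in range(100)]  # 100 pieces of data to be labeled
seed = seed_target_set(pool)          # 3 pieces at the default 3% fraction
```

The seed set is then manually labeled (step S52) before the first iteration begins.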
And step S42, selecting at least part of data to be labeled with the confidence coefficient within a preset range from the other data to be labeled according to the confidence coefficient of the classification result, and adding the selected data to be labeled into the target data set.
In this embodiment, the classification result includes at least one category label and a confidence level of each category label, where the confidence level of any category label represents a probability that the input data to be labeled belongs to a category indicated by the category label. For example, after a certain data x to be labeled is input into the classification model, the confidence of the output class label a is 50%, which indicates that: the probability that the data x to be labeled belongs to the category indicated by the category label a is 50%. Correspondingly, it also means that the classification model has difficulty distinguishing the category of the data x to be labeled.
Alternatively, the preset range may be a range indicating that the classification model has difficulty distinguishing the data to be labeled. For example, when a category label with a confidence of 40%-60% exists in the classification result of any data to be labeled, the classification model may be considered to have difficulty distinguishing that data. For another example, in an application scenario with a high accuracy requirement, the classification model may be considered to have difficulty distinguishing the data to be labeled when a category label with a confidence of 30%-70% exists in its classification result.
In this embodiment, when the classification model is a binary classification model, each classification result usually includes a class label and a confidence of the class label. In this case, step S42 may include the steps of:
selecting data to be annotated with confidence coefficient lower than a preset threshold value from the other data to be annotated;
and adding at least part of the selected data to be marked into the target data set.
The preset threshold may be set according to the requirement of the application scenario on classification accuracy, and may be any value in the range of 55%-70%, for example, 60%.
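A minimal sketch of this screening step for the binary case, assuming the trained model exposes a `(label, confidence)` prediction per sample; all names here are illustrative:

```python
def select_uncertain(samples, classify, threshold=0.6):
    """Return the samples whose predicted confidence falls below `threshold`,
    i.e. the samples the binary classifier finds hard to distinguish."""
    return [s for s in samples if classify(s)[1] < threshold]

# Toy stand-in for the trained model: a fixed (label, confidence) per sample.
predictions = {"q1": ("A", 0.95), "q2": ("B", 0.55),
               "q3": ("A", 0.58), "q4": ("B", 0.80)}
picked = select_uncertain(["q1", "q2", "q3", "q4"], predictions.get)
# picked holds q2 and q3, the low-confidence samples sent for manual labeling
```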
Through the process, the data to be labeled, which are difficult to distinguish by the classification model, can be screened out, the manual labeling is carried out, and then the classification model is trained according to the manual labeling result. This approach is called active learning, and by active learning, the accuracy of the classification model can be improved.
For example, fig. 6(a-c) shows an effect diagram of active learning, where fig. 6(a) shows a plurality of pieces of data to be labeled in an example, the plurality of pieces of data to be labeled include two types of data, and the two types of data are respectively represented by circles and triangles.
In each iteration process, if a random labeling manner is adopted, the part of the data to be labeled shown in fig. 6(a) is manually labeled, a classification model is trained according to an obtained manual labeling result, and then the classification model obtained through training is adopted to classify the data to be labeled in fig. 6(a), so that a classification plane S1 shown in fig. 6(b) can be obtained. Wherein circles and triangles with shaded portions represent artificial labeling results.
In each iteration process, if an active learning manner is adopted, that is, part of the data to be labeled is selected for manual labeling in the manner described in the above step S42, a classification model is trained according to the obtained manual labeling result, and then the obtained classification model is used for classifying the data to be labeled in fig. 6(a), so as to obtain a classification plane S2 shown in fig. 6 (c). Wherein circles and triangles with shaded portions represent artificial labeling results. Combining fig. 6(b) and fig. 6(c), it can be seen that the classification model obtained by the active learning method has higher classification accuracy.
In order to reduce the workload of manual annotation, a part of the data to be annotated, the confidence of which is lower than a preset threshold value, can be selected for manual annotation. Correspondingly, the adding of at least part of the selected data to be labeled to the target dataset can be realized by the following steps:
sorting the selected data to be labeled according to the confidence degree of the classification result;
and according to the sorting result, sequentially selecting a preset number of pieces of data to be marked from the selected data to be marked according to the sequence of the confidence degrees from small to large, and adding the data to be marked into the target data set.
In this embodiment, the preset number may be determined according to the total amount of the data that needs to be labeled, that is, according to the total amount of the plurality of pieces of data to be labeled. For example, the preset number may be 2%-5% of the total amount of the plurality of pieces of data to be labeled, or a greater or lesser proportion.
In an example, if there are 5 pieces of the selected data to be labeled, the confidence degrees of the classification results are 50%, 52%, 59%, 60%, and 54%, respectively, and assuming that the preset number is 3, three pieces of data to be labeled corresponding to three classification results with confidence degrees of 50%, 52%, and 54% may be selected and added to the target data set.
In yet another example, if there are 6 pieces of selected data to be labeled, with classification-result confidences of 50%, 50%, 52%, 58%, 59%, and 60%, respectively, and the preset number is 3, the two pieces of data to be labeled whose classification results have a confidence of 50% and the one piece whose classification result has a confidence of 52% may be selected and added to the target data set.
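The sorting steps above, applied to the first numeric example, can be sketched as follows (function and variable names are illustrative):

```python
def pick_lowest(candidates, preset_number):
    """Sort (sample, confidence) pairs by confidence ascending and take the
    first `preset_number` samples for manual annotation."""
    ranked = sorted(candidates, key=lambda pair: pair[1])
    return [sample for sample, _ in ranked[:preset_number]]

# First example from the text: 5 candidates, preset number 3.
cands = [("a", 0.50), ("b", 0.52), ("c", 0.59), ("d", 0.60), ("e", 0.54)]
chosen = pick_lowest(cands, 3)  # the samples with confidences 50%, 52%, 54%
```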
In this embodiment, when the classification model is a multi-classification model, each classification result usually includes a plurality of class labels and the confidence of each class label, where the sum of the confidences of the plurality of class labels is 1. It should be appreciated that in some cases the confidences output by the multi-classification model may not sum exactly to 1, for example, due to rounding when the model divides three classes; when the deviation between the sum of the confidences and 1 is within a certain range (e.g., 0-0.2), the sum can still be regarded as 1.
In the above case, step S42 may include the steps of:
and selecting the data to be labeled with a preset classification result from the other data to be labeled, and adding at least part of the selected data to be labeled into the target data set.
Wherein the preset classification result is a classification result in which the confidence of at least one classification label is 40% -60%. It should be understood that 40% -60% may be replaced by other ranges, such as 30% -70%.
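A small sketch of this selection rule for a multi-classification model; `has_ambiguous_label` is a hypothetical helper, and the 40%-60% bounds can be swapped for 30%-70% as noted above:

```python
def has_ambiguous_label(confidences, low=0.40, high=0.60):
    """True when at least one class-label confidence falls within [low, high],
    i.e. the classification result matches the preset classification result."""
    return any(low <= c <= high for c in confidences)

# Three-class toy results (confidences sum to roughly 1).
r1 = [0.85, 0.10, 0.05]  # clearly class 0 -> not selected for manual labeling
r2 = [0.45, 0.35, 0.20]  # an ambiguous label exists -> selected
```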
Optionally, in this embodiment, in order to avoid an excessively large number of manual annotations, at least part of the selected data to be annotated is added to the target dataset, and the following steps may be implemented:
randomly selecting a preset number of pieces of data to be marked from the selected data to be marked, and adding the data to be marked into the target data set.
The preset number may be 2% to 5% of the plurality of pieces of data to be labeled, and may also be a greater or lesser number.
And step S43, training the classification model according to the artificial marking result of the data to be marked in the target data set.
In each iteration process, after the at least part of data to be labeled is added to the target data set, a manual labeling result of the at least part of data to be labeled can be obtained. In practical application, because different people have different understandings of the same data to be labeled or the same category label, the manual labeling result of the data to be labeled may have a deviation from the category to which the data to be labeled actually belongs. In this embodiment, multiple users are adopted to label the same data to be labeled, and then a manual labeling result of the data to be labeled is determined based on a voting method (voting), so as to at least partially improve the above problem.
Correspondingly, the manual annotation result of the at least part of the data to be annotated can be obtained through the steps shown in fig. 7.
Step S71, for each to-be-labeled data in the at least part of to-be-labeled data, obtaining a plurality of category labels of the to-be-labeled data input by different users.
And step S72, selecting the category label with the largest occurrence frequency from the plurality of labels, and adding the category label to the data to be labeled to obtain the manual labeling result of the data to be labeled.
In implementation, for each piece of data to be labeled, a plurality of users may each add a category label to it, so that the data to be labeled has a plurality of category labels. As described above, these category labels may differ because different people understand the data differently. Thus, the category label with the largest number of occurrences (i.e., the highest occurrence frequency) among the plurality of category labels may be added to the data to be labeled. The manual labeling result of the data to be labeled comprises the data to be labeled and the category label added to it.
In one example, if 3 users add category label a, category label b, and category label a, respectively, to a certain piece of data x to be labeled, the category label a with the largest number of occurrences (2 times) may be added to the data x to be labeled.
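The voting of steps S71-S72 amounts to a majority vote over the user-supplied labels, which can be sketched with the standard-library `Counter` (a real implementation would also need a tie-breaking rule, which the embodiment does not specify):

```python
from collections import Counter

def vote_label(labels):
    """Majority vote: return the category label that occurs most often
    among the labels entered by different users."""
    return Counter(labels).most_common(1)[0][0]

# Example from the text: annotators give labels a, b, a for data x.
result = vote_label(["a", "b", "a"])  # label a occurs most often (2 times)
```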
Optionally, the number of the plurality of users may be 3 to 5, and certainly may be more, which is not limited in this embodiment.
It should be understood that the manual labeling result mentioned in the above step S52 can also be obtained by using the steps shown in fig. 7.
The same data to be labeled is labeled by a plurality of users, and the manual labeling result of the data to be labeled is determined based on a voting method, so that the classification precision and the generalization capability of the classification model can be significantly improved in subsequent training. Compared with the random labeling mode, a classification model with the same classification precision can be obtained with less training data (that is, fewer manual labeling results).
Optionally, after each training of the classification model, that is, after each execution of step S43, it may be determined whether the accuracy of the classification model meets a preset condition, and if the accuracy meets the preset condition, the current classification model is directly used as the trained classification model. If the preset condition is not met, the next iteration process is continued, that is, the process returns to step S41.
Optionally, in this embodiment, whether the accuracy of the classification model meets a preset condition may be determined by:
testing the classification model through a preset test set to obtain a test accuracy;
and if the test accuracy meets the preset condition, taking the classification model as a trained classification model.
In this embodiment, the test set includes a certain number of manual annotation results of the data to be annotated. Optionally, the data to be labeled in the test set may be selected from the multiple pieces of data to be labeled, or may be data to be labeled, which is obtained from an intelligent customer service system (or another system) and is different from the multiple pieces of data to be labeled.
Optionally, the manual annotation result of the data to be annotated in the test set can be obtained through the steps shown in fig. 7.
The test accuracy may refer to the proportion of the data to be labeled in the test set whose classification result output by the classification model is consistent with the manual labeling result, relative to the total number of pieces of data to be labeled included in the test set. The preset condition may be that the test accuracy reaches a preset value, and the preset value may be set according to the required classification accuracy; it may be any value between 80% and 100%, for example, 90%.
It should be understood that, in the present embodiment, after step S52 is executed, before the first iteration process is performed on the classification model, whether the accuracy of the classification model meets the preset condition may also be determined in the above manner. In practical applications, in order to meet the user requirement, the preset value in the preset condition is usually set to be higher, and therefore, the iterative process is usually performed at least once.
And step S120, processing at least one part of the multiple pieces of data to be labeled by adopting the trained classification model to obtain an automatic labeling result.
Optionally, in an optional implementation manner, the trained classification model may be used to re-label each piece of data to be labeled in the plurality of pieces of data to be labeled, so as to obtain an automatic labeling result.
In another optional implementation manner, in consideration that all the data to be labeled in the target data set have obtained a manual labeling result in a manual manner, the trained classification model may be used to process other data to be labeled except for the target data set, so as to obtain an automatic labeling result. Further, under the condition that the data in the test set is selected from the plurality of pieces of data to be labeled, the trained classification model can be adopted to process the data to be labeled except the target data set and the test set so as to obtain an automatic labeling result.
In this way, most of the plurality of pieces of data to be labeled can be labeled automatically, with higher accuracy than purely manual labeling, so that the accuracy of a machine learning model subsequently trained in a supervised manner based on these labeling results can also be improved to some extent.
The above method is further illustrated by a specific example below.
Taking the intelligent customer service system of an online ride-hailing platform as an example, if the scenario to which a piece of user question information belongs needs to be accurately identified, so that the user question information can be replied to based on the settings of that scenario, the user question information of each scenario needs to be labeled, and the labeled data of each scenario is then used to train a corresponding scenario recognition model. In the complaint progress query scenario, user question information is generally labeled into two categories: statements unrelated to the complaint progress, and statements expressing the user's intention to query the complaint progress.
When the method is implemented, the user data can be obtained from the intelligent customer service system, the reply data of the intelligent customer service system is deleted from the user data, and only the user question information is reserved. If 10000 pieces of user question information are finally obtained, each piece of user question information may be regarded as one piece of data to be labeled in this embodiment, and the 10000 pieces of data to be labeled are the multiple pieces of data to be labeled in this embodiment.
The data annotation method provided in this embodiment may include the following steps to implement annotation on the 10000 pieces of data to be annotated.
Firstly, 500 pieces of user question information are selected from 10000 pieces of user question information, at least 3 category labels which are set for the user question information by people are obtained aiming at each piece of user question information, one category label with the largest occurrence frequency is selected from the category labels as a manual labeling result of the user question information, and then 500 manual labeling results are obtained.
Secondly, the obtained 500 manual labeling results are used as a test set for testing the accuracy of the classification model.
Wherein the test set may be stored in Elasticsearch. Through the above steps, 9500 pieces of unlabeled user question information remain.
Thirdly, 200 pieces of user question information are selected from the remaining 9500 pieces of user question information and added to the target data set. At least 3 category labels which are set for the user question information by people are obtained according to the user question information in the 200 pieces of user question information, and one category label with the largest occurrence frequency is selected from the category labels as a manual labeling result of the user question information, so that 200 manual labeling results are obtained.
Wherein the target data set may be stored in the aforementioned ElasticSearch.
Fourthly, training a pre-established classification model (assumed to be a FastText model) by using 200 manual labeling results in the target data set.
And fifthly, for the current FastText model (namely, the FastText model obtained through the training in step four), respectively inputting the 500 pieces of user question information in the test set into the FastText model to obtain their classification results, determining the user question information whose classification result is consistent with the manual labeling result from the 500 pieces, and calculating the proportion of the determined user question information among the 500 pieces. If the proportion is smaller than the preset value (for example, 90%), the next step is executed. In an actual measurement, the proportion calculated in this step was 55%, which is the accuracy of the current FastText model.
Sixthly, predicting the rest 9300 pieces of user question information by adopting the classification model obtained by training to obtain the category label of each piece of user question information and the probability (namely, the confidence coefficient) that the user question information belongs to the category indicated by the category label.
Seventhly, determining user question information with the probability of the category label lower than 0.6 from 9300 pieces of user question information; and sequencing the determined user question information according to the probability of the category label, and selecting 200 pieces of user question information to be added into the target data set according to the sequencing result and the sequence of the probability of the category label from small to large.
And eighthly, aiming at the 200 pieces of user question information in the step seven, 200 manual labeling results are obtained according to the mode described in the step three.
And ninthly, training the FastText model by adopting the manual labeling results of the 400 pieces of user question information in the target data set.
And tenth, testing the current FastText model (namely the FastText model obtained through the training in the ninth step) by adopting the test set according to the mode in the fifth step. If the accuracy obtained by the test is lower than 90%, the sixth step to the tenth step can be repeatedly executed, and if the accuracy obtained by the test reaches 90%, the current FastText model can be determined as the FastText model after the training is finished.
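Steps six through ten form a loop: predict, pick the least-confident batch, label it manually, retrain, and stop once the test accuracy reaches the preset value. The heavily simplified sketch below replaces the FastText model with a toy 1-D threshold classifier and simulates manual annotation; every name and number is illustrative, not part of the embodiment:

```python
def active_learning(pool, test_set, train, predict, annotate,
                    batch=2, target_acc=0.9, max_rounds=10):
    """Repeat: train on the labeled set, test, and while accuracy is below
    target, move the `batch` least-confident pool samples to the labeled
    set via (simulated) manual annotation."""
    labeled = annotate(pool[:batch])       # initial seed (cf. step three)
    pool = pool[batch:]
    model, acc = None, 0.0
    for _ in range(max_rounds):
        model = train(labeled)             # cf. steps four and nine
        acc = sum(predict(model, x)[0] == y
                  for x, y in test_set) / len(test_set)
        if acc >= target_acc or not pool:  # cf. steps five and ten
            break
        pool.sort(key=lambda x: predict(model, x)[1])  # least confident first
        labeled += annotate(pool[:batch])  # cf. steps seven and eight
        pool = pool[batch:]
    return model, acc

# Toy instantiation: 1-D samples, true class = (x >= 0.5).
def annotate(xs): return [(x, x >= 0.5) for x in xs]        # "manual" labels
def train(labeled):                                         # threshold "model"
    pos = [x for x, y in labeled if y]
    neg = [x for x, y in labeled if not y]
    return ((min(pos) if pos else 1.0) + (max(neg) if neg else 0.0)) / 2
def predict(thr, x): return (x >= thr, 0.5 + abs(x - thr))  # label, confidence

pool = [0.1, 0.2, 0.45, 0.55, 0.9, 0.8, 0.48, 0.52]
tests = annotate([0.05, 0.3, 0.47, 0.53, 0.7, 0.95])
model, acc = active_learning(pool, tests, train, predict, annotate)
```

In practice `train` and `predict` would wrap the FastText training and inference, and `annotate` would collect the voted manual labels described in steps three and eight.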
Referring to fig. 8, fig. 8 is a table of accuracy comparison obtained by the above procedure and based on random labeling. When the user question information in the target data set reaches 2000 pieces, the FastText model is trained through the manual labeling result of the 2000 pieces of user question information, the accuracy of the obtained FastText model can reach 91%, and the FastText model meets preset conditions. Training is carried out in a random labeling mode, 9500 manual labeling results are required for training, and the accuracy can reach 91%. Therefore, by the data annotation method provided by the embodiment of the application, the manual annotation can be reduced by 80% under the condition of improving the accuracy, and the development cycle of the service is shortened.
It should be understood that, while the above example is given for the complaint progress query scenario, the principles and procedures described above also apply to other scenarios, such as service scoring and first-complaint scenarios.
Referring to fig. 9, fig. 9 is a block diagram of a data annotation device 110 provided in an embodiment of the present application, where functions implemented by the data annotation device 110 correspond to steps of the data annotation method. The data annotation device 110 may be understood as the data processing apparatus 100, or the processor 130 of the data processing apparatus 100, or may be understood as a component which is independent from the data processing apparatus 100 and implements the functions of the embodiments of the present application under the control of the data processing apparatus 100. As shown in FIG. 9, the data annotation device 110 can include a training module 111 and an automatic annotation module 112.
The training module 111 is configured to perform at least one iteration on a preset classification model, so that the accuracy of the classification model meets a preset condition, and a trained classification model is obtained.
Wherein each iteration process comprises:
inputting other data to be labeled except for a target data set in a plurality of data to be labeled into the classification model respectively to obtain respective classification results of the other data to be labeled; according to the confidence degree of the classification result, at least part of data to be labeled with the confidence degree within a preset range is selected from the other data to be labeled and added into the target data set; and training the classification model according to the manual labeling result of the data to be labeled in the target data set.
The automatic labeling module 112 is configured to process at least a part of the data to be labeled by using the trained classification model to obtain an automatic labeling result.
Optionally, in this embodiment, the classification result may include a category label and a confidence of the category label. In this case, the training module 111 may select, according to the confidence of the classification result, at least a part of the data to be labeled, which has a confidence in a preset range, from the other data to be labeled, and add the selected part of the data to be labeled to the target data set by:
and selecting data to be labeled with the confidence level lower than a preset threshold value from the other data to be labeled, and adding at least part of the selected data to be labeled into the target data set.
Further, the training module 111 adds at least part of the selected data to be labeled to the target dataset by:
sorting the selected data to be labeled according to the confidence degree of the classification result;
and according to the sorting result, sequentially selecting a preset number of pieces of data to be marked from the selected data to be marked according to the sequence of the confidence degrees from small to large, and adding the data to be marked into the target data set.
Optionally, the classification result may include a plurality of class labels and the confidence of each class label, where the sum of the confidences of the class labels is 1. In this case, the training module 111 may select, according to the confidence of the classification result, at least part of the data to be labeled with the confidence within a preset range from the other data to be labeled, and add the selected data to the target data set, in the following manner:
and selecting the data to be labeled with a preset classification result from the other data to be labeled, and adding at least part of the selected data to be labeled into the target data set.
Wherein the preset classification result is a classification result in which the confidence of at least one class label is 40% -60%.
Further, the training module 111 may add at least part of the selected data to be labeled to the target data set by:
randomly selecting a preset number of pieces of data to be marked from the selected data to be marked, and adding the data to be marked into the target data set.
Optionally, the training module 111 may be further configured to, during each iteration, obtain a manual labeling result of at least part of the data to be labeled before training the classification model according to the manual labeling result of the data to be labeled in the target data set.
Optionally, in this embodiment, the training module 111 may obtain the manual labeling result of at least part of the data to be labeled by:
aiming at each data to be labeled in at least part of the data to be labeled, acquiring a plurality of category labels of the data to be labeled, which are input by different users;
and selecting the category label with the largest occurrence frequency from the plurality of labels, and adding the category label to the data to be labeled to obtain the manual labeling result of the data to be labeled.
Optionally, in this embodiment, the data annotation device 110 may further include a data storage module 113.
The data storage module 113 is configured to store the manual annotation result of the at least part of the data to be annotated in a search engine supporting a visualization tool; and saving the automatic labeling result to the search engine.
Optionally, in this embodiment, the data annotation device 110 may further include a pre-training module 114.
The pre-training module 114 is configured to determine an empty set as the target data set before the training module 111 is executed; or selecting a part of the plurality of pieces of data to be labeled as a target data set, and training a pre-established classification model according to an artificial labeling result of the data to be labeled in the target data set to obtain the preset classification model.
Optionally, in this embodiment, the training module 111 may be further configured to, during each iteration, train the classification model according to a manual labeling result of data to be labeled in the target data set, and then test the classification model through a preset test set to obtain a test accuracy; and if the test accuracy meets the preset condition, taking the classification model as the trained classification model.
Optionally, the automatic labeling module 112 may be specifically configured to process each piece of data to be labeled by using the trained classification model; or processing other data to be labeled except the target data set in the multiple pieces of data to be labeled by adopting the trained classification model.
The various modules described above may be connected or in communication with each other via a wired or wireless connection. The wired connection may include a metal cable, an optical cable, a hybrid cable, etc., or any combination thereof, among others. The wireless connection may comprise a connection in the form of a LAN, WAN, bluetooth, ZigBee, NFC, or the like, or any combination thereof. Two or more modules may be combined into a single module, and any one module may be divided into two or more units.
The embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the steps of the data annotation method are performed.
To sum up, the data labeling method, the data labeling device, and the data processing apparatus provided in the embodiments of the present application perform the following steps at least once on a classification model, so that the accuracy of the classification model meets a preset condition: inputting other data to be labeled, except the target data set, into the classification model respectively to obtain classification results; selecting at least part of the data to be labeled whose classification-result confidences are in a preset range from the other data to be labeled, and adding it to the target data set; and training the classification model according to the manual labeling result of the data to be labeled in the target data set. At least part of the data to be labeled is then processed by the classification model to obtain an automatic labeling result. Through this design, automatic labeling of batch data can be realized while improving data labeling quality.
It is clear to those skilled in the art that, for convenience and brevity of description, reference may be made to the corresponding process in the method embodiments for the specific working process of the apparatus described above, which is not repeated herein. In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative; for example, the division of the modules is merely a logical division, and another division may be used in an actual implementation; for example, a plurality of modules or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the shown or discussed mutual coupling, direct coupling, or communication connection may be an indirect coupling or communication connection between devices or modules through some communication interfaces, and may be in an electrical, mechanical, or other form.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as a stand-alone product, they may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present application, or the part thereof contributing over the prior art, may be embodied in the form of a software product. The software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.
The above description covers only specific embodiments of the present application, but the scope of the present application is not limited thereto. Any changes or substitutions that a person skilled in the art could readily conceive of within the technical scope disclosed in the present application shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (24)

1. A method for annotating data, the method comprising:
performing iterative processing at least once on a preset classification model so that the accuracy of the classification model meets a preset condition, to obtain a trained classification model;
processing at least a part of multiple pieces of data to be labeled by using the trained classification model to obtain an automatic labeling result;
wherein each iteration process comprises:
inputting the data to be labeled other than a target data set among the multiple pieces of data to be labeled into the classification model respectively, to obtain a classification result of each of the other data to be labeled; selecting, from the other data to be labeled according to the confidence of the classification results, at least part of the data to be labeled whose confidence is within a preset range, and adding it into the target data set; and training the classification model according to the manual labeling result of the data to be labeled in the target data set.
2. The method of claim 1, wherein the classification result comprises a class label and a confidence level of the class label;
selecting at least part of data to be labeled with the confidence degree within a preset range from the other data to be labeled according to the confidence degree of the classification result, and adding the data to be labeled into the target data set, wherein the method comprises the following steps:
and selecting data to be labeled with the confidence level lower than a preset threshold value from the other data to be labeled, and adding at least part of the selected data to be labeled into the target data set.
3. The method of claim 2, wherein adding at least some of the selected data to be labeled to the target dataset comprises:
sorting the selected data to be labeled according to the confidence degree of the classification result;
and according to the sorting result, sequentially selecting a preset number of pieces of data to be labeled from the selected data to be labeled in ascending order of confidence, and adding them into the target data set.
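The selection rule of claims 2 and 3 (keep only results whose confidence is below a preset threshold, then take a preset number of them in ascending order of confidence) can be sketched as follows; the function and parameter names are hypothetical illustrations only:

```python
def select_for_manual_labeling(classified, threshold, k):
    """Pick the k least confident samples among those whose classification
    confidence is below the preset threshold.

    `classified` is a list of (sample, confidence) pairs."""
    # Keep only samples whose confidence is below the preset threshold.
    candidates = [(s, c) for s, c in classified if c < threshold]
    # Sort by confidence, ascending, and take the first k.
    candidates.sort(key=lambda pair: pair[1])
    return [s for s, _ in candidates[:k]]
```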
4. The method of claim 1, wherein the classification result comprises a plurality of class labels and a confidence level of each class label, and the sum of the confidence levels of the class labels is 1;
selecting at least part of data to be labeled with the confidence degree within a preset range from the other data to be labeled according to the confidence degree of the classification result, and adding the data to be labeled into the target data set, wherein the method comprises the following steps:
selecting data to be labeled having a preset classification result from the other data to be labeled, and adding at least part of the selected data to be labeled into the target data set; wherein the preset classification result is a classification result in which the confidence of at least one class label is between 40% and 60%.
5. The method of claim 4, wherein adding at least some of the selected data to be labeled to the target dataset comprises:
randomly selecting a preset number of pieces of data to be labeled from the selected data to be labeled, and adding them into the target data set.
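The variant of claims 4 and 5 works on multi-label results whose confidences sum to 1: a sample qualifies when at least one class label's confidence falls in the preset 40%-60% band, and a preset number of qualifying samples are then chosen at random. A hypothetical sketch (names are illustrative, not from the application):

```python
import random

def select_ambiguous(results, k, low=0.40, high=0.60):
    """`results` maps each sample to its per-label confidences (summing
    to 1).  A sample qualifies when at least one label's confidence lies
    in the preset [low, high] band; k qualifying samples are then picked
    at random for manual labeling."""
    candidates = [s for s, probs in results.items()
                  if any(low <= p <= high for p in probs)]
    # random.sample draws without replacement; cap k at the pool size.
    return random.sample(candidates, min(k, len(candidates)))
```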
6. The method according to any one of claims 1-5, wherein each iteration process further comprises:
and acquiring the manual labeling result of the at least part of the data to be labeled before training the classification model according to the manual labeling result of the data to be labeled in the target data set.
7. The method according to claim 6, wherein obtaining the manual annotation result of the at least part of the data to be annotated comprises:
for each piece of data to be labeled in the at least part of the data to be labeled, acquiring a plurality of category labels input by different users for the data to be labeled;
and selecting, from the plurality of category labels, the category label that occurs most frequently, and adding it to the data to be labeled to obtain the manual labeling result of the data to be labeled.
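The majority-vote rule of claim 7 can be sketched in a few lines; the function name is a hypothetical illustration:

```python
from collections import Counter

def manual_labeling_result(user_labels):
    """Return the class label entered most often by different users for
    one piece of data to be labeled."""
    # most_common(1) yields the (label, count) pair with the highest count.
    return Counter(user_labels).most_common(1)[0][0]
```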
8. The method of claim 6, further comprising:
storing the manual labeling result of at least part of the data to be labeled into a search engine supporting a visualization tool;
and saving the automatic labeling result into the search engine.
9. The method according to any of claims 1-5, wherein prior to performing the first iterative processing on the preset classification model, the method further comprises:
determining an empty set as the target data set; or
selecting a part of the multiple pieces of data to be labeled as the target data set, and training a pre-established classification model according to the manual labeling result of the data to be labeled in the target data set to obtain the preset classification model.
10. The method according to any one of claims 1-5, wherein each iteration process further comprises:
after training the classification model according to the manual labeling result of the data to be labeled in the target data set, testing the classification model through a preset test set to obtain test accuracy;
and if the test accuracy meets the preset condition, taking the classification model as the trained classification model.
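The stopping check of claim 10 (test the retrained model on a preset test set and accept it once the test accuracy meets the preset condition) can be sketched as follows, assuming a model with a `predict` method; all names are hypothetical:

```python
def meets_preset_condition(model, test_set, required_accuracy):
    """Score the model on a preset test set of (sample, true_label)
    pairs and report whether the test accuracy meets the preset
    condition."""
    correct = sum(1 for x, y in test_set if model.predict(x) == y)
    return correct / len(test_set) >= required_accuracy
```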
11. The method according to any one of claims 1 to 5, wherein processing at least a part of the plurality of pieces of data to be labeled using the trained classification model comprises:
processing each piece of data to be labeled by using the trained classification model; or
processing, by using the trained classification model, the data to be labeled other than the target data set among the multiple pieces of data to be labeled.
12. A data annotation device, said device comprising:
the training module is used for performing iterative processing at least once on a preset classification model so that the accuracy of the classification model meets a preset condition, to obtain a trained classification model;
the automatic labeling module is used for processing at least one part of the data to be labeled by adopting the trained classification model to obtain an automatic labeling result;
wherein each iteration process comprises:
inputting the data to be labeled other than a target data set among multiple pieces of data to be labeled into the classification model respectively, to obtain a classification result of each of the other data to be labeled; selecting, from the other data to be labeled according to the confidence of the classification results, at least part of the data to be labeled whose confidence is within a preset range, and adding it into the target data set; and training the classification model according to the manual labeling result of the data to be labeled in the target data set.
13. The apparatus of claim 12, wherein the classification result comprises a class label and a confidence level of the class label;
the training module selects, from the other data to be labeled according to the confidence of the classification results, at least part of the data to be labeled whose confidence is within a preset range and adds it into the target data set in the following manner:
and selecting data to be labeled with the confidence level lower than a preset threshold value from the other data to be labeled, and adding at least part of the selected data to be labeled into the target data set.
14. The apparatus of claim 13, wherein the training module adds at least some of the selected data to be labeled to the target dataset by:
sorting the selected data to be labeled according to the confidence degree of the classification result;
and according to the sorting result, sequentially selecting a preset number of pieces of data to be labeled from the selected data to be labeled in ascending order of confidence, and adding them into the target data set.
15. The apparatus of claim 12, wherein the classification result comprises a plurality of class labels and a confidence level of each class label, and a sum of the confidence levels of the class labels is 1;
the training module selects, from the other data to be labeled according to the confidence of the classification results, at least part of the data to be labeled whose confidence is within a preset range and adds it into the target data set in the following manner:
selecting data to be labeled having a preset classification result from the other data to be labeled, and adding at least part of the selected data to be labeled into the target data set; wherein the preset classification result is a classification result in which the confidence of at least one class label is between 40% and 60%.
16. The apparatus of claim 15, wherein the training module adds at least some of the selected data to be labeled to the target dataset by:
randomly selecting a preset number of pieces of data to be labeled from the selected data to be labeled, and adding them into the target data set.
17. The apparatus according to any one of claims 12 to 16, wherein the training module is further configured to, during each iterative process, obtain a manual labeling result of the at least part of the data to be labeled before training the classification model according to the manual labeling result of the data to be labeled in the target data set.
18. The apparatus of claim 17, wherein the training module obtains the manual labeling result of the at least part of the data to be labeled by:
for each piece of data to be labeled in the at least part of the data to be labeled, acquiring a plurality of category labels input by different users for the data to be labeled;
and selecting, from the plurality of category labels, the category label that occurs most frequently, and adding it to the data to be labeled to obtain the manual labeling result of the data to be labeled.
19. The apparatus of claim 17, further comprising:
the data storage module is used for storing the manual labeling result of at least part of the data to be labeled into a search engine supporting a visualization tool; and saving the automatic labeling result to the search engine.
20. The apparatus according to any one of claims 12-16, further comprising:
a pre-training module for determining an empty set as the target data set before the training module runs; or selecting a part of the multiple pieces of data to be labeled as the target data set, and training a pre-established classification model according to the manual labeling result of the data to be labeled in the target data set to obtain the preset classification model.
21. The apparatus according to any one of claims 12 to 16, wherein the training module is further configured to, during each iterative process, after training the classification model according to the manual labeling result of the data to be labeled in the target data set, test the classification model through a preset test set to obtain a test accuracy; and if the test accuracy meets the preset condition, taking the classification model as the trained classification model.
22. The apparatus according to any one of claims 12 to 16, wherein the automatic labeling module is specifically configured to process each piece of data to be labeled by using the trained classification model; or processing other data to be labeled except the target data set in the multiple pieces of data to be labeled by adopting the trained classification model.
23. A data processing apparatus, characterized by comprising a processor, a storage medium, and a bus, wherein the storage medium stores machine-readable instructions executable by the processor; when the data processing apparatus runs, the processor communicates with the storage medium via the bus, and the processor executes the machine-readable instructions to perform the steps of the data annotation method according to any one of claims 1 to 11.
24. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the data annotation method according to any one of claims 1 to 11.
CN201811549912.9A 2018-12-18 2018-12-18 Data labeling method and device and data processing equipment Pending CN111340054A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811549912.9A CN111340054A (en) 2018-12-18 2018-12-18 Data labeling method and device and data processing equipment

Publications (1)

Publication Number Publication Date
CN111340054A (en) 2020-06-26

Family

ID=71185064

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811549912.9A Pending CN111340054A (en) 2018-12-18 2018-12-18 Data labeling method and device and data processing equipment

Country Status (1)

Country Link
CN (1) CN111340054A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111859872A (en) * 2020-07-07 2020-10-30 中国建设银行股份有限公司 Text labeling method and device
CN112036166A (en) * 2020-07-22 2020-12-04 大箴(杭州)科技有限公司 Data labeling method and device, storage medium and computer equipment
CN112183321A (en) * 2020-09-27 2021-01-05 深圳奇迹智慧网络有限公司 Method and device for optimizing machine learning model, computer equipment and storage medium
CN112560988A (en) * 2020-12-25 2021-03-26 竹间智能科技(上海)有限公司 Model training method and device
CN113139072A (en) * 2021-04-20 2021-07-20 苏州挚途科技有限公司 Data labeling method and device and electronic equipment
CN113157170A (en) * 2021-03-25 2021-07-23 北京百度网讯科技有限公司 Data labeling method and device
CN113222547A (en) * 2021-05-17 2021-08-06 北京明略昭辉科技有限公司 Project follow-up method, system, electronic equipment and storage medium
CN113627568A (en) * 2021-08-27 2021-11-09 广州文远知行科技有限公司 Bidding supplementing method, device, equipment and readable storage medium
CN115248831A (en) * 2021-04-28 2022-10-28 马上消费金融股份有限公司 Labeling method, device, system, equipment and readable storage medium
CN115826806A (en) * 2023-02-09 2023-03-21 广东粤港澳大湾区硬科技创新研究院 Auxiliary labeling method, device and system for satellite telemetry data

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9536522B1 (en) * 2013-12-30 2017-01-03 Google Inc. Training a natural language processing model with information retrieval model annotations
CN108009589A (en) * 2017-12-12 2018-05-08 腾讯科技(深圳)有限公司 Sample data processing method, device and computer-readable recording medium
CN108875963A (en) * 2018-06-28 2018-11-23 北京字节跳动网络技术有限公司 Optimization method, device, terminal device and the storage medium of machine learning model
CN108875821A (en) * 2018-06-08 2018-11-23 Oppo广东移动通信有限公司 The training method and device of disaggregated model, mobile terminal, readable storage medium storing program for executing


Similar Documents

Publication Publication Date Title
CN111340054A (en) Data labeling method and device and data processing equipment
US20210342371A1 (en) Method and Apparatus for Processing Knowledge Graph
EP4040401A1 (en) Image processing method and apparatus, device and storage medium
CN112579727B (en) Document content extraction method and device, electronic equipment and storage medium
CN110674312B (en) Method, device and medium for constructing knowledge graph and electronic equipment
CN112163424A (en) Data labeling method, device, equipment and medium
US20200364216A1 (en) Method, apparatus and storage medium for updating model parameter
US20230045330A1 (en) Multi-term query subsumption for document classification
CN112667805A (en) Work order category determination method, device, equipment and medium
CN111144109B (en) Text similarity determination method and device
CN113780098A (en) Character recognition method, character recognition device, electronic equipment and storage medium
CN111369294B (en) Software cost estimation method and device
CN113361240B (en) Method, apparatus, device and readable storage medium for generating target article
CN117077679B (en) Named entity recognition method and device
CN110472063A (en) Social media data processing method, model training method and relevant apparatus
CN111597336B (en) Training text processing method and device, electronic equipment and readable storage medium
CN111709475A (en) Multi-label classification method and device based on N-grams
CN108830302B (en) Image classification method, training method, classification prediction method and related device
CN113591881B (en) Intention recognition method and device based on model fusion, electronic equipment and medium
CN113704519B (en) Data set determining method and device, computer equipment and storage medium
CN112925913B (en) Method, apparatus, device and computer readable storage medium for matching data
US10007593B2 (en) Injection of data into a software application
CN114443493A (en) Test case generation method and device, electronic equipment and storage medium
CN113032443A (en) Method, apparatus, device and computer-readable storage medium for processing data
RU2549118C2 (en) Iterative filling of electronic glossary

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination