CN110543891B - Data labeling method, device, system and storage medium - Google Patents

Publication number
CN110543891B
Authority
CN
China
Prior art keywords
classification
network
layer
data
result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910668500.5A
Other languages
Chinese (zh)
Other versions
CN110543891A (en)
Inventor
程洲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Baidu China Co Ltd
Original Assignee
Baidu China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Baidu China Co Ltd
Priority to CN201910668500.5A
Publication of CN110543891A
Application granted
Publication of CN110543891B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The invention provides a data labeling method, device, system and storage medium. The method comprises the following steps: acquiring data features of the data to be labeled; distributing the data features to the classification networks of each level; obtaining classification results at the different levels through the classification networks according to the data features; and taking the classification result of the classification networks as the data labeling result. The method is suited to data labeling scenarios with a large number of labels, can effectively reduce labor cost, and improves data labeling efficiency and data labeling quality.

Description

Data labeling method, device, system and storage medium
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a data annotation method, apparatus, system, and storage medium.
Background
With the development of computer technology, the data processing capability of artificial intelligence systems keeps growing.
At present, most artificial intelligence systems are built on neural networks, and most neural network systems depend on large amounts of labeled data. A large part of this labeled data is classification data, that is, labels that assign the input to a category. Acquiring classification labels is therefore the first task of many artificial intelligence projects, and the quality of those labels directly determines the analytical capability of the resulting system. In the prior art, raw data is generally classified by having an annotator directly select the correct label from the candidate categories.
However, this approach is only suitable when the number of candidate labels is small. When the number of labels is large, its labeling efficiency and labeling accuracy drop sharply, which degrades the quality of the labeled data.
Disclosure of Invention
The invention provides a data labeling method, device, system and storage medium that are suited to data labeling scenarios with a large number of labels, can effectively reduce labor cost, and improve data labeling efficiency and data labeling quality.
In a first aspect, an embodiment of the present invention provides a data labeling method, including:
acquiring data features of the data to be labeled;
distributing the data features to the classification networks of each level;
obtaining classification results at the different levels through the classification networks according to the data features;
and taking the classification result of the classification networks as the data labeling result.
In one possible design, after obtaining classification results at the different levels through the classification networks according to the data features, the method further includes:
reviewing the classification results of the classification networks of each level in sequence, starting from the classification result of the first-level classification network;
and if the classification results of all levels pass the review, obtaining the classification result of the last-level classification network.
In one possible design, reviewing the classification results of the classification networks of each level in sequence, starting from the classification result of the first-level classification network, includes:
judging whether the classification result of the current-level classification network passes the review;
if the classification result of the current-level classification network passes the review, starting the review of the classification result of the next-level classification network;
if the classification result of the current-level classification network does not pass the review, judging whether the classification result of the current-level classification network belongs to a preset candidate label set;
if it belongs to the preset candidate label set, selecting the correct classification label from the preset candidate label set as the classification result;
and if it does not belong to the preset candidate label set, determining that the classification result does not belong to the category.
In one possible design, the method further includes:
if the classification result of the current-level classification network does not belong to the category, feeding the data features back to the previous-level classification network so as to iteratively train the previous-level classification network until the previous-level classification network outputs a correct classification result.
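The review of a single level described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the names `Review` and `review_level` are hypothetical, the reviewer's judgment is passed in as plain arguments, and "belongs to the preset candidate label set" is read here as "the label the reviewer considers correct is among the current level's candidates".

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class Review:
    status: str            # "passed", "corrected", or "not_in_category"
    label: Optional[str]   # label carried forward to the next level, if any


def review_level(predicted: str, approved: bool,
                 correct_label: Optional[str], candidates: Set[str]) -> Review:
    """Review one level's classification result (hypothetical helper)."""
    if approved:
        # Review passed: the next level's review can start from this label.
        return Review("passed", predicted)
    if correct_label is not None and correct_label in candidates:
        # Review failed, but the right label is in the preset candidate set.
        return Review("corrected", correct_label)
    # The item does not belong under this level at all; in the text, its
    # features are then fed back to the previous-level network for retraining.
    return Review("not_in_category", None)


candidates = {"fish", "amphibian", "reptile", "bird", "mammal"}
passed = review_level("bird", True, None, candidates)
corrected = review_level("fish", False, "bird", candidates)
rejected = review_level("fish", False, None, candidates)
```

A caller would start the next level's review for the first two outcomes and trigger the feedback step for the third.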
In one possible design, reviewing the classification results of the classification networks of each level in sequence, starting from the classification result of the first-level classification network, further includes:
obtaining an estimated accuracy score for the classification result of the current-level classification network, wherein the estimated accuracy score is positively correlated with the number of times that the classification result has been correct;
and if the estimated accuracy score of the classification result of the current-level classification network is greater than a preset threshold, directly skipping the review of the classification result of the current-level classification network.
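One way to realize such a score is sketched below. The text only requires that the score grow with the number of correct results; the smoothed ratio used here, the class name `LevelAccuracy`, and the threshold value 0.85 are assumptions chosen for illustration.

```python
class LevelAccuracy:
    """Tracks how often one level's results were confirmed correct (hypothetical)."""

    def __init__(self) -> None:
        self.correct = 0
        self.total = 0

    def record(self, was_correct: bool) -> None:
        """Record the outcome of one reviewed classification result."""
        self.total += 1
        if was_correct:
            self.correct += 1

    @property
    def score(self) -> float:
        # Smoothed fraction of correct results; starts at 0.5 with no history
        # and rises as more results are confirmed correct.
        return (self.correct + 1) / (self.total + 2)

    def skip_review(self, threshold: float) -> bool:
        """True when the score is greater than the preset threshold."""
        return self.score > threshold


tracker = LevelAccuracy()
initial_score = tracker.score
for _ in range(8):
    tracker.record(True)
skip_after_8_correct = tracker.skip_review(0.85)
tracker.record(False)
skip_after_one_error = tracker.skip_review(0.85)
```

With this choice, a level with no review history is never skipped for any threshold of 0.5 or above, which keeps a human in the loop until the level has earned its score.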
In a second aspect, an embodiment of the present invention provides a data labeling device, including:
an extraction module, used for acquiring data features of the data to be labeled;
a distribution module, used for distributing the data features to the classification networks of each level;
a classification module, used for obtaining classification results at the different levels through the classification networks according to the data features;
and an output module, used for taking the classification result of the classification networks as the data labeling result.
In one possible design, the device further includes a review module, used for:
reviewing the classification results of the classification networks of each level in sequence, starting from the classification result of the first-level classification network;
and if the classification results of all levels pass the review, obtaining the classification result of the last-level classification network.
In one possible design, the review module is further used for:
judging whether the classification result of the current-level classification network passes the review;
if the classification result of the current-level classification network passes the review, starting the review of the classification result of the next-level classification network;
if the classification result of the current-level classification network does not pass the review, judging whether the classification result of the current-level classification network belongs to a preset candidate label set;
if it belongs to the preset candidate label set, selecting the correct classification label from the preset candidate label set as the classification result;
and if it does not belong to the preset candidate label set, determining that the classification result does not belong to the category.
In one possible design, the device further includes a feedback module, used for:
if the classification result of the current-level classification network does not belong to the category, feeding the data features back to the previous-level classification network so as to iteratively train the previous-level classification network until the previous-level classification network outputs a correct classification result.
In one possible design, the review module is further used for: obtaining an estimated accuracy score for the classification result of the current-level classification network, wherein the estimated accuracy score is positively correlated with the number of times that the classification result has been correct;
and if the estimated accuracy score of the classification result of the current-level classification network is greater than a preset threshold, directly skipping the review of the classification result of the current-level classification network.
In a third aspect, the present invention provides a data annotation system, including: a processor and a memory; the memory stores executable instructions of the processor; wherein the processor is configured to perform the data annotation method of any one of the first aspects via execution of the executable instructions.
In a fourth aspect, the present invention provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor, implements the data annotation method of any one of the first aspects.
In a fifth aspect, an embodiment of the present invention provides a program product, where the program product includes: a computer program stored in a readable storage medium, from which the computer program can be read by at least one processor of a server, execution of the computer program by the at least one processor causing the server to perform the data annotation method of any one of the first aspects.
The invention provides a data labeling method, device, system and storage medium, in which data features of the data to be labeled are acquired; the data features are distributed to the classification networks of each level; classification results at the different levels are obtained through the classification networks according to the data features; and the classification result of the classification networks is taken as the data labeling result. The method is suited to data labeling scenarios with a large number of labels, can effectively reduce labor cost, and improves data labeling efficiency and data labeling quality.
Drawings
FIG. 1 is a schematic diagram of an application scenario of the present invention;
FIG. 2 is a flowchart of a data annotation method according to an embodiment of the present invention;
FIG. 3 is a flowchart of a data annotation method according to a second embodiment of the present invention;
FIG. 4 is a schematic diagram of a hierarchical tag network of an animal scene according to an embodiment of the present invention;
FIG. 5 is a schematic flowchart of classification result review according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a data annotation device according to a third embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a data annotation device according to a fourth embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a data annotation system according to a fifth embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Moreover, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The technical solution of the present invention will be described in detail below with specific examples. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments.
With the development of computer technology, the data processing capability of artificial intelligence systems keeps growing. At present, most artificial intelligence systems are built on neural networks, and most neural network systems depend on large amounts of labeled data. A large part of this labeled data is classification data, that is, labels that assign the input to a category, such as image classification, text sentiment classification, and speech dialect classification. Many derived artificial intelligence systems also depend on classification data collected earlier. For example, although neither image segmentation systems nor object detection systems rely directly on an image classification dataset, the pre-trained neural network models they use are trained on classification data, and the same holds for many other neural-network-based artificial intelligence systems. Labeling classification data is therefore the first task of many artificial intelligence projects, and the quality of the labeled data sets the ceiling on what the artificial intelligence system can achieve. At present, raw data is classified by having annotators directly select the correct label from the candidate categories. When the number of candidate labels is small (such as 5), annotator efficiency and labeling quality are acceptable, but when the number of labels is very large (such as 20,000 categories), it is difficult to guarantee efficiency and labeling quality at the same time.
Examples of classification labeling problems with an excessively large number of categories include labeling 10,000 kinds of commodities or 20,000 kinds of animals and plants. The existing approach is only suitable when the number of candidate labels is small; when the number of labels is large, its labeling efficiency and labeling accuracy drop sharply, which degrades the quality of the labeled data.
In order to solve the above technical problems, the invention provides a data labeling method, device, system and storage medium that are suited to data labeling scenarios with a large number of labels, can effectively reduce labor cost, and improve data labeling efficiency and data labeling quality. Fig. 1 is a schematic diagram of an application scenario of the present invention. As shown in Fig. 1, the scheme differs from conventional classification labeling: a conventional classification labeling system usually presents either all candidate categories directly or only a candidate label set guessed by a pre-trained labeling system, so it has no hierarchical relationship and is relatively simple and direct. In the present technical solution, for labeling tasks with a very large number of categories, the categories are first organized hierarchically according to the characteristics of the data to be labeled. Take a specific scenario, for example a labeling task covering 20,000 plants or 10,000 animals, with the animal scenario as the example: first, based on the collected data and the target categories to be labeled, experts are asked to organize the labels into a hierarchical label network, drawing on animal taxonomy and on factors such as taxonomic relationship and appearance. Then, at each level of the classification network, a high-precision simple classifier can be trained. A simple classifier distinguishes only a few classes, for example two, five or ten, so it requires less labeled data than directly training a classifier over all categories. The data features of the data to be labeled are first obtained through a feature extraction network. The data features are then distributed to the classification networks of each level.
The data features are first input into the top-level classification sub-network, which outputs a guessed classification result a. The next-level classification network to enable is determined by the classification result of the previous-level classification network, and the data features are then distributed to that next-level classification network. Repeating these steps distributes the data features to the classification network of each level and outputs a classification result for each level. Finally, the classification result of the last-level classification network is taken as the data labeling result. Unlike a conventional labeling system, in which multiple annotators label the data directly, this technique organizes the labels hierarchically, so the classification result output by the classification network can be reviewed at each level node. The review determines whether the classification result is correct. If the review is passed, the next-level classification network is enabled according to the classification result. If the review is not passed, it is judged whether the classification result belongs to a preset candidate label set; if it belongs to the preset candidate label set, the correct classification label is selected from the preset candidate label set as the classification result; and if it does not belong to the preset candidate label set, it is determined that the classification result does not belong to the category. The data features whose classification result does not belong to the category can be fed back to the previous-level classification network so as to iteratively train the previous-level classification network until it outputs a correct classification result.
For example, in the classification labeling of 20,000 animals and plants, suppose the input data features are those of a red-source chicken. The features are first input into the top-level classification sub-network, a two-class network that distinguishes animals from plants, which gives the classification result "animal". According to this result, the features are distributed to the next-level classification sub-network, a two-class sub-network that distinguishes vertebrates from invertebrates, which gives the classification result "vertebrate". According to this result, the features are distributed to the next-level classification sub-network, a five-class sub-network that distinguishes fish, amphibians, reptiles, birds and mammals, which gives the classification result "bird". These steps are repeated, with each level's classification result determining the next-level classification network to enable, until the features reach the last-level classification sub-network, which distinguishes red-source chickens, pheasants and gray-breast-foot chickens under Phasianidae and gives the classification result "red-source chicken". This last-level classification result, red-source chicken, is taken as the data labeling result.
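The level-by-level routing in this example can be sketched as follows. This is an illustration only: the `Node` class, the `label_path` function and the feature field names are hypothetical, the "classifiers" are toy stand-ins that simply read a field of the feature dictionary rather than trained networks, and the levels between "bird" and the species are omitted for brevity.

```python
from typing import Callable, Dict, List, Optional

Features = Dict[str, str]


class Node:
    """One level's classification sub-network plus its label-to-child routing."""

    def __init__(self, classify: Callable[[Features], str],
                 children: Optional[Dict[str, "Node"]] = None) -> None:
        self.classify = classify
        self.children = children or {}


def label_path(root: Node, features: Features) -> List[str]:
    """Route the features down the hierarchy; return one label per level."""
    path: List[str] = []
    node: Optional[Node] = root
    while node is not None:
        label = node.classify(features)
        path.append(label)
        # The result of this level decides which next-level network is enabled.
        node = node.children.get(label)
    return path


def by_field(name: str) -> Callable[[Features], str]:
    """Toy stand-in for a trained classifier: read one feature field."""
    return lambda features: features[name]


species = Node(by_field("species"))
classes = Node(by_field("class"), {"bird": species})
backbone = Node(by_field("backbone"), {"vertebrate": classes})
root = Node(by_field("kingdom"), {"animal": backbone})

features = {"kingdom": "animal", "backbone": "vertebrate",
            "class": "bird", "species": "red-source chicken"}
path = label_path(root, features)
final_label = path[-1]
```

Here `path` holds the result of every level and `final_label` is the last-level result that would be used as the data labeling result.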
By applying the method, a hierarchical label network can be constructed according to the characteristics of the data to be labeled, and a classification network is then placed at each level of that hierarchical label network. This effectively reduces the labor cost of labeling tasks with a very large number of categories, and improves annotators' efficiency and labeling quality by combining several techniques, such as human-computer cooperation, computer assistance, and an improved human-computer interface. By labeling level by level from simple to complex, the method effectively reduces the demand for high-cost professionals and relieves the existing labeling methods' sole dependence on professional annotators. In addition, the labeling process exploits the fact that a simple neural network can be trained on a small amount of data to perform pre-classification, which reduces labeling difficulty and makes the method suitable for accurate classification over large label sets.
The following describes the technical solutions of the present invention, and how they solve the above technical problems, with specific embodiments. Embodiments of the present invention will be described below with reference to the accompanying drawings.
Fig. 2 is a flowchart of a data annotation method according to an embodiment of the present invention, and as shown in fig. 2, the method in this embodiment may include:
S101, acquiring data features of the data to be labeled.
In this embodiment, the key point of the invention is that the internal relationships among the data labels are taken into account, and the hierarchical label network is constructed by making full use of the hierarchical information contained in the labels. Then, at each level of the classification network, a high-precision simple classifier can be trained. A simple classifier distinguishes only a few classes, for example two, five or ten, so it requires less labeled data than directly training a classifier over all categories. The data to be labeled is therefore judged by multiple neural networks that are trained iteratively: the labeled data is recovered and used for retraining, and the models in the labeling system are updated. Most coarse classification does not need professionals, whereas fine classification urgently needs professionals in the relevant field, and the cost difference between professional and non-professional annotators is significant. Decoupling the labeling process into levels reduces the number of professionals required and makes full use of non-professionals. For example, when classifying plants, no professional is needed to distinguish trees, shrubs, fleshy fruits, flowers and plants, whereas distinguishing the sub-orders of the locust tree among trees demands considerable expertise. In addition, having an annotator make a yes/no judgment is more convenient and easier in operation than having the annotator determine which specific category the data belongs to.
Meanwhile, a neural network trained on a small amount of data can achieve better-than-random accuracy on simple classification, and the annotator only needs to judge whether its result is correct. The annotator therefore does not need to label simple data directly from scratch, which reduces the number of times simple data is labeled and the overall labeling workload. The neural network is used to full effect on simple tasks: relying on a small amount of data, it screens out a large number of simple samples, so that, as far as possible, only complex samples are left to the annotators. In the data labeling process, the cycle of distribution, labeling, recovery, training and distribution again is the technical core of the invention. Therefore, the data features of the data to be labeled are first acquired through the feature extraction network.
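The distribution, labeling, recovery and training cycle can be sketched as follows. This is a toy illustration under stated assumptions: `MajorityClassifier` is a hypothetical stand-in for a classification sub-network that merely memorizes the most common label per feature value, the reviewer is simulated by a lookup table, and the function name `labeling_round` is introduced here for illustration.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, List, Tuple


class MajorityClassifier:
    """Toy stand-in for a sub-network: most common label seen per feature value."""

    def __init__(self, default: str) -> None:
        self.default = default
        self.table: Dict[str, str] = {}

    def train(self, labeled: List[Tuple[str, str]]) -> None:
        counts = defaultdict(Counter)
        for feature, label in labeled:
            counts[feature][label] += 1
        self.table = {f: c.most_common(1)[0][0] for f, c in counts.items()}

    def predict(self, feature: str) -> str:
        return self.table.get(feature, self.default)


def labeling_round(model: MajorityClassifier, batch: List[str],
                   reviewer: Callable[[str], str],
                   pool: List[Tuple[str, str]]) -> int:
    """Run one cycle; return how many predictions the reviewer had to correct."""
    corrections = 0
    for feature in batch:
        guess = model.predict(feature)   # distribution and pre-classification
        truth = reviewer(feature)        # labeling: reviewer confirms or corrects
        if guess != truth:
            corrections += 1
        pool.append((feature, truth))    # recovery of the labeled data
    model.train(pool)                    # retraining on the recovered data
    return corrections


reviewer = {"feathers": "bird", "scales": "fish"}.get  # simulated annotator
model = MajorityClassifier(default="bird")
pool: List[Tuple[str, str]] = []
batch = ["feathers", "scales", "scales"]
first_round = labeling_round(model, batch, reviewer, pool)
second_round = labeling_round(model, batch, reviewer, pool)
```

In this toy run the reviewer corrects two predictions in the first round and none in the second, which mirrors the intent that retraining leaves fewer samples needing human correction.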
S102, distributing the data features to the classification networks of each level.
In this embodiment, the data features are input into the top-level classification sub-network, which outputs a guessed classification result a. The next-level classification network to enable can be determined according to the classification result of the previous-level classification network, and the data features are then distributed to that next-level classification network. Repeating these steps distributes the data features to the classification network of each level and outputs a classification result for each level.
S103, obtaining classification results at the different levels through the classification networks according to the data features.
In this embodiment, a classification result is obtained from each classification network according to the data features, which yields classification results at the different levels. The classification result of the previous-level classification network determines which next-level classification network is enabled.
S104, taking the classification result of the classification networks as the data labeling result.
In this embodiment, the classification result of the last-level classification network may be used as the data labeling result. The classification results of all levels may also be recorded together as the data labeling result. For example, in the classification labeling of 20,000 animals and plants, suppose the input data features are those of a red-source chicken. The features are first input into the top-level classification sub-network, a two-class network that distinguishes animals from plants, which gives the classification result "animal". According to this result, the features are distributed to the next-level classification sub-network, a two-class sub-network that distinguishes vertebrates from invertebrates, which gives the classification result "vertebrate". According to this result, the features are distributed to the next-level classification sub-network, a five-class sub-network that distinguishes fish, amphibians, reptiles, birds and mammals, which gives the classification result "bird". These steps are repeated, with each level's classification result determining the next-level classification network to enable, until the features reach the last-level classification sub-network, which distinguishes red-source chickens, pheasants and gray-breast-foot chickens under Phasianidae and gives the classification result "red-source chicken". This last-level classification result, red-source chicken, is taken as the data labeling result.
As a further optimization, the feature network and the classification sub-networks in the system initially use simple neural network models, and the simple models are gradually replaced by more complex neural network models as labeled data accumulates. The rationale is as follows: a complex model needs a large amount of labeled data to perform well, whereas on a simple classification task a simple model can perform well with relatively little data. As data accumulates, the simple model reaches a performance bottleneck, and it is then replaced by a more complex model to improve the correctness of the estimated classification labels. As another optimization, a threshold is set on the probability score of the estimated class: when the score exceeds the threshold, the data is automatically assigned to the sub-class without confirmation by an annotator and passes directly to the next-level sub-class system.
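Both optimizations can be sketched in a few lines. The function names, the use of a labeled-data count as the trigger for switching models, and the default values 10,000 and 0.95 are assumptions made for illustration; the text does not specify how the bottleneck is detected or what threshold is used.

```python
from typing import Dict, Tuple


def pick_model_tier(num_labeled: int, switch_at: int = 10_000) -> str:
    """Use the simple model until enough labeled data has accumulated.

    A data-count trigger is an assumed proxy for "the simple model has
    reached its performance bottleneck".
    """
    return "simple" if num_labeled < switch_at else "complex"


def route(probs: Dict[str, float], threshold: float = 0.95) -> Tuple[str, bool]:
    """Return (top label, needs_review) for one level's class probabilities."""
    label = max(probs, key=probs.get)
    # Above the threshold the item goes straight to the next level;
    # otherwise an annotator confirms or corrects the label first.
    needs_review = not (probs[label] > threshold)
    return label, needs_review


early_tier = pick_model_tier(500)
late_tier = pick_model_tier(20_000)
confident = route({"bird": 0.97, "fish": 0.03})
uncertain = route({"bird": 0.60, "fish": 0.40})
```

A stricter threshold sends more items to annotators and a looser one sends fewer, so the value is a trade-off between labor cost and the risk of accepting a wrong label without review.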
Different from the traditional labeling system, the auxiliary labeling system can obtain not only the final classification label of the original data but also the hierarchical classification label of the original data after completing the labeling process, and the hierarchical label can still exert value as a data byproduct. Meanwhile, a good hierarchical classification model can be obtained and directly used by continuously upgrading the feature network model and the subclass classification model used in the auxiliary labeling system.
It should be noted that the present invention does not limit the specific use and replacement process of the feature network, nor is it limited to the illustrated plant and animal classification data labeling; it can be applied to any case that has many categories for which a hierarchical relationship can be constructed. In addition, regarding the actual number of labeling operations performed by the labeling personnel: if the threshold optimization scheme is not adopted, the total number of labeling operations increases slightly. However, the system improves labeling efficiency by using the predictions of the trained classification models during labeling, reduces the difficulty of each individual labeling operation, and improves on the conventional labeling interface, so that overall it still achieves an optimization over an ordinary labeling system.
In the embodiment, the data characteristics of the data to be marked are obtained; distributing the data characteristics to each hierarchical classification network; obtaining classification results of different layers through a classification network according to data characteristics; and taking the classification result of the classification network as a data annotation result. The method is suitable for data labeling scenes with a plurality of labels, can effectively reduce the labor cost input, and improves the data labeling efficiency and the data labeling quality.
Fig. 3 is a flowchart of a data annotation method provided in the second embodiment of the present invention, and as shown in fig. 3, the method in the second embodiment may include:
S201, constructing a hierarchical label network according to data to be marked; the hierarchical label network comprises: a feature extraction network and respective hierarchical classification networks.
In this embodiment, the scheme differs from a conventional classification labeling scheme. A conventional classification labeling system usually presents all candidate categories directly, or presents only a candidate label set guessed by a pre-trained labeling system; it has no hierarchical relationship and is relatively simple and straightforward. In this technical scheme, for labeling tasks with a very large number of categories, the categories are first organized hierarchically according to the characteristics of the data to be labeled. Consider a specific scenario, for example a labeling task with 20,000 plant categories and 10,000 animal categories, and take the animal scenario as an example: first, based on the collected data and the target categories to be labeled, an expert is asked to draw on animal taxonomy and organize the labels into a hierarchical label network according to factors such as taxonomy and appearance. Fig. 4 is a schematic view of a hierarchical label network of an animal scene provided in an embodiment of the present invention. As shown in fig. 4, taking animal classification as an example, animals may be classified into vertebrates and invertebrates. The child nodes of vertebrates include: fish, amphibians, reptiles, birds, mammals, and the like. The child nodes of invertebrates include: coelenterates, molluscs, arthropods. The child nodes of birds include: Anseriformes, Galliformes, Gruiformes, Lariformes. The child nodes of Galliformes include: turkeys, grouse, Phasianidae, Cracidae. The child nodes of Phasianidae include: red-source chicken, pheasant, and gray-breast foot chicken.
It should be noted that, when constructing the label hierarchy, the original classification hierarchy of the labels (for example, animal and plant taxonomy) serves as a reference. A neural-network-based classification system, especially an image classification system, relies mainly on pictures, and some criteria used in the original taxonomy cannot be judged from images alone; some categories in the taxonomy should therefore be reasonably merged or reassigned to other categories so that they can be judged and classified more reliably. Nevertheless, animal and plant taxonomy remains a good existing basis for constructing a hierarchy of classification labels. A large category set in other scenarios (such as 10 categories of commodities) can similarly be organized into a multi-level hierarchical relationship diagram.
The hierarchical tag network may be designed manually or may be automatically generated by the system, for example, the hierarchical tag network of similar data may be automatically generated based on a conventional manually designed network.
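One possible way to represent such a hierarchical label network in code is a parent-to-children mapping, from which the number of output classes of each node's classifier and the label path of any category follow directly. The sketch below covers only the upper layers of the animal example; the names `HIERARCHY`, `classifier_sizes` and `path_to` are hypothetical.

```python
from typing import Dict, List

# Parent -> children, following the upper layers of the animal example.
HIERARCHY: Dict[str, List[str]] = {
    "animal": ["vertebrate", "invertebrate"],
    "vertebrate": ["fish", "amphibian", "reptile", "bird", "mammal"],
    "invertebrate": ["coelenterate", "mollusc", "arthropod"],
}


def classifier_sizes(hierarchy: Dict[str, List[str]]) -> Dict[str, int]:
    """Number of output classes needed by the classifier at each internal node."""
    return {parent: len(children) for parent, children in hierarchy.items()}


def path_to(label: str, hierarchy: Dict[str, List[str]]) -> List[str]:
    """Return the chain of labels from the root down to `label`."""
    parent_of = {c: p for p, cs in hierarchy.items() for c in cs}
    chain = [label]
    while chain[-1] in parent_of:
        chain.append(parent_of[chain[-1]])
    return list(reversed(chain))


sizes = classifier_sizes(HIERARCHY)
bird_path = path_to("bird", HIERARCHY)
```

Each internal node then needs only a small classifier (two, five or three classes here) rather than one classifier over all leaf categories.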
S202, acquiring data characteristics of the data to be marked.
And S203, distributing the data characteristics to each layered classification network.
And S204, obtaining classification results of different layers through a classification network according to the data characteristics.
S205, sequentially auditing the classification results of the classification networks of all the layers from the classification result of the first layer of classification network; and if all the layered classification results pass the examination and verification, obtaining the classification result of the last layer of classification network.
In this embodiment, the classification result output by the classification network may be audited, and whether the classification result is correct or not may be determined by auditing the classification result. If the audit is passed, starting the next layer of classification network according to the classification result.
Optionally, judging whether the classification result of the current layer classification network passes the audit; if the classification result of the current layer of classification network passes the verification, starting the verification of the classification result of the next layer of classification network; if the classification result of the current-layer classification network does not pass the examination, judging whether the classification result of the current-layer classification network belongs to a preset candidate label set; if the label belongs to the preset candidate label set, selecting a correct classification label from the preset candidate label set as a classification result; and if the classification result does not belong to the preset candidate label set, determining that the classification result does not belong to the category.
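The three audit outcomes for one layer can be sketched as below. This is a simplified illustration, assuming the reviewer's input is modeled as the label the reviewer considers correct, or `None` when no candidate fits; `Outcome` and `audit` are hypothetical names.

```python
from enum import Enum
from typing import Optional, Sequence, Tuple


class Outcome(Enum):
    PASSED = "passed"                    # prediction confirmed, go to next layer
    CORRECTED = "corrected"              # reviewer picked another candidate label
    NOT_IN_CATEGORY = "not_in_category"  # no candidate label fits


def audit(predicted: str,
          candidates: Sequence[str],
          reviewer_choice: Optional[str]) -> Tuple[Outcome, Optional[str]]:
    """Resolve one layer's audit into an outcome and the label to use, if any."""
    if reviewer_choice == predicted:
        return Outcome.PASSED, predicted
    if reviewer_choice in candidates:
        return Outcome.CORRECTED, reviewer_choice
    return Outcome.NOT_IN_CATEGORY, None


candidates = ["fish", "amphibian", "reptile", "bird", "mammal"]
passed = audit("bird", candidates, "bird")
corrected = audit("bird", candidates, "mammal")
rejected = audit("bird", candidates, None)
```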
Specifically, unlike a conventional labeling system in which multiple persons label directly, this technique processes the labels hierarchically and therefore needs one labeling person at each hierarchical node. Of course, one labeling person may be responsible for multiple nodes, for example when the workload under those nodes is small and the person has the relevant expertise for them. A node may also be labeled by several people, for example a node with a large data volume but a simple task, such as a rough fish/bird/mammal judgment. As labeled data accumulates, the workload of such tasks decreases, because the labeling system judges the correct category automatically and the labeling personnel only need to attend to the small share of misclassified cases. Besides the mapping between personnel and nodes, the system also takes the degree of professional skill and the division of labor into account. Expertise varies from person to person, and training or recruiting people with deep expertise in many fields costs significantly more than finding people skilled in a single field. Therefore, in the hierarchical labeling system, a person skilled in a certain field is assigned to a specific node. That person is not required to determine the category of a picture that was wrongly placed under the category he or she is skilled in, and only needs to feed back that the picture does not belong to the category. This design rests on two common observations: 1. forcing a labeling person to confirm something outside his or her expertise makes labeling the sample very difficult and is likely to produce wrong results that lower the labeling quality; 2. confirming that a sample "does not belong to" the category is much easier for the labeling person, and based on that feedback the system automatically forwards the sample to other categories with high probability and asks the corresponding personnel to confirm, a step that takes the program almost no time. In addition, because of the hierarchical relationship, labeling personnel do not need to be assigned to all hierarchical nodes at once: a small number of people can complete the interactive labeling process layer by layer, following the topology of the hierarchy. Therefore, although the auxiliary labeling technique has many nodes, one per category, personnel can be arranged and scheduled more flexibly.
In an optional embodiment, if the classification result of the current-layer classification network does not belong to the category, the data features are fed back to the previous-layer classification network so that the previous-layer classification network is trained iteratively until it outputs a correct classification result. Specifically, fig. 5 is a schematic flow chart of classification result auditing according to an embodiment of the present invention. As shown in fig. 5, introducing the audit changes the overall labeling flow slightly. A labeling person verifies the classification result; if it is correct, the labeling person simply confirms it with a shortcut key (for example, Enter). If the classification result is incorrect, the labeling person judges whether the data belongs to one of the candidate categories: if so, the labeling person selects that candidate category to complete the labeling; if not, the labeling person selects "does not belong to the series". The automatic auxiliary labeling system collects the choices made by the labeling personnel and uses the data (except data marked as not belonging to the series) to train the classification sub-network, so as to improve the accuracy of its next estimates. In addition, the auxiliary system distributes the data to the next-layer classification sub-network according to the labeling person's decision. If the labeling person selects "does not belong to the series", the labeling system returns the data to the upper-layer classification subsystem, where the corresponding labeling person reconfirms the category to which the data belongs or handles it as a special case.
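The bookkeeping in this audit flow, where confirmed choices become training samples and "does not belong to the series" sends the item back up one layer, might be sketched as follows. The `ReviewQueue` class, its fields and the item identifiers are hypothetical; actual retraining of the sub-networks is outside the sketch.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

NOT_IN_SERIES = None  # hypothetical marker for "does not belong to this series"


class ReviewQueue:
    """Tracks where each item goes next and which decisions become training data."""

    def __init__(self, parent_of: Dict[str, str]):
        self.parent_of = parent_of
        self.training: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
        self.pending: Dict[str, List[str]] = defaultdict(list)

    def record(self, node: str, item_id: str, choice: Optional[str]) -> str:
        """Apply one reviewer decision made at `node`; return the receiving node."""
        if choice is NOT_IN_SERIES:
            # Return the item to the upper layer; the root has no parent,
            # so the item stays there for special handling.
            target = self.parent_of.get(node, node)
        else:
            # Confirmed or corrected labels are kept as training samples
            # for this node's classification sub-network.
            self.training[node].append((item_id, choice))
            target = choice
        self.pending[target].append(item_id)
        return target


queue = ReviewQueue({"vertebrate": "animal", "bird": "vertebrate"})
next_1 = queue.record("vertebrate", "img-1", "bird")      # routed down to "bird"
next_2 = queue.record("bird", "img-2", NOT_IN_SERIES)     # returned to "vertebrate"
```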
By combining human-computer cooperation, computer assistance and an improved human-computer interface, the invention can reduce the labor cost that a labeling task with a very large number of classes requires and improve both the efficiency and the quality of the labeling personnel's work. Its simple-to-complex hierarchical labeling method effectively reduces the demand for high-cost professionals, which improves on the current situation in which existing labeling methods depend on professional labeling personnel.
In another optional implementation manner, the pre-estimated accuracy score of the classification result of the current-layer classification network may be obtained; wherein the estimated accuracy score is positively correlated with the correct times of the classification result; and if the estimated accuracy score of the classification result of the current-layer classification network is greater than a preset threshold value, directly skipping the examination and verification of the classification result of the current-layer classification network.
In this embodiment, the more times the classification result is correct, the higher the estimated accuracy is. By adopting the mode in the embodiment, the auditing process can be accelerated, the number of auditors can be reduced, and the auditing efficiency can be improved.
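A minimal sketch of such a score is given below. The patent only states that the score is positively correlated with the number of correct results; the smoothed fraction of correct results used here, the `AccuracyTracker` class and the threshold 0.9 are assumptions chosen for illustration.

```python
class AccuracyTracker:
    """Per-node running score that rises as more classification results are correct."""

    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold
        self.correct = 0
        self.total = 0

    def update(self, was_correct: bool) -> None:
        """Record the outcome of one audited classification result."""
        self.total += 1
        if was_correct:
            self.correct += 1

    @property
    def score(self) -> float:
        # Smoothed fraction of correct results (assumed formula); it starts
        # at 0.5 and increases with the number of correct results.
        return (self.correct + 1) / (self.total + 2)

    def skip_audit(self) -> bool:
        """True when the score is above the threshold, so the audit can be skipped."""
        return self.score > self.threshold


tracker = AccuracyTracker(threshold=0.9)
initially_skipped = tracker.skip_audit()
for _ in range(20):
    tracker.update(True)
skipped_after = tracker.skip_audit()
```

With no history the audit is always performed; after 20 consecutive correct results the score is 21/22 and the audit for this node is skipped.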
S206, taking the classification result of the classification network as a data labeling result.
In this embodiment, please refer to the related description in steps S101 to S104 in the method shown in fig. 2 for the specific implementation process and technical principle of steps S202 to S204 and S206, which are not described herein again.
In the embodiment, the data characteristics of the data to be marked are obtained through a characteristic extraction network; distributing the data characteristics to each hierarchical classification network; obtaining classification results of different layers through a classification network according to data characteristics; wherein, the classification result of the upper layer of classification network determines the enabled next layer of classification network; and taking the classification result of the last layer of classification network as a data annotation result. The method is suitable for data labeling scenes with a plurality of labels, can effectively reduce the labor cost input, and improves the data labeling efficiency and the data labeling quality.
In addition, the embodiment can also construct a hierarchical label network according to the data to be marked; the hierarchical label network comprises: a feature extraction network and respective hierarchical classification networks. The embodiment can also audit the classification result output by the classification network and process according to the audit result. Therefore, the labor cost investment can be effectively reduced, and the data labeling efficiency and the data labeling quality are improved.
Fig. 6 is a schematic structural diagram of a data annotation device according to a third embodiment of the present invention, and as shown in fig. 6, the data annotation device in this embodiment may include:
the extraction module 31 is configured to obtain data characteristics of data to be marked;
a distribution module 32 for distributing the data characteristics to the respective hierarchical classification networks;
a classification module 33, configured to obtain classification results of different layers through a classification network according to data characteristics;
and the output module 34 is used for taking the classification result of the classification network as a data labeling result.
The data labeling device of this embodiment may execute the technical solution in the method shown in fig. 2, and for the specific implementation process and the technical principle, reference is made to the relevant description in the method shown in fig. 2, which is not described herein again.
In the embodiment, the data characteristics of the data to be marked are obtained; distributing the data characteristics to each hierarchical classification network; obtaining classification results of different layers through a classification network according to data characteristics; and taking the classification result of the classification network as a data labeling result. The method is suitable for data labeling scenes with a plurality of labels, can effectively reduce the labor cost input, and improves the data labeling efficiency and the data labeling quality.
Fig. 7 is a schematic structural diagram of a data annotation device according to a fourth embodiment of the present invention, as shown in fig. 7, the data annotation device according to the present embodiment may further include, on the basis of the device shown in fig. 6:
the building module 35 is configured to build a hierarchical tag network according to the data to be marked; the hierarchical tag network comprises: a feature extraction network and respective hierarchical classification networks.
In one possible design, further comprising: an auditing module 36 for:
the classification results of all the hierarchical classification networks are sequentially examined from the classification result of the first-layer classification network;
and if all the layered classification results pass the examination and verification, obtaining the classification result of the last layer of classification network.
In one possible design, the auditing module 36 is further configured to:
judging whether the classification result of the current layer classification network passes the audit or not;
if the classification result of the current layer of classification network passes the examination, starting the examination of the classification result of the next layer of classification network;
if the classification result of the current-layer classification network does not pass the verification, judging whether the classification result of the current-layer classification network belongs to a preset candidate label set;
if the label belongs to the preset candidate label set, selecting a correct classification label from the preset candidate label set as a classification result;
and if the classification result does not belong to the preset candidate label set, determining that the classification result does not belong to the category.
In one possible design, further comprising: a feedback module 37 for:
and if the classification result of the current layer of classification network does not belong to the category, feeding back the data characteristics to the previous layer of classification network so as to carry out iterative training on the previous layer of classification network until the previous layer of classification network outputs a correct classification result.
In one possible design, the audit module 36 is further configured to:
obtaining the pre-estimated accuracy score of the classification result of the current layer classification network; wherein the estimated accuracy score is positively correlated with the correct times of the classification result;
and if the estimated accuracy score of the classification result of the current-layer classification network is greater than a preset threshold value, directly skipping the examination and verification of the classification result of the current-layer classification network.
The data annotation device in this embodiment may execute the technical solutions in the methods shown in fig. 2 and fig. 3, and the specific implementation process and technical principle of the data annotation device refer to the relevant descriptions in the methods shown in fig. 2 and fig. 3, which are not described herein again.
In the embodiment, the data characteristics of the data to be marked are obtained through a characteristic extraction network; distributing the data characteristics to each hierarchical classification network; obtaining classification results of different layers through a classification network according to data characteristics; wherein, the classification result of the upper layer of classification network determines the enabled next layer of classification network; and taking the classification result of the last layer of classification network as a data annotation result. The method is suitable for data labeling scenes with a plurality of labels, can effectively reduce the labor cost input, and improves the data labeling efficiency and the data labeling quality.
In addition, according to the embodiment, a hierarchical tag network can be constructed according to the data to be marked; the hierarchical label network comprises: a feature extraction network and respective hierarchical classification networks. The embodiment can also audit the classification result output by the classification network and process according to the audit result. Therefore, the labor cost input can be effectively reduced, and the data labeling efficiency and the data labeling quality are improved.
Fig. 8 is a schematic structural diagram of a data annotation system according to a fifth embodiment of the present invention, and as shown in fig. 8, the data annotation system 40 according to this embodiment may include: a processor 41 and a memory 42.
A memory 42 for storing programs; the Memory 42 may include a volatile Memory (RAM), such as a Static Random Access Memory (SRAM), a Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM), and the like; the memory may also comprise a non-volatile memory, such as a flash memory. The memory 42 is used to store computer programs (e.g., applications, functional modules, etc. that implement the above-described methods), computer instructions, etc., which may be stored in partitions in the one or more memories 42. And the above-mentioned computer program, computer instructions, data, etc. can be called by the processor 41.
A processor 41 for executing the computer program stored in the memory 42 to implement the steps of the method according to the above embodiments.
Reference may be made in particular to the description relating to the preceding method embodiment.
The processor 41 and the memory 42 may be separate structures or may be integrated structures integrated together. When the processor 41 and the memory 42 are separate structures, the memory 42 and the processor 41 may be coupled by a bus 43.
In the embodiment, the data characteristics of the data to be marked are obtained through a characteristic extraction network; distributing the data characteristics to each hierarchical classification network; obtaining classification results of different layers through a classification network according to data characteristics; wherein, the classification result of the upper layer of classification network determines the enabled next layer of classification network; and taking the classification result of the last layer of classification network as a data annotation result. The method is suitable for data labeling scenes with a plurality of labels, can effectively reduce the labor cost input, and improves the data labeling efficiency and the data labeling quality.
The data annotation system of this embodiment may execute the technical solutions in the methods shown in fig. 2 and fig. 3, and for the specific implementation process and technical principle, reference is made to the relevant descriptions in the methods shown in fig. 2 and fig. 3, which are not described herein again.
In addition, embodiments of the present application further provide a computer-readable storage medium, in which computer-executable instructions are stored, and when at least one processor of the user equipment executes the computer-executable instructions, the user equipment performs the above-mentioned various possible methods.
In the embodiment, the data characteristics of the data to be marked are obtained through a characteristic extraction network; distributing the data characteristics to each hierarchical classification network; obtaining classification results of different layers through a classification network according to data characteristics; wherein, the classification result of the upper layer of classification network determines the enabled next layer of classification network; and taking the classification result of the last layer of classification network as a data annotation result. The method is suitable for data labeling scenes with a plurality of labels, can effectively reduce the labor cost input, and improves the data labeling efficiency and the data labeling quality.
Computer-readable media include both computer storage media and communication media, the latter including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general purpose or special purpose computer. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an ASIC. Additionally, the ASIC may reside in user equipment. Of course, the processor and the storage medium may also reside as discrete components in a communication device.
Those of ordinary skill in the art will understand that: all or a portion of the steps of implementing the above-described method embodiments may be performed by hardware associated with program instructions. The foregoing program may be stored in a computer-readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (5)

1. A method for annotating data, comprising:
acquiring data characteristics of image data to be marked, wherein the image data to be marked is image data to be classified;
distributing the data characteristics to various hierarchical classification networks;
obtaining classification results of different layers through the classification network according to the data characteristics;
the classification results of the classification networks of all the layers are sequentially checked from the classification result of the first layer of classification network;
if all the layered classification results pass the examination and verification, obtaining the classification result of the last layer of classification network;
taking the classification result of the classification network as an image data annotation result;
before distributing the data characteristics to the respective hierarchical classification networks, further comprising:
according to the characteristics of the image data to be marked, carrying out hierarchical processing on the categories to generate a hierarchical label network;
setting a classification network at each layer of the hierarchical label network;
the method for sequentially auditing the classification results of the classification networks of all the layers from the classification result of the first layer of classification network comprises the following steps:
judging whether the classification result of the current layer classification network passes the audit or not;
if the classification result of the current layer of classification network passes the examination, starting the examination of the classification result of the next layer of classification network;
if the classification result of the current-layer classification network does not pass the examination, judging whether the classification result of the current-layer classification network belongs to a preset candidate label set;
if the label belongs to the preset candidate label set, selecting a correct classification label from the preset candidate label set as a classification result;
if the label does not belong to the preset candidate label set, determining that the classification result does not belong to the category;
if the classification result of the current layer of classification network does not belong to the category, feeding back the data characteristics to the previous layer of classification network so as to carry out iterative training on the previous layer of classification network until the previous layer of classification network outputs a correct classification result;
the method further comprises the following steps:
training a simple classifier for the classification network of each layer, wherein the classification number of the simple classifier is two classification, five classification or ten classification;
and determining the enabled next-layer classification network according to the classification result of the previous-layer classification network.
2. The method according to claim 1, wherein the examining the classification results of the classification networks of the respective hierarchical layers in sequence from the classification result of the first layer classification network further comprises:
obtaining the pre-estimated accuracy score of the classification result of the current layer classification network; wherein the pre-estimated accuracy score is positively correlated with the number of times that the classification result is correct;
and if the estimated accuracy score of the classification result of the current-layer classification network is greater than a preset threshold value, directly skipping the examination and verification of the classification result of the current-layer classification network.
3. A data annotation device, comprising:
the extraction module is used for acquiring data characteristics of image data to be marked, wherein the image data to be marked is image data to be classified;
the distribution module is used for distributing the data characteristics to each hierarchical classification network;
the classification module is used for acquiring classification results of different layers through the classification network according to the data characteristics;
the output module is used for taking the classification result of the classification network as an image data annotation result;
an audit module to:
the classification results of the classification networks of all the layers are sequentially checked from the classification result of the first layer of classification network;
if all the layered classification results pass the examination and verification, obtaining the classification result of the last layer of classification network;
the construction module is used for carrying out layering processing on the categories according to the characteristics of the image data to be marked before distributing the data characteristics to each layered classification network to generate a layered label network; setting a classification network at each hierarchy of the hierarchical label network;
the auditing module is further configured to:
judging whether the classification result of the current-layer classification network passes the audit;
if the classification result of the current-layer classification network passes the audit, starting the audit of the classification result of the next-layer classification network;
if the classification result of the current-layer classification network does not pass the audit, judging whether the classification result of the current-layer classification network belongs to a preset candidate label set;
if it belongs to the preset candidate label set, selecting a correct classification label from the preset candidate label set as the classification result;
if it does not belong to the preset candidate label set, determining that the classification result does not belong to the category;
the feedback module is used for feeding back the data characteristics to the previous layer of classification network if the classification result of the current layer of classification network does not belong to the category, so as to carry out iterative training on the previous layer of classification network until the previous layer of classification network outputs a correct classification result;
the construction module is also used for training a simple classifier for the classification network of each layer, the simple classifier being a two-class, five-class or ten-class classifier;
the classification module is also used for determining the enabled next-layer classification network according to the classification result of the previous-layer classification network.
4. A data annotation system, comprising: a processor and a memory; the memory stores executable instructions of the processor; wherein the processor is configured to perform the data annotation method of claim 1 or 2 via execution of the executable instructions.
5. A computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, is adapted to carry out the data annotation method of claim 1 or 2.
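The layer-by-layer audit recited in the claims above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the patented implementation: the names `LayerResult`, `AuditOutcome`, `audit_layers`, `reviewer`, `scores` and `threshold` are hypothetical, the reviewer is modeled as a callable that returns the correct label, the candidate-set test is applied to that correct label (one possible reading of the claim), the audit simply continues to the next layer after a correction, and the iterative training of the previous-layer network is reduced to reporting which layer should receive the feedback.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Set


@dataclass
class LayerResult:
    layer: int            # 0-based index of the classification layer
    label: str            # label output by that layer's classification network
    candidates: Set[str]  # preset candidate label set for that layer (assumed)


@dataclass
class AuditOutcome:
    labels: List[str]              # labels accepted so far, one per audited layer
    passed: bool                   # True if every layer was accepted or corrected
    feedback_layer: Optional[int]  # previous layer to retrain on failure, if any


def audit_layers(
    results: List[LayerResult],
    reviewer: Callable[[LayerResult], str],
    scores: Dict[int, float],
    threshold: float,
) -> AuditOutcome:
    """Audit layer results in order, starting from the first layer."""
    accepted: List[str] = []
    for res in results:
        # Claim 2: skip the audit when the estimated accuracy score is high enough.
        if scores.get(res.layer, 0.0) > threshold:
            accepted.append(res.label)
            continue
        correct = reviewer(res)  # hypothetical human or automated reviewer
        if correct == res.label:
            # Audit passed: move on to the next layer.
            accepted.append(res.label)
        elif correct in res.candidates:
            # Audit failed, but a correct label exists in the candidate set.
            accepted.append(correct)
        else:
            # The data does not belong to this category: feed it back to the
            # previous layer (None when the first layer itself failed).
            previous = res.layer - 1 if res.layer > 0 else None
            return AuditOutcome(accepted, False, previous)
    return AuditOutcome(accepted, True, None)


# Example with two layers and made-up labels.
results = [
    LayerResult(0, "animal", {"animal", "plant"}),
    LayerResult(1, "cat", {"cat", "dog"}),
]
truth = {0: "animal", 1: "dog"}
outcome = audit_layers(results, lambda r: truth[r.layer], scores={}, threshold=0.9)
print(outcome)
```

In this example the second layer fails the audit but is corrected from its candidate set, so the final label of the last layer is taken from the reviewer. A correct label outside the candidate set would instead stop the audit and point back at the previous layer.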
CN201910668500.5A 2019-07-23 2019-07-23 Data labeling method, device, system and storage medium Active CN110543891B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910668500.5A CN110543891B (en) 2019-07-23 2019-07-23 Data labeling method, device, system and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910668500.5A CN110543891B (en) 2019-07-23 2019-07-23 Data labeling method, device, system and storage medium

Publications (2)

Publication Number Publication Date
CN110543891A CN110543891A (en) 2019-12-06
CN110543891B CN110543891B (en) 2022-07-26

Family

ID=68709793

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910668500.5A Active CN110543891B (en) 2019-07-23 2019-07-23 Data labeling method, device, system and storage medium

Country Status (1)

Country Link
CN (1) CN110543891B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111582366B (en) * 2020-05-07 2023-10-31 清华大学 Image processing method, device and equipment
CN112396026A (en) * 2020-11-30 2021-02-23 北京华正明天信息技术股份有限公司 Fire image feature extraction method based on feature aggregation and dense connection

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103324954A (en) * 2013-05-31 2013-09-25 中国科学院计算技术研究所 Image classification method based on tree structure and system using same
CN108875821A (en) * 2018-06-08 2018-11-23 Oppo广东移动通信有限公司 Training method and device for classification model, mobile terminal, and readable storage medium
CN109614614A (en) * 2018-12-03 2019-04-12 焦点科技股份有限公司 BiLSTM-CRF product name recognition method based on self-attention
CN109614517A (en) * 2018-12-04 2019-04-12 广州市百果园信息技术有限公司 Video classification method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN110543891A (en) 2019-12-06

Similar Documents

Publication Publication Date Title
Schneider et al. Deep learning object detection methods for ecological camera trap data
Van Horn et al. The inaturalist species classification and detection dataset
Liang et al. Alice: Active learning with contrastive natural language explanations
CN110020201B (en) User type automatic labeling system based on user portrait clustering
CN110215216B (en) Behavior identification method and system based on skeletal joint point regional and hierarchical level
CN110348580A (en) Method and apparatus for constructing a GBDT model, and prediction method and apparatus
CN111242948B (en) Image processing method, image processing device, model training method, model training device, image processing equipment and storage medium
CN110543891B (en) Data labeling method, device, system and storage medium
CN111027600B (en) Image category prediction method and device
CN107330452B (en) Clustering method and device
Atanbori et al. Convolutional neural net-based cassava storage root counting using real and synthetic images
CN114972222A (en) Cell information statistical method, device, equipment and computer readable storage medium
Nikanjam et al. Design smells in Deep Learning programs: an empirical study
CN110688471A (en) Training sample obtaining method, device and equipment
CN113705215A (en) Meta-learning-based large-scale multi-label text classification method
KR20200082490A (en) Method for selecting machine learning training data and apparatus therefor
Belharbi et al. Min-max entropy for weakly supervised pointwise localization
CN111339285B (en) BP neural network-based enterprise resume screening method and system
CN113704389A (en) Data evaluation method and device, computer equipment and storage medium
Wigness et al. Efficient label collection for image datasets via hierarchical clustering
Sun et al. Improving and evaluating deep learning models of cellular organization
US11875250B1 (en) Deep neural networks with semantically weighted loss functions
CN113627464B (en) Image processing method, device, equipment and storage medium
CN115357220A (en) Industrial APP development-oriented crowd-sourcing demand acquisition method
Das et al. GOGGLES: Automatic training data generation with affinity coding

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant