CN110781941A - Human-in-the-loop labeling method and device based on active learning - Google Patents


Info

Publication number
CN110781941A
CN110781941A
Authority
CN
China
Prior art keywords
sample data
detection model
target detection
positioning
unlabeled
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201910995320.8A
Other languages
Chinese (zh)
Inventor
周镇镇
李峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Wave Intelligent Technology Co Ltd
Original Assignee
Suzhou Wave Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Wave Intelligent Technology Co Ltd filed Critical Suzhou Wave Intelligent Technology Co Ltd
Priority to CN201910995320.8A priority Critical patent/CN110781941A/en
Publication of CN110781941A publication Critical patent/CN110781941A/en
Withdrawn legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/40 Software arrangements specially adapted for pattern recognition, e.g. user interfaces or toolboxes therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The invention relates to a human-in-the-loop labeling method and device based on active learning, wherein the method comprises the following steps: establishing a target detection model using labeled sample data; inputting unlabeled sample data into the target detection model for testing, so as to calculate the classification uncertainty, localization stability and localization tightness of the unlabeled sample data; calculating a composite score for each unlabeled sample from the classification uncertainty, localization stability and localization tightness; extracting unlabeled sample data whose composite score falls within a predetermined range and labeling it manually; optimizing the target detection model with the manually labeled sample data; and repeating the preceding four steps until the target detection model meets a termination condition, then labeling the remaining unlabeled sample data with the trained target detection model. The method eases the tension between the high quality demanded of image annotation in computer vision and the time and labor the annotation process consumes, so that samples can be labeled more efficiently.

Description

Human-in-the-loop labeling method and device based on active learning
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a human-in-the-loop labeling method and device based on active learning.
Background
With the development of artificial intelligence technology, the application fields of computer vision have grown ever broader, including robotics, automatic driving, intelligent medical treatment and the like. The greatest driving force behind computer vision at present is machine learning, and in particular deep learning, which is the current mainstream. Deep learning can be used for tasks such as target detection, target tracking, image classification and image segmentation, all of which depend on large amounts of annotation information for digital images.
The conventional manual image labeling method, and even the combined manual and automatic labeling methods based on active learning, are still inefficient and consume enormous cost and time; this directly limits the number of labeled images available in the image field and restricts the rapid development of image technology.
To address these defects in the prior art, an optimized labeling method is needed to solve the problems of limited tasks and low efficiency in existing labeling tools.
Disclosure of Invention
In one aspect, in view of the above objectives, the present invention provides a human-in-the-loop labeling method based on active learning, wherein the method comprises the following steps:
establishing a target detection model using labeled sample data;
inputting unlabeled sample data into the target detection model for testing, so as to calculate the classification uncertainty, localization stability and localization tightness of the unlabeled sample data;
calculating a composite score for each unlabeled sample from the classification uncertainty, localization stability and localization tightness;
extracting unlabeled sample data whose composite score falls within a predetermined range and labeling it manually;
optimizing the target detection model with the manually labeled sample data;
and repeating the preceding four steps until the target detection model meets a termination condition, then labeling the remaining unlabeled sample data with the trained target detection model.
According to an embodiment of the active-learning-based human-in-the-loop labeling method of the present invention, the establishing a target detection model using labeled sample data further comprises:
collecting the labeled sample data and preprocessing it;
and establishing the target detection model using the preprocessed sample data.
According to an embodiment of the active-learning-based human-in-the-loop labeling method of the present invention, the preprocessing at least comprises scaling, equalization and normalization.
According to an embodiment of the active-learning-based human-in-the-loop labeling method of the present invention, the establishing a target detection model using labeled sample data further comprises:
using the Conv1 to Conv5 layers of VGG16 as the backbone feature extraction network of a Faster R-CNN framework;
selecting a plurality of anchor scales and a plurality of anchor aspect ratios;
Conv6 uses 512 3×3 convolution kernels with zero padding and a stride of 1;
Conv7 uses 512 5×5 convolution kernels with zero padding and a stride of 1;
and setting separate intersection-over-union (IoU) thresholds for non-maximum suppression (NMS) on the training (labeled) data and on the test (unlabeled) data.
According to an embodiment of the active-learning-based human-in-the-loop labeling method of the present invention, the classification uncertainty is one minus the highest class probability among the class predictions for a given target box of the unlabeled data, and is calculated as:
U_B(B) = 1 - P_max(B).
According to an embodiment of the active-learning-based human-in-the-loop labeling method of the present invention, the localization tightness is the degree to which the predicted target box of the unlabeled data fits the corresponding candidate region fed into the final classifier, and is calculated as:
T(B_j^i) = IoU(B_j^i, R_j^i),
where T(B_j^i) is the localization tightness of the j-th predicted target box B_j^i, and R_j^i is the candidate region that, input to the final classifier, generates B_j^i.
According to an embodiment of the active-learning-based human-in-the-loop labeling method of the present invention, the localization stability is the degree to which the localization of a target box of the unlabeled data remains stable under added noise, and is calculated as:
S_B(B_j) = (1/N) * Σ_{n=1}^{N} IoU(B_j, C_n(B_j)),
where B_j is a target box, C_n(B_j) is the corresponding box detected at noise level n, and N is the number of noise levels.
According to an embodiment of the active-learning-based human-in-the-loop labeling method, the composite score is a weighted sum of the classification uncertainty, the localization stability and the localization tightness.
According to an embodiment of the active-learning-based human-in-the-loop labeling method, the termination condition is a predetermined number of active-learning cycles and/or a predetermined IoU target.
In another aspect, the present invention further provides a human-in-the-loop labeling device based on active learning, wherein the device includes:
at least one processor; and
a memory storing processor-executable program instructions that, when executed by the processor, carry out the steps of the method of any of the preceding embodiments.
By adopting the above technical scheme, the invention has at least the following beneficial effects: it eases the tension between the high quality demanded of image annotation in computer vision and the time and labor the annotation process consumes; the active learning method effectively reduces the amount of labeled data that deep learning requires while achieving detection performance comparable to training with far more labeled images; and computing a composite score for the unlabeled samples from the classification uncertainty, localization stability and localization tightness makes it possible to select the unlabeled samples most in need of manual labeling, so that samples are labeled more effectively.
The aspects of the embodiments provided here should not be used to limit the scope of the present invention. Other embodiments in accordance with the techniques described herein will be apparent to one of ordinary skill in the art upon study of the following figures and detailed description, and are intended to be included within the scope of the present application.
Embodiments of the invention are explained and described in more detail below with reference to the drawings, but they should not be construed as limiting the invention.
Drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed to describe the prior art and the embodiments are briefly introduced below. Parts of the drawings are not necessarily drawn to scale; related elements may be omitted, or in some cases the scale may be exaggerated, in order to emphasize and clearly show the novel features described herein. In addition, the structural order may be arranged differently, as is known in the art.
FIG. 1 shows a schematic block diagram of an embodiment of the human-in-the-loop annotation method based on active learning according to the invention;
FIG. 2 shows a schematic diagram of the active learning process according to another embodiment of the human-in-the-loop labeling method based on active learning of the present invention.
Detailed Description
While the present invention may be embodied in various forms, there is shown in the drawings and will hereinafter be described some exemplary and non-limiting embodiments, with the understanding that the present disclosure is to be considered an exemplification of the invention and is not intended to limit the invention to the specific embodiments illustrated.
Fig. 1 shows a schematic block diagram of an embodiment of the human-in-the-loop labeling method based on active learning according to the present invention. In the embodiment shown in Fig. 1, the method comprises at least the following steps:
S10: establishing a target detection model using labeled sample data;
S20: inputting unlabeled sample data into the target detection model for testing, so as to calculate the classification uncertainty, localization stability and localization tightness of the unlabeled sample data;
S30: calculating a composite score for each unlabeled sample from the classification uncertainty, localization stability and localization tightness;
S40: extracting unlabeled sample data whose composite score falls within a predetermined range and labeling it manually;
S50: optimizing the target detection model with the manually labeled sample data;
S60: repeating steps S20 to S50 until the target detection model meets a termination condition, then labeling the remaining unlabeled sample data with the trained target detection model.
To overcome the defects in the prior art, step S10 first establishes a target detection model using labeled sample data; this is the model later used both to label sample data and to test it. The target detection model is not perfect when first established, so it needs repeated further training. Therefore, in step S20, unlabeled sample data is input into the target detection model for testing, to calculate the classification uncertainty, localization stability and localization tightness of the unlabeled sample data. Step S30 then calculates the composite score of each unlabeled sample from the classification uncertainty, localization stability and localization tightness obtained in step S20. The tested unlabeled samples are sorted by composite score, and step S40 extracts the unlabeled sample data whose composite score falls within a predetermined range for manual labeling, usually the samples ranked at the front of the ordering. Step S50 optimizes the current target detection model with the manually labeled sample data. Steps S20 to S50 are executed in a loop until the target detection model meets the termination condition, after which step S60 labels the remaining unlabeled sample data with the trained target detection model. This completes both the training of the target detection model and the labeling of the sample data based on it.
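The S10 to S60 cycle can be sketched in a few lines of Python. This is a minimal illustration only: `train`, `score` and `annotate` are hypothetical caller-supplied callbacks standing in for detector training, composite scoring and the human labeling step, and the 1% selection ratio follows step 5 of the detailed embodiment below.

```python
def active_learning_loop(labeled, unlabeled, train, score, annotate,
                         max_cycles=10, select_ratio=0.01):
    """Sketch of the S10-S60 cycle. `train(model, labeled)` fits/updates the
    detector, `score(model, sample)` returns a composite score, and
    `annotate(sample)` stands in for the human labeling step."""
    model = train(None, labeled)                  # S10: initial detector
    for _ in range(max_cycles):                   # termination: cycle budget
        if not unlabeled:
            break
        # S20-S30: score every unlabeled sample and sort ascending
        ordered = sorted(unlabeled, key=lambda s: score(model, s))
        k = max(1, int(len(ordered) * select_ratio))
        picked, unlabeled = ordered[:k], ordered[k:]   # S40: leading fraction
        labeled = labeled + [annotate(s) for s in picked]
        model = train(model, labeled)             # S50: retrain with new labels
    return model, labeled, unlabeled
```

With real components plugged in, `train` would fine-tune the Faster R-CNN detector and `score` would combine the three criteria described below.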
In some embodiments of the active-learning-based human-in-the-loop labeling method of the present invention, the step S10 of building a target detection model using the labeled sample data further includes: collecting the labeled sample data and preprocessing it; and establishing the target detection model using the preprocessed sample data. In further embodiments, the preprocessing includes at least scaling, equalization and normalization.
In one or more embodiments of the active-learning-based human-in-the-loop labeling method of the present invention, the step S10 of building a target detection model using the labeled sample data further includes:
using the Conv1 to Conv5 layers of VGG16 as the backbone feature extraction network of a Faster R-CNN framework;
selecting a plurality of anchor scales and a plurality of anchor aspect ratios;
Conv6 uses 512 3×3 convolution kernels with zero padding and a stride of 1;
Conv7 uses 512 5×5 convolution kernels with zero padding and a stride of 1;
and setting separate intersection-over-union (IoU) thresholds for non-maximum suppression (NMS) on the training (labeled) data and on the test (unlabeled) data.
That is, in these embodiments, the object detection model uses the Faster R-CNN framework with the above modifications. The last two pooling layers of the backbone feature extraction network VGG16 (conv1-conv5) used by Faster R-CNN are removed, which increases the proportion of positive samples among the candidate targets, since the targets are small and sparse in the images. Furthermore, the anchor configuration preferably uses five different anchor scales and three different anchor aspect ratios. In addition, when setting the NMS IoU thresholds for the training labeled data and the test unlabeled data, the threshold for the training data should generally be greater than that for the test data, preferably 0.7 and 0.3 respectively.
In several embodiments of the active-learning-based human-in-the-loop labeling method of the present invention, the classification uncertainty is one minus the highest class probability predicted for a given target box of the unlabeled data, calculated as:
U_B(B) = 1 - P_max(B).
When the probability of one class is close to 1.0, the probabilities of the other classes are necessarily low, indicating that the detector is confident about the class; conversely, when several classes have similar probabilities, each probability is necessarily low, since the class probabilities sum to 1. On this basis, for a specific i-th picture I_i, the image-level classification uncertainty U_C(I_i) can be calculated as the largest classification uncertainty over all target boxes.
In some embodiments of the active-learning-based human-in-the-loop labeling method of the present invention, the localization tightness is the degree to which the predicted target box of the unlabeled data fits the corresponding candidate region fed into the final classifier, and is calculated as:
T(B_j^i) = IoU(B_j^i, R_j^i),
where T(B_j^i) is the localization tightness of the j-th predicted target box B_j^i, and R_j^i is the candidate region that, input to the final classifier, generates B_j^i. A candidate region is a box, obtained by selective search or by a Region Proposal Network (RPN), that may contain a foreground target. Because target detection must not only classify the targets in a picture but also localize them, and the position and scale of a target box are continuously adjusted during network training, the quality of a box can further be measured by its localization stability.
In several embodiments of the active-learning-based human-in-the-loop labeling method of the present invention, the localization stability is the degree to which the localization of a target box of unlabeled data remains stable under added noise, and is calculated as:
S_B(B_j) = (1/N) * Σ_{n=1}^{N} IoU(B_j, C_n(B_j)),
where B_j is a target box, C_n(B_j) is the corresponding box detected at noise level n, and N is the number of noise levels applied to the picture.
For a given image I_i, the localization stability is calculated as:
S_I(I_i) = Σ_{j=1}^{M} P_max(B_j) S_B(B_j) / Σ_{j=1}^{M} P_max(B_j),
where M denotes the number of reference target boxes, and each reference box is weighted by the probability of its highest-scoring category, so as to favor boxes with higher confidence.
In one or more embodiments of the active-learning-based human-in-the-loop labeling method of the present invention, the composite score is a weighted sum of the classification uncertainty, localization stability and localization tightness, calculated as:
F(I_i) = α U_C(I_i) + β T_I(I_i) + γ S_I(I_i),
where the weights α, β and γ preferably all take the value 1.
In some embodiments of the active-learning-based human-in-the-loop labeling method of the invention, the termination condition is a predetermined number of active-learning cycles and/or a predetermined IoU criterion. When the number of active learning cycles reaches the set value, or the detection IoU of the target detection model on the validation set meets the set target, the model is considered sufficiently mature and the training process is terminated.
To facilitate understanding of the technical solutions of the embodiments of the present invention, they are described in more detail below by way of example; the described embodiments are only some of the embodiments of the present invention. Fig. 2 shows a schematic diagram of the active learning process of a further embodiment of the active-learning-based human-in-the-loop labeling method according to the present invention. The main implementation process comprises: the target detector performs target classification and localization using the already collected labeled data; images in the unlabeled sample pool are screened for subsequent manual labeling; the screened images are delivered to annotators, who add target boxes and target categories to form label files; the annotated images are added to the labeled training set; and the original detector continues to be trained with the new labeled training set.
Further, in this embodiment, the active-learning-based human-in-the-loop labeling method according to the present invention more specifically comprises the following steps and sub-steps:
and step 0, firstly, collecting the data of the existing label, and preprocessing the data to reduce the influence of the data on network training as much as possible because the quality of the data directly influences the effect and the precision of a subsequent target detection algorithm, wherein the preprocessing flow of the data comprises scale scaling, equalization and normalization.
Step 1, train a deep learning model with the existing labeled data, using the Faster R-CNN framework:
Step 1.1, initialize with weights pre-trained on the ImageNet dataset;
Step 1.2, read in the dataset through the data generation module to produce the batches required for mini-batch training, select VGG16 as the backbone feature extraction network of Faster R-CNN to extract feature maps of the images, and modify the VGG16 network as follows:
a) remove the last two pooling layers of VGG16 (conv1-conv5) to raise the proportion of positive samples among the candidate targets, since targets are small and sparse in the images;
b) Conv6 uses 512 3×3 convolution kernels with zero padding and a stride of 1;
c) Conv7 uses 512 5×5 convolution kernels with zero padding and a stride of 1;
Step 1.3, propagate the high-dimensional image features generated in step 1.2 forward to produce still higher-dimensional features;
Step 1.4, use the RPN to rapidly extract candidate regions and region scores, modifying the anchor settings of the original algorithm to use five anchor scales (16, 24, 32, 48, 96) and three anchor aspect ratios (1:2, 1:1, 2:1);
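The anchor configuration of step 1.4 can be illustrated as follows. This is a sketch: it assumes the common Faster R-CNN convention that each anchor keeps an area of scale squared while the aspect ratio h/w varies, a detail the patent does not spell out.

```python
import itertools

def make_anchors(scales=(16, 24, 32, 48, 96), ratios=(0.5, 1.0, 2.0)):
    """Generate one (w, h) pair per scale/ratio combination. Assumes the
    usual Faster R-CNN convention of constant area scale**2 per anchor."""
    anchors = []
    for s, r in itertools.product(scales, ratios):
        w = s / r ** 0.5          # so that w * h == s * s
        h = s * r ** 0.5          # and h / w == r
        anchors.append((round(w, 2), round(h, 2)))
    return anchors
```

With five scales and three ratios this yields the 15 anchor shapes placed at every feature-map position.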
Step 1.5, calculate the scaling and translation factors of the prediction box, adjusting the original Faster R-CNN formulas as follows:
t_w = min(log(w/w_a), log(1000/16))
t_h = min(log(h/h_a), log(1000/16))
t_x = (x - x_a)/w_a
t_y = (y - y_a)/h_a
where x, y, w, h denote the center horizontal and vertical coordinates, width and height of the prediction box; x_a, y_a, w_a, h_a denote those of the anchor; and t_x, t_y, t_w, t_h denote the translation factors of the horizontal and vertical coordinates and the scaling factors of the prediction box, respectively;
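The four formulas above translate directly into code. This is a sketch, not the full training pipeline; boxes and anchors are assumed to be given as (cx, cy, w, h) tuples, and the log terms are clipped at log(1000/16) as stated.

```python
import math

def regression_targets(box, anchor, clip=math.log(1000.0 / 16.0)):
    """Compute (t_x, t_y, t_w, t_h) of a prediction box against an anchor,
    clipping the log-scale terms as in step 1.5."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    t_x = (x - xa) / wa                # horizontal translation factor
    t_y = (y - ya) / ha                # vertical translation factor
    t_w = min(math.log(w / wa), clip)  # clipped width scaling factor
    t_h = min(math.log(h / ha), clip)  # clipped height scaling factor
    return t_x, t_y, t_w, t_h
```

Per step 1.6, the same function applies unchanged to the ground-truth (calibration) boxes.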
Step 1.6, calculate the scaling and translation factors of the ground-truth (calibration) box using the same formulas;
Step 1.7, correct the position of the detection target with the translation and scaling factors to obtain candidate boxes, adjusting the IoU threshold of NMS to 0.7 during training;
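For illustration, a minimal greedy NMS with a configurable IoU threshold (0.7 during training per step 1.7, 0.3 during testing per step 2) might look like this. It is a generic sketch, not the patent's exact implementation.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.7):
    """Greedy NMS: keep each box unless it overlaps an already kept,
    higher-scoring box by at least `iou_threshold`. Returns kept indices."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_threshold for j in keep):
            keep.append(i)
    return keep
```

Lowering the threshold at test time (0.3) suppresses more overlapping detections than during training (0.7).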
Step 1.8, input the feature map from step 1.3 and the candidate regions from step 1.7 into the RoI Pooling layer together to generate high-dimensional features for the corresponding regions;
Step 1.9, pass the high-dimensional features of the corresponding regions through three fully connected layers to output the target bounding boxes (bbox) and scores.
Step 2, input the large amount of unlabeled sample data into the trained deep learning model for testing, with the IoU threshold of NMS adjusted to 0.3 during testing.
Step 3, calculate the classification uncertainty, localization stability and localization tightness of the unlabeled samples:
One of the tasks of target detection is to classify the targets in the image, and the classification uncertainty measures how sure the current detector's trained model is about a target's class. Given a target box B, the classification uncertainty is calculated as:
U_B(B) = 1 - P_max(B),
where P_max(B) is the highest class probability predicted for the target box. When the probability of one class is close to 1.0, the probabilities of the other classes are necessarily low, indicating that the detector is confident about the class; conversely, when several classes have similar probabilities, each probability is necessarily low, since the class probabilities sum to 1. On this basis, for a specific i-th picture I_i, the image-level classification uncertainty U_C(I_i) can be calculated as the maximum classification uncertainty over all target boxes, i.e.
U_C(I_i) = max_B U_B(B).
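The two formulas above reduce to a few lines (a sketch; class probabilities are assumed to be post-softmax values summing to 1):

```python
def box_uncertainty(class_probs):
    """U_B(B) = 1 - P_max(B): small when the detector is confident."""
    return 1.0 - max(class_probs)

def image_uncertainty(per_box_probs):
    """U_C(I) = max over all target boxes of U_B(B)."""
    return max(box_uncertainty(p) for p in per_box_probs)
```

A single ambiguous box is enough to give the whole image a high classification uncertainty.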
Since the labels of the unlabeled images are unknown, after the RPN outputs candidate regions it is necessary to estimate whether these candidate regions are likely to contain foreground objects. Besides classifying the targets in an image, target detection must also localize them, and whether the localization is accurate is measured with the localization tightness and the localization stability.
For a given candidate target box, the localization tightness is calculated as follows:
T(B_j^i) = IoU(B_j^i, R_j^i),
where T(B_j^i) denotes the localization tightness of the j-th predicted target box B_j^i, and R_j^i denotes the candidate region that, input to the final classifier, generates B_j^i.
The score of each target box is defined as J, given by:
J(B_j^i) = |T(B_j^i) + P_max(B_j^i) - 1|.
Multiple predicted target boxes are generated for each image; for image I_i, the localization tightness score is
T_I(I_i) = min_j J(B_j^i).
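The per-box score J and the image-level tightness score can be sketched as follows. Hedged: the patent's own formulas survive only as image placeholders, so the closed form |T + P_max - 1| and the min-aggregation are assumptions consistent with the surrounding text, not a confirmed transcription.

```python
def box_tightness_score(t_iou, p_max):
    """J(B) = |T(B) + P_max(B) - 1|: high when the IoU-based tightness and
    the class confidence agree, low when they conflict. Reconstructed form;
    the original formula is an image placeholder in the patent text."""
    return abs(t_iou + p_max - 1.0)

def image_tightness_score(per_box_scores):
    """Image-level tightness T_I(I): the worst per-box score, so a single
    confidently classified but loosely localized box flags the image."""
    return min(per_box_scores)
```

Under this form, a box with T = 0.5 and P_max = 0.5 scores 0, marking the image as highly informative for annotation.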
The position and scale of a target box are continuously adjusted during network training, and the quality of a box is measured by its localization stability. For a given target box B_j, the localization stability is calculated as follows:
S_B(B_j) = (1/N) * Σ_{n=1}^{N} IoU(B_j, C_n(B_j)),
where C_n(B_j) denotes the box detected at noise level n that corresponds to the reference box B_j, and N denotes the number of noise levels applied to the picture. Across different noise levels, the localization stability of an image measures how tolerant its detections are to noise.
For a given image I_i, the localization stability is calculated as follows:
S_I(I_i) = Σ_{j=1}^{M} P_max(B_j) S_B(B_j) / Σ_{j=1}^{M} P_max(B_j),
where M denotes the number of reference target boxes, and each reference box is weighted by the probability of its highest-scoring category, so as to favor boxes with higher confidence.
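The stability computation can be sketched as follows, assuming the IoU values between each reference box and its detections under noise have already been computed (the noise model itself is left out of this sketch):

```python
def box_stability(ious_under_noise):
    """S_B(B) = (1/N) * sum_n IoU(B, C_n(B)): the mean IoU between a
    reference box and its detections at each of the N noise levels."""
    return sum(ious_under_noise) / len(ious_under_noise)

def image_stability(p_max_per_box, stability_per_box):
    """S_I(I): average of per-box stabilities weighted by each reference
    box's highest class probability, per the formula above."""
    weighted = sum(p * s for p, s in zip(p_max_per_box, stability_per_box))
    return weighted / sum(p_max_per_box)
```

The confidence weighting means low-probability boxes contribute little to the image-level stability.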
Step 4, using the classification uncertainty, localization stability and localization tightness, calculate and sort the composite scores of the unlabeled samples, defining the score of an image as F(I_i) = U_C(I_i) + T_I(I_i) + S_I(I_i). During the active learning process, the unlabeled images are sorted by composite score from low to high.
Step 5, extract the leading 1% of unlabeled samples in this ordering and deliver them to the annotators for labeling; after the samples are labeled, the new labeled data are added to the original training set, the detector is retrained, and the whole process is repeated cyclically.
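Steps 4 and 5 together amount to scoring, sorting ascending, and taking the leading fraction. A sketch: the ascending order and the 1% ratio follow the text, though whether "leading" means lowest or highest composite score is ambiguous in the translation; lowest is assumed here.

```python
def composite_score(u_c, t_i, s_i, alpha=1.0, beta=1.0, gamma=1.0):
    """F(I) = alpha*U_C(I) + beta*T_I(I) + gamma*S_I(I), weights 1 by default."""
    return alpha * u_c + beta * t_i + gamma * s_i

def select_for_annotation(scores, ratio=0.01):
    """Sort images by composite score ascending (step 4) and return the
    indices of the leading ~`ratio` fraction to hand to the annotators."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    k = max(1, int(len(scores) * ratio))
    return order[:k]
```

The selected indices identify the images routed to human annotators in each active-learning cycle.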
Step 6, in a practical embodiment, a number of active cycles and a detection IoU target are set; when the number of active learning cycles reaches the set value, or the detection IoU of the target detection model on the validation set meets the set target, the training process of the target detection model is terminated, where the IoU is calculated between the model's predictions and the validation set ground truth.
Step 7, use the model to detect the remaining unlabeled data and take the detection results as the labels of those samples.
This ends the flow of the human-in-the-loop labeling method based on active learning.
In addition, it should be noted that the scheme in the above embodiment can also be applied to an online intelligent annotation task for images: the active-learning-based human-in-the-loop labeling method is invoked in the background to train on the dataset a user uploads to the annotation platform; after the composite scores of the images to be annotated are computed, the user annotates only the images with higher annotation value, saving the time and expense of annotating images that the model can already recognize well.
Steps 0 to 7 above serve as an example of the human-in-the-loop labeling method based on active learning according to the present invention and are intended as interpretation and illustration; the steps, their order, the values and the value ranges mentioned therein are all to be understood as preferred or more preferred examples, and should not be construed as limiting the present invention.
In another aspect, the present invention further provides a human-in-the-loop labeling device based on active learning, wherein the device includes:
at least one processor; and
a memory storing processor-executable program instructions that, when executed by the processor, carry out the steps of the active-learning-based human-in-the-loop labeling method of any one of the preceding embodiments.
The devices and apparatuses disclosed in the embodiments of the present invention may be various electronic terminal apparatuses, such as a mobile phone, a personal digital assistant (PDA), a tablet computer (PAD) or a smart television, or may be a large terminal apparatus such as a server; therefore the scope of protection disclosed in the embodiments of the present invention should not be limited to a specific type of device or apparatus. The client disclosed in the embodiments of the present invention may be applied to any of the above electronic terminal devices in the form of electronic hardware, computer software, or a combination of both.
The computer-readable storage media (e.g., memory) described herein may be volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. By way of example, and not limitation, nonvolatile memory can include Read Only Memory (ROM), Programmable ROM (PROM), Erasable Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM), which can act as external cache memory. By way of example and not limitation, RAM is available in many forms, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), and Direct Rambus RAM (DRRAM). The storage devices of the disclosed aspects are intended to comprise, without being limited to, these and other suitable types of memory.
By adopting the above technical scheme, the present invention provides at least the following beneficial effects: it eases the tension, in the current computer vision field, between the high quality demanded of image annotation and the time- and labor-consuming nature of the annotation process; the active learning method effectively reduces the amount of labeled data required for deep learning while achieving detection performance comparable to training with far more labels; and by computing a composite score for unlabeled samples from the classification uncertainty, localization stability and localization tightness, the unlabeled samples most in need of manual annotation can be selected and labeled more effectively.
It is to be understood that the features listed above for the different embodiments may be combined with each other to form further embodiments within the scope of the invention, where technically feasible. Furthermore, the specific examples and embodiments described herein are non-limiting, and various modifications of the structure, steps and sequence set forth above may be made without departing from the scope of the invention.
In this application, the use of the disjunctive is intended to include the conjunctive. The use of definite or indefinite articles is not intended to indicate cardinality. In particular, a reference to "the" object or to "a" or "an" object is intended to denote also one of a possible plurality of such objects. However, although elements of the disclosed embodiments of the invention may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated. Furthermore, the conjunction "or" may be used to convey features that are present simultaneously, rather than mutually exclusive alternatives; in other words, the conjunction "or" should be understood to include "and/or". The term "comprising" is inclusive and has the same scope as "including".
The above-described embodiments, particularly any "preferred" embodiments, are possible examples of implementations, and are presented merely for a clear understanding of the principles of the invention. Many variations and modifications may be made to the above-described embodiments without departing substantially from the spirit and principles of the technology described herein. All such modifications are intended to be included within the scope of this disclosure.

Claims (10)

1. A human-in-the-loop annotation method based on active learning, characterized by comprising the following steps:
establishing a target detection model by using labeled sample data;
inputting unlabeled sample data into the target detection model for testing, so as to calculate a classification uncertainty, a localization stability and a localization tightness of the unlabeled sample data;
calculating a composite score of the unlabeled sample according to the classification uncertainty, the localization stability and the localization tightness;
extracting unlabeled sample data whose composite score falls within a preset range for manual annotation;
optimizing the target detection model by using the manually annotated sample data; and
cyclically executing the preceding four steps until the target detection model meets a termination condition, and annotating the remaining unlabeled sample data based on the trained target detection model.
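As a rough sketch (not part of the claims), the cycle of claim 1 can be written as the following loop. The callables `train`, `score` and `annotate`, the per-round budget and the round limit are illustrative assumptions, not details given by the claims:

```python
def active_learning_loop(labeled, unlabeled, train, score, annotate,
                         budget_per_round=10, max_rounds=5):
    """Sketch of the claimed human-in-the-loop cycle.

    train(labeled)   -> detection model        (build/optimize steps)
    score(model, x)  -> composite score        (test + scoring steps)
    annotate(x)      -> manually labeled item  (human annotation step)
    """
    model = train(labeled)                   # establish the detector
    for _ in range(max_rounds):              # termination: cycle budget
        if not unlabeled:
            break
        # Rank unlabeled samples by composite score, highest first.
        ranked = sorted(unlabeled, key=lambda x: score(model, x), reverse=True)
        picked, unlabeled = ranked[:budget_per_round], ranked[budget_per_round:]
        labeled = labeled + [annotate(x) for x in picked]  # human in the loop
        model = train(labeled)               # optimize with the new labels
    return model, labeled, unlabeled
```

In practice `score` would combine the three criteria of the claims, and the trained model would finally annotate whatever remains in `unlabeled`.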
2. The method of claim 1, wherein the building a target detection model using labeled sample data further comprises:
collecting the labeled sample data, and preprocessing the labeled sample data;
and establishing a target detection model by utilizing the preprocessed sample data.
3. The method of claim 2, wherein the pre-processing comprises at least scaling, equalization, and normalization.
4. The method of claim 1, wherein the building a target detection model using labeled sample data further comprises:
using the Conv1 to Conv5 layers of a VGG16 backbone feature-extraction network of the Faster RCNN framework;
selecting a plurality of anchor scales and a plurality of anchor aspect ratios;
Conv6 uses 512 convolution kernels of size 3 × 3 with zero padding and a stride of 1;
Conv7 uses 512 convolution kernels of size 5 × 5 with zero padding and a stride of 1;
and setting intersection-over-union thresholds for non-maximum suppression separately for the labeled training data and the unlabeled test data.
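As an arithmetic check on the Conv6/Conv7 settings above: the claim says only that the kernels are "filled with zero values", so the pad widths below (1 for the 3 × 3 kernel, 2 for the 5 × 5 kernel) are our assumed "same" padding. With stride 1, the standard output-size formula then shows both layers preserve the spatial size of the feature map:

```python
def conv_out_size(n, kernel, stride=1, pad=0):
    # Standard convolution output-size formula:
    #   out = floor((n + 2*pad - kernel) / stride) + 1
    return (n + 2 * pad - kernel) // stride + 1

# For an example 14x14 feature map:
#   Conv6: 3x3 kernel, pad 1, stride 1 -> 14x14
#   Conv7: 5x5 kernel, pad 2, stride 1 -> 14x14
```

Without padding the same kernels would shrink the map (e.g. 14 → 12 for a 3 × 3 kernel), which is why zero filling matters here.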
5. The method of claim 1, wherein the classification uncertainty is derived from the highest probability P max(B) among the classification predictions for a given target box B of the unlabeled data, and the calculation formula of the classification uncertainty is:
U B(B) = 1 - P max(B).
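A minimal sketch of this formula (the function name is ours; `class_probs` is the per-class probability vector predicted for one target box):

```python
def classification_uncertainty(class_probs):
    # U_B(B) = 1 - P_max(B): the lower the confidence in the most
    # likely class, the higher the annotation value of the box.
    return 1.0 - max(class_probs)
```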
6. The method according to claim 1, wherein the localization tightness measures how tightly the predicted target box fits the corresponding candidate region generated for the unlabeled data, and the localization tightness is calculated by:
T(B j) = IoU(B j, R j)
wherein T(B j) is the localization tightness of the jth predicted target box B j, and R j is the corresponding candidate region, input to the final classifier, from which B j was generated.
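The formula images for this claim did not survive extraction; assuming the localization tightness is the intersection-over-union between the jth predicted box B_j and the candidate region R_j that produced it (a common choice for tightness), a sketch is:

```python
def iou(a, b):
    # Boxes as (x1, y1, x2, y2) corner coordinates.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def localization_tightness(pred_box, candidate_region):
    # T(B_j) = IoU(B_j, R_j): overlap between the final prediction
    # and the region proposal that was fed to the final classifier.
    return iou(pred_box, candidate_region)
```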
7. The method according to claim 1, wherein the localization stability measures how stable the positioning of a target box of the unlabeled data is under injected noise, and the localization stability is calculated by:
S(B j) = (1/N) Σ n=1..N IoU(B j, C j(I n))
wherein S(B j) is the localization stability of the target box B j, C j(I n) is the corresponding box detected in the copy of the image corrupted with the nth noise level, and N is the number of noise levels.
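Assuming the localization stability averages the IoU between a target box and its counterparts detected in N noise-corrupted copies of the image (the formula image for this claim was lost in extraction), a self-contained sketch is:

```python
def iou(a, b):
    # (x1, y1, x2, y2) boxes; repeated here so the sketch stands alone.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def localization_stability(box, noisy_boxes):
    # S(B_j) = (1/N) * sum over n of IoU(B_j, C_j(I_n)), where
    # noisy_boxes holds the corresponding detections from the N
    # noise-corrupted copies of the image.
    if not noisy_boxes:
        return 0.0
    return sum(iou(box, nb) for nb in noisy_boxes) / len(noisy_boxes)
```

A box whose detection barely moves under noise scores near 1; an unstable box scores lower and is therefore a better candidate for manual annotation.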
8. The method of claim 1, wherein the composite score is a weighted sum of the classification uncertainty, the localization stability and the localization tightness.
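The weighted sum can be sketched as follows. The weights, and the convention that higher uncertainty, lower tightness and lower stability all raise the score (i.e. raise the annotation value), are our assumptions; the claim fixes neither:

```python
def composite_score(uncertainty, tightness, stability,
                    weights=(1.0, 1.0, 1.0)):
    # Weighted sum of the three criteria. Tightness and stability are
    # inverted so that a hard-to-localize sample gets a HIGH score.
    w_u, w_t, w_s = weights
    return (w_u * uncertainty
            + w_t * (1.0 - tightness)
            + w_s * (1.0 - stability))
```

Samples whose score falls in the preset range would then be routed to the human annotator.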
9. The method according to claim 1, wherein the termination condition is a predetermined number of active-learning cycles and/or a predetermined IoU criterion.
10. A human-in-the-loop annotation device based on active learning, the device comprising:
at least one processor; and
a memory storing processor-executable program instructions which, when executed by the processor, perform the steps of the active-learning-based human-in-the-loop annotation method of any one of claims 1 to 9.
CN201910995320.8A 2019-10-18 2019-10-18 Human-in-the-loop labeling method and device based on active learning Withdrawn CN110781941A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910995320.8A CN110781941A (en) 2019-10-18 2019-10-18 Human-in-the-loop labeling method and device based on active learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910995320.8A CN110781941A (en) 2019-10-18 2019-10-18 Human-in-the-loop labeling method and device based on active learning

Publications (1)

Publication Number Publication Date
CN110781941A true CN110781941A (en) 2020-02-11

Family

ID=69386043

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910995320.8A Withdrawn CN110781941A (en) Human-in-the-loop labeling method and device based on active learning

Country Status (1)

Country Link
CN (1) CN110781941A (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111428587A (en) * 2020-03-10 2020-07-17 同济大学 Crowd counting and density estimating method and device, storage medium and terminal
CN111428587B (en) * 2020-03-10 2022-07-29 同济大学 Crowd counting and density estimating method, device, storage medium and terminal
CN111429512A (en) * 2020-04-22 2020-07-17 北京小马慧行科技有限公司 Image processing method and device, storage medium and processor
CN111429512B (en) * 2020-04-22 2023-08-25 北京小马慧行科技有限公司 Image processing method and device, storage medium and processor
CN111783844A (en) * 2020-06-10 2020-10-16 东莞正扬电子机械有限公司 Target detection model training method and device based on deep learning and storage medium
CN112614570A (en) * 2020-12-16 2021-04-06 上海壁仞智能科技有限公司 Sample set labeling method, pathological image classification method and classification model construction method and device
CN112614570B (en) * 2020-12-16 2022-11-25 上海壁仞智能科技有限公司 Sample set labeling method, pathological image classification method, classification model construction method and device
CN112968941A (en) * 2021-02-01 2021-06-15 中科视拓(南京)科技有限公司 Data acquisition and man-machine collaborative annotation method based on edge calculation
CN112968941B (en) * 2021-02-01 2022-07-08 中科视拓(南京)科技有限公司 Data acquisition and man-machine collaborative annotation method based on edge calculation
CN113221875A (en) * 2021-07-08 2021-08-06 北京文安智能技术股份有限公司 Target detection model training method based on active learning
CN113221875B (en) * 2021-07-08 2021-09-21 北京文安智能技术股份有限公司 Target detection model training method based on active learning

Similar Documents

Publication Publication Date Title
CN110781941A (en) Human-in-the-loop labeling method and device based on active learning
CN109815770B (en) Two-dimensional code detection method, device and system
CN108470172B (en) Text information identification method and device
CN108304820B (en) Face detection method and device and terminal equipment
CN109492643A (en) Certificate recognition methods, device, computer equipment and storage medium based on OCR
CN111914642B (en) Pedestrian re-identification method, device, equipment and medium
CN110765865B (en) Underwater target detection method based on improved YOLO algorithm
CN112418278A (en) Multi-class object detection method, terminal device and storage medium
CN109472193A (en) Method for detecting human face and device
CN110781962B (en) Target detection method based on lightweight convolutional neural network
CN111783819A (en) Improved target detection method based on region-of-interest training on small-scale data set
CN110135446A (en) Method for text detection and computer storage medium
CN109345460B (en) Method and apparatus for rectifying image
CN110147833A (en) Facial image processing method, apparatus, system and readable storage medium storing program for executing
Wang et al. Yolov5 enhanced learning behavior recognition and analysis in smart classroom with multiple students
US20230106178A1 (en) Method and apparatus for marking object outline in target image, and storage medium and electronic apparatus
CN113780145A (en) Sperm morphology detection method, sperm morphology detection device, computer equipment and storage medium
CN109101984B (en) Image identification method and device based on convolutional neural network
CN112699809B (en) Vaccinia category identification method, device, computer equipment and storage medium
CN116958962A (en) Method for detecting pre-fruit-thinning pomegranate fruits based on improved YOLOv8s
CN113065379A (en) Image detection method and device fusing image quality and electronic equipment
US11893784B2 (en) Assessment of image quality for optical character recognition using machine learning
CN115512207A (en) Single-stage target detection method based on multipath feature fusion and high-order loss sensing sampling
CN111127327B (en) Picture inclination detection method and device
CN111046861B (en) Method for identifying infrared image, method for constructing identification model and application

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20200211