CN112966777B - Semi-automatic labeling method and system based on human-computer interaction - Google Patents

Semi-automatic labeling method and system based on human-computer interaction

Info

Publication number: CN112966777B
Application number: CN202110328124.2A
Authority: CN (China)
Prior art keywords: image, Gaussian, generating, semi-automatic labeling
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Other versions: CN112966777A
Inventors: 张新钰, 李骏, 李志伟, 刘宇红, 王力, 卢一倩
Current and original assignee: Tsinghua University
Application filed by Tsinghua University; priority to CN202110328124.2A; application granted

Classifications

    • G06F 18/40: Pattern recognition; software arrangements specially adapted for pattern recognition, e.g. user interfaces or toolboxes therefor
    • G06F 18/241: Pattern recognition; analysing; classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N 3/045: Computing arrangements based on biological models; neural networks; architecture, e.g. interconnection topology; combinations of networks
    • G06N 3/08: Computing arrangements based on biological models; neural networks; learning methods


Abstract

The invention discloses a semi-automatic labeling method and system based on human-computer interaction. The method comprises: fusing an RGB image to be labeled with a generated first Gaussian heat map; preprocessing the fused image; inputting the preprocessed fused image into a pre-established and trained semi-automatic labeling model, which labels the RGB image with a plurality of prediction boxes; and, when a prediction box does not meet the requirements, correcting it by generating a second Gaussian heat map. The method uses the Gaussian heat map as prior information for target detection, achieving the intended effect of semi-automatic labeling; it overcomes the time-consuming and labor-intensive drawbacks of manual labeling and improves labeling precision.

Description

Semi-automatic labeling method and system based on human-computer interaction
Technical Field
The invention belongs to the field of target detection. It particularly relates to a method that achieves the labeling purpose by using extra clicks as prior information fused with an RGB (red, green, blue) image, and more particularly to a semi-automatic labeling method and system based on human-computer interaction.
Background
With the rapid development of technologies such as the Internet, machine learning, big data and cloud computing, information data keep growing exponentially. Against this big-data background, computer vision has matured and spawned diverse industrial applications. Labeling data sets is an essential step for deep learning, but labeling data is very tedious work, and semi-automatic labeling can reduce the workload. Although open-source semi-automatic labeling tools exist, their usefulness presupposes a high-precision model: if the detection results are inaccurate, the workload actually increases and the tool becomes unusable. Moreover, such tools may produce inaccurate detections or miss targets entirely, which then requires manual labeling to fix. Since model efficiency and precision are difficult to balance, fully manual labeling is currently still the mainstream approach.
However, manually labeling data sets has several disadvantages:
1. High labor cost: target detection algorithms need massive labeled samples, and today's large-scale labeling tasks depend on manual work ("as much manual work, so much intelligence"), so producing a labeled data set is expensive.
2. Quality of manual labeling is hard to guarantee: different annotators and annotation teams judge by inconsistent scales; the labeling task is strongly affected by the subjectivity of annotators and reviewers, which introduces labeling errors and makes data consistency hard to guarantee.
3. High labeling threshold for target detection data sets: compared with the massive demand for labeled data, annotators with domain expertise are too scarce, so the entry threshold is high and labeling scales are difficult to keep consistent.
Disclosure of Invention
The invention aims to overcome the above technical defects and provides a method that realizes semi-automatic labeling by fusing simulated clicks into the RGB channels, built on the Faster R-CNN model. When Faster R-CNN fails to detect a target, the method supplies the picture with position information for detecting it, addressing the false and missed detections that Faster R-CNN suffers under conditions such as changing illumination, occlusion and shadows. By constraining prediction-box generation with the position information provided in advance, the probability of generating false detection boxes is suppressed, the human-computer-interaction-based semi-automatic labeling process is effectively realized, and the detection effect and robustness of the model are improved.
In order to achieve the above object, embodiment 1 of the present invention provides a semi-automatic labeling method based on human-computer interaction, the method comprising:
fusing an RGB image to be labeled with a generated first Gaussian heat map;
preprocessing the fused image;
inputting the preprocessed fused image into a pre-established and trained semi-automatic labeling model, and labeling the RGB image to be labeled with a plurality of prediction boxes;
and, when a prediction box does not meet the requirements, correcting it by generating a second Gaussian heat map.
As an improvement of the above method, fusing the RGB image to be labeled with the generated first Gaussian heat map specifically comprises:
determining the region where a target is located on the RGB image using the upper-left and lower-right coordinates in the xml file corresponding to the image, setting all pixel points inside the target region to 255 and all pixel points outside it to 0, thereby generating a Mask image;
randomly generating a plurality of simulation points within the central range of the Mask image, thereby generating the first Gaussian heat map heat1:

$$\mathrm{heat}_1(x,y)=\sum_{m=1}^{M}\exp\!\left(-\frac{(x-x_{1,m})^2+(y-y_{1,m})^2}{2\sigma^2}\right),\qquad (x-x_{1,m})^2+(y-y_{1,m})^2\le r_1^2$$

where (x, y) are the coordinates of a point on the picture, (x_{1,m}, y_{1,m}) are the coordinates of the m-th simulation point, and M is the number of simulation points; σ = 10 and r_1 = 4, r_1 representing the radiation range of a simulation point;
performing an add operation on the RGB image to be labeled and the generated first Gaussian heat map heat1.
As an improvement of the above method, the semi-automatic labeling model is a Resnet50+FPN structure with an added attention mechanism.
As an improvement of the above method, the method further comprises training the semi-automatic labeling model, which specifically comprises:
establishing a data set for training the model; the data set comprises a label file set and an image file set, where the label file set contains a number of xml files, the image file set contains a number of RGB images, and the xml files correspond one-to-one with the RGB images;
traversing each RGB image in the data set, determining the region where a target is located using the upper-left and lower-right coordinates in the corresponding xml file, setting all pixel points inside the target region to 255 and all pixel points outside it to 0, thereby generating a Mask image, and then generating a first Gaussian heat map from the Mask image;
fusing the three-channel RGB image with the first Gaussian heat map and preprocessing the fused image as the input of the semi-automatic labeling model;
and setting the encoder and decoder sizes, batch size, number of training epochs and per-epoch learning rate for training the semi-automatic labeling model, and training the model.
As an improvement of the above method, when a prediction box does not meet the requirements, the correction operation performed on it by generating a second Gaussian heat map specifically comprises:
step S1) computing the intersection-over-union of each prediction box output by the model with its GT box to obtain the IOU value; if the IOU is larger than the threshold, the prediction box is not corrected; otherwise it is processed and step S2) is entered;
step S2) computing the deviation between the prediction box to be corrected and the coordinates in the xml file, and randomly generating a number of simulation points according to the deviation, thereby generating the second Gaussian heat map heat2:

$$\mathrm{heat}_2(x,y)=\sum_{n=1}^{N}\exp\!\left(-\frac{(x-x_{2,n})^2+(y-y_{2,n})^2}{2\sigma^2}\right),\qquad (x-x_{2,n})^2+(y-y_{2,n})^2\le r_2^2$$

where (x, y) are the coordinates of a point on the image, (x_{2,n}, y_{2,n}) is the n-th simulation point, and N is the number of simulation points; σ = 10 and r_2 = 6;
step S3) stitching the three-channel RGB image with the single channel of the generated second Gaussian heat map, inputting the result into the semi-automatic labeling model again, outputting the image with labeled prediction boxes, and returning to step S1) until the intersection-over-union of every prediction box with its GT box is no less than the threshold.
Embodiment 2 of the invention provides a semi-automatic labeling system based on human-computer interaction, which comprises: a trained semi-automatic labeling model, a fusion module, a preprocessing module, a labeling module and a correction module;
the fusion module is used for fusing the RGB image to be labeled with the generated first Gaussian heat map;
the preprocessing module is used for preprocessing the fused image;
the labeling module is used for inputting the preprocessed fused image into the trained semi-automatic labeling model and labeling the RGB image with a plurality of prediction boxes;
and the correction module is used for correcting any prediction box that does not meet the requirements by generating a second Gaussian heat map.
Compared with the prior art, the invention has the following advantages:
1. The method first uses extra clicks as prior information to detect targets, achieving the intended effect of semi-automatic labeling; it then reduces the false detection rate by penalizing falsely detected targets, further improving the detection effect.
2. The semi-automatic labeling model not only greatly improves the precision of the original model but also transfers well across different data sets; it overcomes the time-consuming and labor-intensive drawbacks of manual labeling, since the annotator only needs to supply prior information about a target's position and the model actively strengthens detection near that position.
3. The predicted position may occasionally deviate slightly, which would otherwise require manual correction, and manual correction is costly and slow; the invention therefore adds a correction stage that fixes targets with unsatisfactory predictions so that the final predictions are correct.
4. The invention improves the original structure by adding a correction operation on top of interactive detection, thereby achieving the correction purpose.
Drawings
FIG. 1 is a schematic diagram of a semi-automatic labeling method based on human-computer interaction according to embodiment 1 of the present invention;
FIG. 2 is a diagram of a semi-automatic labeling model of the present invention;
FIG. 3 is a schematic diagram of the SKNet part of the semi-automatic labeling model.
the specific implementation mode is as follows:
in order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It is to be understood that the described embodiments are only a few embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Before describing the embodiments of the present invention, the related terms related to the embodiments of the present invention are first explained as follows:
RGB image: a color image acquired by a monocular camera; it is a three-channel image.
Labeling: the class labels used for supervised training of the target detection neural network; each target in the color image is labeled with its class.
As shown in fig. 1, embodiment 1 of the present invention provides a semi-automatic labeling method based on human-computer interaction, which includes the following specific implementation steps:
step 1) fusing the RGB image to be labeled with a Gaussian heat map generated from simulated clicks; specifically:
step 101) generating a Mask image from the xml file corresponding to the image to be labeled;
the region where a target is located is determined on the RGB image using the upper-left and lower-right coordinates in the xml file; all pixel points inside the target region are set to 255 (any number larger than 0 works) and all pixel points outside it are set to 0, so that the target region can be separated from the background region; this yields the Mask image, which is a gray-level image.
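A minimal sketch of step 101), assuming the label files follow the Pascal VOC xml layout (object/bndbox elements with xmin, ymin, xmax, ymax fields; the patent only speaks of upper-left and lower-right coordinates, so these field names are an assumption):

```python
import xml.etree.ElementTree as ET
import numpy as np

def make_mask(xml_path, height, width):
    """Build a single-channel Mask image from a VOC-style xml label file:
    pixels inside every target box become 255, the background stays 0."""
    mask = np.zeros((height, width), dtype=np.uint8)
    root = ET.parse(xml_path).getroot()
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        xmin = int(float(box.find("xmin").text))
        ymin = int(float(box.find("ymin").text))
        xmax = int(float(box.find("xmax").text))
        ymax = int(float(box.find("ymax").text))
        mask[ymin:ymax, xmin:xmax] = 255  # any value > 0 also works
    return mask
```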
step 102) generating the first Gaussian heat map from the Mask image;
a point is randomly generated within the central range of the Mask image and a Gaussian heat map is generated around it; if the simulated click consists of several points, all the generated Gaussian heat maps are added together to form the first Gaussian heat map. The generated first Gaussian heat map is fused with the RGB image to be labeled as prior information, and the penalty coefficient for falsely detected targets is raised to reduce the false detection rate.
The coordinates of a simulated click are the center point of its Gaussian heat map; because each picture may contain more than one target, the resulting first Gaussian heat map combines the per-point maps additively.
The first Gaussian heat map heat1 is generated by:

$$\mathrm{heat}_1(x,y)=\sum_{m=1}^{M}\exp\!\left(-\frac{(x-x_{1,m})^2+(y-y_{1,m})^2}{2\sigma^2}\right),\qquad (x-x_{1,m})^2+(y-y_{1,m})^2\le r_1^2$$

where (x, y) are the coordinates of a point on the picture, (x_{1,m}, y_{1,m}) are the coordinates of the point generated by the m-th simulated click, and M is the number of simulation points; σ = 10 and r_1 = 4, r_1 representing the radiation range of a simulated click.
The above process yields a matrix, which must be converted into the same format as the picture. A matrix of the same size as the picture is created and initialized with dimensions (h, w, 1), where h is the image height, w is the image width and 1 is the number of channels. Since the result at this point is an array while the required data type is a tensor, the array is finally converted into a tensor.
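The heat-map construction and the array-to-tensor conversion described above might look as follows; truncating each Gaussian at the radiation range r_1 is one plausible reading of the formula, not something the text spells out:

```python
import numpy as np
import torch

def gaussian_heatmap(points, height, width, sigma=10.0, radius=4):
    """Sum one truncated Gaussian per simulated click, then convert the
    (h, w, 1) array into a tensor, mirroring the steps described above."""
    heat = np.zeros((height, width), dtype=np.float32)
    ys, xs = np.mgrid[0:height, 0:width]
    for (px, py) in points:
        d2 = (xs - px) ** 2 + (ys - py) ** 2
        g = np.exp(-d2 / (2.0 * sigma ** 2))
        g[d2 > radius ** 2] = 0.0          # zero outside the radiation range
        heat += g                          # additive fusion of per-point maps
    heat = heat[..., None]                 # (h, w, 1), a single channel
    return torch.from_numpy(heat)          # array -> tensor, as the text requires
```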
Step 103) fusing the RGB image to be annotated with the generated first Gaussian heat map;
first, it is checked whether the RGB image and the Gaussian heat map generated above have the same dimensionality; if they do, an add operation is performed on the RGB image and the first Gaussian heat map.
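A sketch of one plausible reading of the add operation, broadcasting the single heat-map channel over the three RGB channels after the dimension check:

```python
def fuse(rgb, heat):
    """Add the heat map onto the RGB image once their spatial dims match."""
    # rgb: (h, w, 3) tensor, heat: (h, w, 1) tensor
    assert rgb.shape[:2] == heat.shape[:2], "image and heat map dims differ"
    return rgb + heat  # the 'add' operation; broadcasts over the channel axis
```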
Step 2) preprocessing the fused image: whitening, flipping, cropping and similar operations are applied to the image;
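One possible realization of this preprocessing with torchvision; none of the parameter values below are given in the patent, they are placeholders:

```python
import torchvision.transforms as T

# Expects a float (3, H, W) tensor; whitening is done as channel normalization.
preprocess = T.Compose([
    T.RandomHorizontalFlip(p=0.5),   # flipping
    T.RandomCrop(600),               # cropping (assumes input >= 600x600)
    T.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]),
])
```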
step 3) constructing and training the semi-automatic labeling model, inputting the preprocessed fused image into it, and labeling prediction boxes on the fused image;
step 301) constructing the semi-automatic labeling model;
as shown in fig. 2, the basic backbone of the semi-automatic labeling model adopts a supervised model: the backbone network is Resnet50, optimized with an FPN structure. In Faster R-CNN, the RPN works on the features of the last layer only: those features are passed through a 3x3 convolution to obtain a 256-channel convolutional layer, followed by two 1x1 convolutions that produce the class scores and the box regression results. The RPN sub-network behind the feature layer is referred to here as the network header. At each point of the feature layer, 9 boxes are preset in the manner of anchors, covering different scales and different aspect ratios.
The FPN improves on the RPN by applying a network header to every P layer. Because each P layer carries a different scale relative to the original picture, the scale information of the original RPN is separated out so that each P layer handles only a single scale. Specifically, anchors of the five scales {32², 64², 128², 256², 512²} correspond to the five feature layers {P2, P3, P4, P5, P6}, respectively. Each feature layer handles candidate boxes of the three aspect ratios 1:1, 1:2 and 2:1. P6, obtained by down-sampling P5, is designed specifically for the RPN to handle candidate boxes of size 512. The parameters of the five network headers are shared.
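This anchor assignment can be written down with torchvision's AnchorGenerator, one size per P layer and the three shared aspect ratios:

```python
from torchvision.models.detection.anchor_utils import AnchorGenerator

# Five anchor areas {32**2 ... 512**2} assigned one per level to P2..P6,
# each level using the aspect ratios 1:1, 1:2 and 2:1.
anchor_generator = AnchorGenerator(
    sizes=((32,), (64,), (128,), (256,), (512,)),
    aspect_ratios=((1.0, 0.5, 2.0),) * 5,
)
```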
In order to further improve the model accuracy, the invention adds an attention mechanism in the front part of the backbone network, and the attention mechanism adopts a sknet method, as shown in fig. 3;
specifically, Selective Kernel Networks, which initiate the concept of dynamically modulating their own receptive field from cortical neurons according to different stimuli, are products that combine the ideas of SE operator, Merge-and-Run maps, and anchorage on entrapment block. In terms of design concept, simple is compared, namely Selective Kernel modification is carried out on all convolution kernels larger than 1, the convenience of smaller theoretical parameters and floats brought by group/depthwise convolution is fully utilized, and the design of adding multiple paths and dynamic selection does not bring large overhead (but the actual acceleration optimization of group/depthwise is not particularly good at present, so that the actual speed is slightly slow), and table 1 can be specifically referred. The design makes it very easy to select Kernel for any network, and only the convolution larger than 1 needs to be switched.
TABLE 1 (reproduced in the original as an image; its contents are not recoverable here)
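For illustration, a compact Selective Kernel unit in the spirit of SKNet; the patent does not give the exact branch configuration, so the kernel sizes, group count and reduction ratio below are assumptions:

```python
import torch
import torch.nn as nn

class SKConv(nn.Module):
    """Compact Selective Kernel unit: two branches with different receptive
    fields, fused by a learned soft attention over kernels (a sketch only)."""
    def __init__(self, channels, reduction=16, groups=32):
        super().__init__()
        mid = max(channels // reduction, 32)
        self.branch3 = nn.Sequential(  # 3x3 receptive field
            nn.Conv2d(channels, channels, 3, padding=1, groups=groups, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.branch5 = nn.Sequential(  # dilated 3x3, acts like a 5x5 kernel
            nn.Conv2d(channels, channels, 3, padding=2, dilation=2,
                      groups=groups, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.squeeze = nn.Sequential(  # SE-style global descriptor
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, mid, 1, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True))
        self.select = nn.Conv2d(mid, channels * 2, 1)  # per-branch attention logits

    def forward(self, x):
        u3, u5 = self.branch3(x), self.branch5(x)
        s = self.squeeze(u3 + u5)
        a = self.select(s).view(x.size(0), 2, -1, 1, 1).softmax(dim=1)
        return a[:, 0] * u3 + a[:, 1] * u5  # dynamic kernel selection

# channels must be divisible by `groups`, e.g. SKConv(256) for a ResNet stage.
```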
Step 302) training the semi-automatic labeling model;
the model takes its own input values as the supervised label values of the neural network; an exemplary training process of the semi-automatic labeling model is described below.
Step 302-1) building a data set for training the model; the data set comprises a label file set and an image file set, where the label file set contains a number of xml files, the image file set contains a number of RGB images, and the xml files correspond one-to-one with the RGB images.
Step 302-2) traversing each RGB image in the data set, determining the region where the target is located using the upper-left and lower-right coordinates in the xml file, then setting all pixel points inside the target region to 255 (any number larger than 0 works) and pixel points outside it to 0, so that the target region can be separated from the background; this yields a Mask image, from which the first Gaussian heat map is generated;
step 302-3) fusing the three-channel RGB image with the first Gaussian heat map and preprocessing the fused image as the input of the semi-automatic labeling model;
step 302-4) implementing a supervised neural network based on a tool pytorre, setting the size, batch processing number, training round times, learning rate of each round and other hyper-parameters needing to be artificially defined of an encoder and a decoder of the network, starting training, obtaining a prior frame with a larger IOU corresponding to a GT frame in a label coding process, calculating a prediction result which should be obtained by an anchor with a larger IOU (Intersection-over-Unit) corresponding to a real frame, and finding the anchor corresponding to each real frame of each picture for training, namely the prior frame with the largest IOU. For the predicted frame position obtained by network prediction not being the actual frame position, the position of the final frame can be obtained after the combination processing with the prior frame.
The decoding process requires two parameters, the convolved location and anchor. And then, calculating errors according to the loss function, updating network parameters by using a back propagation algorithm, and completing the first round of training until all rounds of training are completed. In order to accelerate the training speed of the network, the activation function of the network is selected as a ReLU function when the network is trained.
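A schematic PyTorch training step matching this description; the torchvision-style interface in which the model returns a dict of losses is an assumption, as are all hyper-parameter values:

```python
import torch

def train(model, train_loader, num_epochs=20, lr=1e-3):
    """Schematic training loop; assumes a torchvision-style detection model
    that returns a dict of losses when called with (images, targets)."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for epoch in range(num_epochs):
        for fused_images, targets in train_loader:
            loss_dict = model(fused_images, targets)  # RPN + detection-head losses
            loss = sum(loss_dict.values())
            optimizer.zero_grad()
            loss.backward()                           # back-propagation update
            optimizer.step()
```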
The loss function of the RPN is:

$$L(\{p_i\},\{t_i\})=\frac{1}{N_{cls}}\sum_i L_{cls}(p_i,p_i^*)+\lambda\,\frac{1}{N_{reg}}\sum_i p_i^*\,L_{reg}(t_i,t_i^*)$$

where:
p_i: the probability that anchor i is predicted to be a target;
p_i^*: the ground-truth label, 0 for a negative label and 1 for a positive label;
t_i = (t_x, t_y, t_w, t_h): a vector of the 4 parameterized coordinates of the prediction box, with t_x the abscissa of the box center, t_y the ordinate of the box center, t_w the box width and t_h the box height;
t_i^*: the coordinate vector of the GT box corresponding to an anchor in which a target is detected;
L_cls: the classification loss, a cross-entropy (logarithmic) loss over the two classes (target and non-target):

$$L_{cls}(p_i,p_i^*)=-\log\bigl[p_i\,p_i^*+(1-p_i)(1-p_i^*)\bigr]$$

L_reg: the regression loss, computed with the smooth L1 function:

$$L_{reg}(t_i,t_i^*)=\mathrm{smooth}_{L1}(t_i-t_i^*),\qquad \mathrm{smooth}_{L1}(x)=\begin{cases}0.5x^2,&|x|<1\\|x|-0.5,&\text{otherwise}\end{cases}$$

The factor p_i^* in front of L_reg means that only foreground anchors (p_i^* = 1) incur a regression loss; for background anchors (p_i^* = 0) there is none. The outputs of the classification layer and the regression layer consist of {p_i} and {t_i} respectively, normalized by N_cls and N_reg and balanced by the weight λ = 10. The classification term is normalized by the mini-batch size, i.e. N_cls = 256, and the regression term by the number of anchor locations, i.e. N_reg = 2400 (40 × 60), so the classification and regression terms carry roughly equal weight.
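The smooth L1 term and the two-part RPN loss above can be written directly; a sketch under the stated values λ = 10, N_cls = 256, N_reg = 2400:

```python
import torch
import torch.nn.functional as F

def smooth_l1(x):
    """smooth_L1(x) = 0.5 x**2 if |x| < 1, |x| - 0.5 otherwise."""
    return torch.where(x.abs() < 1, 0.5 * x ** 2, x.abs() - 0.5)

def rpn_loss(p, p_star, t, t_star, lam=10.0, n_cls=256, n_reg=2400):
    """Binary cross-entropy over the sampled anchors plus smooth-L1 regression
    counted only for foreground anchors (p_star is 0 or 1, p in (0, 1))."""
    cls = F.binary_cross_entropy(p, p_star, reduction="sum") / n_cls
    reg = (p_star.unsqueeze(-1) * smooth_l1(t - t_star)).sum() / n_reg
    return cls + lam * reg
```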
The loss function of Faster R-CNN is:

$$L(p,u,t^u,v)=L_{cls}(p,u)+\lambda\,[u\ge 1]\,L_{loc}(t^u,v)$$

where L_loc(·) is the regression loss function and the indicator [u ≥ 1] distinguishes the two cases: when u is foreground the regression term is active, and when u is background it is dropped. Here p is the probability distribution predicted by the classifier; u is the true category label; t^u is the t matrix corresponding to the GT box; v is the t matrix corresponding to the prediction. In this case λ = 1, the total loss is the weighted sum of the two terms, and the regression loss is not counted if the classification is background.
Step 302-5) when a prediction box is not ideal, a correction operation can be performed, specifically:
step 302-5-1) while training a batch of data, the intersection-over-union of each prediction box output by the model with its GT box (target prediction region) is computed to obtain the IOU value; if the IOU is larger than a certain threshold, the prediction box is by default not corrected; otherwise it is processed and step 302-5-2) is entered. The threshold is set to 0.7 here, meaning that prediction boxes whose intersection-over-union is less than 0.7 are reprocessed.
Step 302-5-2) computing the deviation between the prediction box to be corrected and the coordinates in the corresponding xml file, and randomly generating a number of new simulation points according to the deviation, thereby generating the second Gaussian heat map heat2:

$$\mathrm{heat}_2(x,y)=\sum_{n=1}^{N}\exp\!\left(-\frac{(x-x_{2,n})^2+(y-y_{2,n})^2}{2\sigma^2}\right),\qquad (x-x_{2,n})^2+(y-y_{2,n})^2\le r_2^2$$

where (x, y) are the coordinates of a point on the image, (x_{2,n}, y_{2,n}) is the n-th simulation point, and N is the number of simulation points; σ = 10 and r_2 = 6, r_2 representing the radiation range of a simulation point. The radiation range is set larger than in the previous step, meaning the generated Gaussian heat map puts more emphasis on the correction points.
Step 302-5-3) the three-channel RGB image is then stitched with the single channel of the second Gaussian heat map: the network input of size (batch_size, 3, h, w) becomes (batch_size, 4, h, w) after stitching.
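The stitching itself is a single channel-wise concatenation; the shapes below are illustrative:

```python
import torch

rgb_batch = torch.rand(8, 3, 600, 600)    # (batch_size, 3, h, w)
heat2_batch = torch.rand(8, 1, 600, 600)  # one correction heat map per image
fused_input = torch.cat([rgb_batch, heat2_batch], dim=1)
print(fused_input.shape)                  # torch.Size([8, 4, 600, 600])
```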
The trained model is then further optimized, the visualization of human-computer interaction is realized, and the transferability across data sets is tested, improving the migration effect as much as possible.
Step 303) inputting the preprocessed fused image into a trained semi-automatic labeling model, and labeling a plurality of prediction frames on the fused image;
step 4) when the prediction boxes output in step 303) include an unsatisfactory one, correcting it, as follows:
correction is an independent stage; if the prediction is not ideal, the correction stage starts working. New simulation points are generated from the deviation between the GT box and the prediction box, enabling the model to better detect the target.
Step 401) judging whether correction is needed according to the value of the IOU;
performing intersection ratio calculation on each prediction frame output by the model and a GT frame (target prediction region) to obtain the value of the IOU; if the IOU is larger than a certain threshold value, the predicted frame is not corrected by default, otherwise, the predicted frame is processed, and the step 402 is entered); the threshold is set to 0.7 at this time, meaning that prediction blocks with an intersection ratio less than 0.7 are reprocessed.
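The IOU test of step 401) is the standard intersection-over-union computation; for reference:

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```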
Step 402) computing the deviation between the prediction box needing correction and the coordinates in the xml file, and randomly generating a number of new simulation points according to the deviation, thereby generating the second Gaussian heat map heat2:

$$\mathrm{heat}_2(x,y)=\sum_{n=1}^{N}\exp\!\left(-\frac{(x-x_{2,n})^2+(y-y_{2,n})^2}{2\sigma^2}\right),\qquad (x-x_{2,n})^2+(y-y_{2,n})^2\le r_2^2$$

where (x, y) are the coordinates of a point on the image, (x_{2,n}, y_{2,n}) is the n-th simulation point, and N is the number of simulation points; σ = 10 and r_2 = 6, r_2 representing the radiation range of a simulation point. The radiation range is set larger than in the previous step, meaning the generated Gaussian heat map puts more emphasis on the correction points.
Step 403) fusing the three-channel RGB image with the single channel of the generated second Gaussian heat map, inputting the result into the semi-automatic labeling model again, outputting the image with labeled prediction boxes, and returning to step 401) until the intersection-over-union of every prediction box is no less than 0.7.
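Putting steps 401) to 403) together as a schematic loop; it reuses iou() and gaussian_heatmap() from the sketches above, and generate_points() is a hypothetical helper standing in for the deviation-to-simulation-point step the text describes:

```python
import torch

def correct(rgb, pred_boxes, gt_boxes, model, thr=0.7):
    """Schematic correction loop for steps 401)-403). `generate_points` is a
    hypothetical placeholder, not an API defined by the patent."""
    while True:
        bad = [(p, g) for p, g in zip(pred_boxes, gt_boxes) if iou(p, g) < thr]
        if not bad:
            return pred_boxes                       # every box passes the IOU test
        points = generate_points(bad)               # simulation points from deviations
        h, w = rgb.shape[-2], rgb.shape[-1]
        heat2 = gaussian_heatmap(points, h, w, sigma=10.0, radius=6)
        heat2 = heat2.permute(2, 0, 1).unsqueeze(0)  # (1, 1, h, w)
        x = torch.cat([rgb, heat2], dim=1)           # (1, 4, h, w) stitched input
        pred_boxes = model(x)                        # re-label with corrected input
```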
Embodiment 2 of the present invention provides a semi-automatic labeling system based on human-computer interaction, including: a trained semi-automatic labeling model, a fusion module, a preprocessing module, a labeling module and a correction module;
the fusion module is used for fusing the RGB image to be labeled with the generated first Gaussian heat map;
the preprocessing module is used for preprocessing the fused image;
the labeling module is used for inputting the preprocessed fused image into the trained semi-automatic labeling model and labeling the RGB image with a plurality of prediction boxes;
and the correction module is used for correcting any prediction box that does not meet the requirements by generating a second Gaussian heat map.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the embodiments, those skilled in the art will understand that various changes may be made and equivalents substituted without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (5)

1. A semi-automatic labeling method based on human-computer interaction, the method comprising:
fusing an RGB image to be labeled with a generated first Gaussian heat map;
preprocessing the fused image;
inputting the preprocessed fused image into a pre-established and trained semi-automatic labeling model, and labeling the RGB image to be labeled with a plurality of prediction boxes;
when a prediction box does not meet the requirements, correcting it by generating a second Gaussian heat map;
wherein fusing the RGB image to be labeled with the generated first Gaussian heat map specifically comprises:
determining the region where a target is located on the RGB image using the upper-left and lower-right coordinates in the xml file corresponding to the image, setting all pixel points inside the target region to 255 and all pixel points outside it to 0, thereby generating a Mask image;
randomly generating a plurality of simulation points within the central range of the Mask image, thereby generating the first Gaussian heat map heat1:

$$\mathrm{heat}_1(x,y)=\sum_{m=1}^{M}\exp\!\left(-\frac{(x-x_{1,m})^2+(y-y_{1,m})^2}{2\sigma^2}\right),\qquad (x-x_{1,m})^2+(y-y_{1,m})^2\le r_1^2$$

where (x, y) are the coordinates of a point on the image, (x_{1,m}, y_{1,m}) are the coordinates of the m-th simulation point, and M is the number of simulation points; σ = 10 and r_1 = 4, r_1 representing the radiation range of a simulation point;
and performing an add operation on the RGB image to be labeled and the generated first Gaussian heat map heat1.
2. The semi-automatic labeling method based on human-computer interaction of claim 1, wherein the semi-automatic labeling model is a Resnet50+FPN structure with an added attention mechanism.
3. The semi-automatic labeling method based on human-computer interaction of claim 2, further comprising training the semi-automatic labeling model, specifically comprising:
establishing a data set for training the model, the data set comprising a label file set and an image file set, wherein the label file set contains a number of xml files, the image file set contains a number of RGB images, and the xml files correspond one-to-one with the RGB images;
traversing each RGB image in the data set, determining the region where a target is located using the upper-left and lower-right coordinates in the corresponding xml file, setting all pixel points inside the target region to 255 and pixel points outside it to 0, thereby generating a Mask image, and then generating a first Gaussian heat map from the Mask image;
fusing the three-channel RGB image with the first Gaussian heat map and preprocessing the fused image as the input of the neural network model;
and setting the encoder and decoder sizes, batch size, number of training epochs and per-epoch learning rate for training the neural network model, and training the model.
4. The semi-automatic labeling method based on human-computer interaction of claim 3, wherein when a prediction box does not meet the requirements, correcting it by generating a second Gaussian heat map specifically comprises:
step S1) computing the intersection-over-union of each prediction box output by the semi-automatic labeling model with its GT box to obtain the IOU value; if the IOU is larger than the threshold, leaving the prediction box uncorrected; otherwise processing it and entering step S2);
step S2) computing the deviation between the prediction box to be corrected and the coordinates in the xml file, and randomly generating a number of simulation points according to the deviation, thereby generating the second Gaussian heat map heat2:

$$\mathrm{heat}_2(x,y)=\sum_{n=1}^{N}\exp\!\left(-\frac{(x-x_{2,n})^2+(y-y_{2,n})^2}{2\sigma^2}\right),\qquad (x-x_{2,n})^2+(y-y_{2,n})^2\le r_2^2$$

where (x, y) are the coordinates of a point on the image, (x_{2,n}, y_{2,n}) is the n-th simulation point, and N is the number of simulation points; σ = 10 and r_2 = 6;
and step S3) stitching the three-channel RGB image with the single channel of the generated second Gaussian heat map, inputting the result into the semi-automatic labeling model again, outputting the image with labeled prediction boxes, and returning to step S1) until the intersection-over-union of every prediction box with its GT box is no less than the threshold.
5. A semi-automatic labeling system based on human-computer interaction, characterized in that the system comprises: a trained semi-automatic labeling model, a fusion module, a preprocessing module, a labeling module and a correction module;
the fusion module is used for fusing the RGB image to be labeled with the generated first Gaussian heat map;
the preprocessing module is used for preprocessing the fused image;
the labeling module is used for inputting the preprocessed fused image into the trained semi-automatic labeling model and labeling the RGB image with a plurality of prediction boxes;
the correction module is used for correcting any prediction box that does not meet the requirements by generating a second Gaussian heat map;
wherein the specific processing procedure of the fusion module comprises:
determining the region where a target is located on the RGB image using the upper-left and lower-right coordinates in the xml file corresponding to the image, setting all pixel points inside the target region to 255 and all pixel points outside it to 0, thereby generating a Mask image;
randomly generating a plurality of simulation points within the central range of the Mask image, thereby generating the first Gaussian heat map heat1:

$$\mathrm{heat}_1(x,y)=\sum_{m=1}^{M}\exp\!\left(-\frac{(x-x_{1,m})^2+(y-y_{1,m})^2}{2\sigma^2}\right),\qquad (x-x_{1,m})^2+(y-y_{1,m})^2\le r_1^2$$

where (x, y) are the coordinates of a point on the image, (x_{1,m}, y_{1,m}) are the coordinates of the m-th simulation point, and M is the number of simulation points; σ = 10 and r_1 = 4, r_1 representing the radiation range of a simulation point;
and performing an add operation on the RGB image to be labeled and the generated first Gaussian heat map heat1.
CN202110328124.2A 2021-03-26 2021-03-26 Semi-automatic labeling method and system based on human-computer interaction Active CN112966777B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110328124.2A CN112966777B (en) 2021-03-26 2021-03-26 Semi-automatic labeling method and system based on human-computer interaction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110328124.2A CN112966777B (en) 2021-03-26 2021-03-26 Semi-automatic labeling method and system based on human-computer interaction

Publications (2)

Publication Number Publication Date
CN112966777A CN112966777A (en) 2021-06-15
CN112966777B true CN112966777B (en) 2021-11-30

Family

ID=76278690

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110328124.2A Active CN112966777B (en) 2021-03-26 2021-03-26 Semi-automatic labeling method and system based on human-computer interaction

Country Status (1)

Country Link
CN (1) CN112966777B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113743416B (en) * 2021-08-24 2024-03-05 的卢技术有限公司 Data enhancement method for non-real sample situation in OCR field
CN116612474B (en) * 2023-07-20 2023-11-03 深圳思谋信息科技有限公司 Object detection method, device, computer equipment and computer readable storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108227912B (en) * 2017-11-30 2021-05-11 北京市商汤科技开发有限公司 Device control method and apparatus, electronic device, computer storage medium
CN108520223B (en) * 2018-04-02 2021-11-12 广州方硅信息技术有限公司 Video image segmentation method, segmentation device, storage medium and terminal equipment
CN108846309A (en) * 2018-04-27 2018-11-20 淘然视界(杭州)科技有限公司 A kind of pedestrian's automatic marking method and system
CN110570352B (en) * 2019-08-26 2021-11-05 腾讯科技(深圳)有限公司 Image labeling method, device and system and cell labeling method
CN110866908B (en) * 2019-11-12 2021-03-26 腾讯科技(深圳)有限公司 Image processing method, image processing apparatus, server, and storage medium

Also Published As

Publication number Publication date
CN112966777A (en) 2021-06-15


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant