CN110210482B

CN110210482B - Target detection method for improving class imbalance

Info

Publication number: CN110210482B
Application number: CN201910494701.8A
Authority: CN
Inventors: 徐童; 陈卓; 刘�东; 陈恩红
Original assignee: University of Science and Technology of China USTC
Current assignee: University of Science and Technology of China USTC
Priority date: 2019-06-05
Filing date: 2019-06-05
Publication date: 2022-09-06
Anticipated expiration: 2039-06-05
Also published as: CN110210482A

Abstract

The invention discloses a target detection method for improving class imbalance, comprising the following steps: adding a branch structure for generating a mask map to a conventional feature detection network, to obtain a mask map of the same size as the feature map of a specified resolution; training the mask map, assigning a positive or negative label to each mask grid cell in the mask map according to the training result, training the anchor boxes on positive mask grid cells, and ignoring the anchor boxes on negative mask grid cells so that they do not participate in training, to finally obtain a trained target detection model; and, in the testing stage, screening the anchor boxes with the trained target detection model to obtain the target detection result. The method is applicable to deep-learning-based target detectors and alleviates the imbalance between positive and negative samples in such detectors, thereby improving the accuracy of the target detection algorithm.

Description

Target detection method for improving class imbalance

Technical Field

The invention relates to the technical field of target detection, in particular to a target detection method for improving class imbalance.

Background

Object detection is a fundamental task in computer vision: it locates all objects in an image and classifies each of them. Results on major international benchmark datasets show that deep-learning-based target detectors are currently the most advanced detection methods, and most of them use anchor boxes to predict detection boxes. An anchor box (anchor) can be understood as a preset box; the target detector covers objects of different sizes by arranging anchor boxes of various shapes. For a W × H feature map with 9 different types of anchor boxes, there are W × H × 9 anchor boxes, and, viewed from the input image, the centers of the anchor boxes are uniformly distributed over the image at equal intervals. Training a target detector revolves around how the anchor boxes are trained. For example, an anchor box that overlaps strongly with a real object is regarded as a positive sample (positive anchor box): its coordinates are regressed to bring it close to the real object region, and the class of the real object is used as supervision for a classification task on the anchor box. An anchor box that overlaps little with any real object is regarded as a negative sample (negative anchor box): its coordinate regression task is ignored, and the class "background" is used as supervision for its classification task.
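As a rough illustration of the anchor count described above, the following sketch enumerates anchor centers for one feature map. The function name `anchor_centers`, the 512 × 512 input, and the 16 × 16 feature map are illustrative assumptions; only the factor of 9 anchor types comes from the text.

```python
def anchor_centers(feat_w, feat_h, img_w, img_h):
    """Return one anchor center per feature-map cell, in image coordinates."""
    stride_x = img_w / feat_w
    stride_y = img_h / feat_h
    # Centers sit in the middle of each cell, so they are equally spaced.
    return [((col + 0.5) * stride_x, (row + 0.5) * stride_y)
            for row in range(feat_h) for col in range(feat_w)]


centers = anchor_centers(16, 16, 512, 512)   # assumed sizes
num_anchor_types = 9                         # from the text above
total_anchors = len(centers) * num_anchor_types
print(total_anchors)  # 2304 = 16 * 16 * 9
```

Since an image usually contains only a handful of objects, almost all of these anchors end up as negatives, which is the imbalance the following paragraphs discuss.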

However, during the training of the target detector, the training of the anchor boxes is affected by a severe class imbalance. Because the objects in an image are limited, the number of positive anchor boxes is small and the number of negative anchor boxes is far greater, resulting in an unbalanced positive-negative ratio. When few anchor box types are set, the positive-negative ratio may reach about 1:1000; when many are set, it may exceed 1:10000. Such an unbalanced ratio greatly affects the reliability of the target detector, biasing predictions toward the "background" category and reducing recall.

Existing methods for addressing class imbalance are mainly variants of bias weighting. For example, sampling is performed during training so that the positive-negative ratio is fixed at 1:3; or only the "hard samples" with larger losses are considered during back-propagation to balance the positive and negative proportions; or an adaptive weight is applied to every sample. These approaches cause two problems: (1) it is difficult to explain how the introduced parameters should be set, so they can only be chosen heuristically, and their effect depends on the dataset; (2) sampling only assists model training through relaxation and does not substantially change the positive-negative ratio of the anchor boxes during training, which leads to an inconsistency between training and testing.

Disclosure of Invention

The invention aims to provide a target detection method for improving class imbalance, and the accuracy of target detection is improved.

The purpose of the invention is realized by the following technical scheme:

a target detection method for improving class imbalance, comprising:

adding a branch structure for generating a mask map to a conventional feature detection network, to obtain a mask map of the same size as the feature map of a specified resolution;

training the mask map, assigning a positive or negative label to each mask grid cell in the mask map according to the training result, training the anchor boxes on positive mask grid cells, and ignoring the anchor boxes on negative mask grid cells so that they do not participate in training, to finally obtain a trained target detection model;

and, in the testing stage, screening the anchor boxes with the trained target detection model to obtain the target detection result.

Compared with conventional methods for addressing the class imbalance of anchor boxes in target detectors, the technical scheme provided by the invention addresses the problem more directly: it reduces the number of anchor boxes that need to be considered in training and testing through a simple and effective mask mechanism. The method is applicable to deep-learning-based target detectors, better alleviates the imbalance between positive and negative samples in such detectors, and improves the accuracy of the target detection algorithm.

Drawings

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flow chart of a target detection method for improving class imbalance according to an embodiment of the present invention;

FIG. 2 is a block diagram of the mask mechanism in the training phase according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of guiding anchor box training with a mask map according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.

To address the imbalance between positive and negative anchor boxes in deep-learning-based target detection, the invention departs from the usual bias weighting methods and introduces a mask mechanism, providing a target detection method for improving class imbalance. The main principle is as follows. A 2D mask is created for the feature map used for detection; the width and height of the mask are the same as those of the feature map. The mask map is composed of individual grid cells, called mask grid cells. Each mask grid cell predicts a score between 0 and 1 that indicates whether the region contains a positive anchor box; the higher the score, the higher the probability that it does. Cells judged to contain a positive anchor box are regarded as positive mask grid cells during training, and only the anchor boxes in positive mask grid cells are selected for training. During testing, prediction boxes are obtained from the mask score and the class scores given by the anchor boxes, thereby completing target detection.

As shown in FIG. 1, a target detection method for improving class imbalance according to an embodiment of the present invention mainly includes the following steps:

Step 1, adding a branch structure for generating a mask map to a conventional feature detection network, to obtain a mask map of the same size as the feature map of a specified resolution.

In the embodiment of the invention, the conventional backbone network structure is improved. The backbone network is the convolutional neural network in a target detection model and can be understood as the network that extracts detection features. Detection layers are placed on feature maps of different resolutions in the backbone network to detect objects of different scales. Modifying the backbone network structure accordingly makes it better suited to the mask-map-based method for alleviating imbalance.

As shown in fig. 2, a preferred embodiment of the improved structure is as follows:

1) The conventional feature detection network extracts several feature maps with different resolutions. A pooling layer is added for each resolution to down-sample the feature map to the size of the lowest-resolution feature map, and all the feature maps are then concatenated. This pooling-concatenation operation yields a feature map that covers objects of different scales and is rich in semantic information, so that object information at different scales is expressed on it; the resulting feature map is named feature map A.

2) Convolution layers are added to convert the size of feature map A into w × h × 1, where w and h are the width and height of the lowest-resolution feature map. The added convolution layers serve, on the one hand, to reduce dimensionality and computation and, on the other hand, to learn a map that represents object information by spatial position.

3) A mask layer is added, and the size-converted feature map A is taken as the mask map. The mask map contains w × h grid cells, and each grid cell is trained with logistic regression (described in step 2 below). This completes the modification of the backbone network structure. It should be noted that the above modifications can be adapted to the actual situation.

Step 2, training the mask map, assigning a positive or negative label to each mask grid cell in the mask map according to the training result, training the anchor boxes on positive mask grid cells, and ignoring the anchor boxes on negative mask grid cells so that they do not participate in training, to finally obtain the trained target detection model.

The preferred embodiment of this step is as follows:

1) Training the mask map

In the embodiment of the present invention, the mask map obtained in step 1 is trained by logistic regression, and the loss function of the logistic regression is expressed as follows:

$$L_{mask} = \sum \left\{ -y \ln \sigma(x) - (1 - y) \ln\left[1 - \sigma(x)\right] \right\}$$

where y is the label of a mask grid cell and x is the output of that cell; σ(x) denotes the sigmoid function, which maps x to a value between 0 and 1, and the resulting value represents the probability that the center of an object's real box falls into the corresponding mask grid cell;

positive and negative labels are assigned to each mask grid cell according to the transformed value and the label assignment principle, which is as follows: if the center of an object's real box lies in the mask grid cell, the cell is assigned as a positive mask grid cell; if the object lies completely outside the mask grid cell, the cell is assigned as a negative mask grid cell.
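The loss above can be sketched in a few lines of Python. This is a minimal illustration that assumes the standard logistic-regression (binary cross-entropy) form, with a negative sign on both logarithm terms; the function names `sigmoid` and `mask_loss` are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable sigmoid, mapping x into (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def mask_loss(outputs, labels, eps=1e-12):
    """Sum of logistic-regression losses over all mask grid cells.

    outputs: raw mask outputs x, one per cell
    labels:  1 for a positive cell, 0 for a negative cell
    eps guards against log(0); it is an implementation assumption.
    """
    total = 0.0
    for x, y in zip(outputs, labels):
        p = sigmoid(x)
        total += -y * math.log(p + eps) - (1 - y) * math.log(1 - p + eps)
    return total


# A cell with output 0 (probability 0.5) and a positive label costs ln 2.
print(mask_loss([0.0], [1]))
```

A confident correct prediction, such as a large positive output on a positive cell, contributes almost nothing to the sum, while a confident wrong one is penalized heavily.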

The mask map has the size of the lowest-resolution feature map. So that it can be applied to multiple detection layers, the mask map is upsampled, both during training and in the testing stage introduced later, to extend it to all detection layers of the conventional feature detection network.

As will be understood by those skilled in the art, the detection layer is used for generating class scores (classification) of corresponding anchor frames by using the features and adjusting coordinate transformation (coordinate regression) of the anchor frames; the detection layer does not participate in the work in the feature extraction stage.

In embodiments of the present invention, the mask map is intended to filter out "negative samples" quickly. Most existing target detection methods predict on a multi-level pyramid, so the mask map must be upsampled and applied to each level before it can take effect in training and testing.

The upsampling method is as follows: each mask grid cell of the expanded mask map is filled with a copy of the value at the corresponding position of the mask map before expansion; during training, the label assignment of the mask grid cells is upsampled in the same way. The mask map can thus be extended to all detection layers by upsampling.
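The copy-based upsampling described above corresponds to nearest-neighbour replication. A minimal sketch follows, assuming the mask is stored as a list of rows; the function name `upsample_by_copy` is hypothetical.

```python
def upsample_by_copy(mask, factor):
    """Expand each cell into a factor x factor block holding the same value."""
    out = []
    for row in mask:
        # Repeat every value horizontally...
        wide = [value for value in row for _ in range(factor)]
        # ...then repeat the widened row vertically.
        out.extend([list(wide) for _ in range(factor)])
    return out


mask = [[1, 0],
        [0, 0]]
up = upsample_by_copy(mask, 2)
for row in up:
    print(row)
# [1, 1, 0, 0]
# [1, 1, 0, 0]
# [0, 0, 0, 0]
# [0, 0, 0, 0]
```

The same function can be applied to predicted scores and to training labels, which keeps the two aligned at every detection layer.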

2) Anchor frame training

Once the mask map is available, the training of the anchor boxes also differs from the usual approach. In the embodiment of the invention, anchor box training follows the positive or negative label of each mask grid cell. As shown in FIG. 3, for positive mask grid cells (for example, the three 2 × 2 regions marked with thicker lines in FIG. 3), the corresponding anchor boxes are assigned labels according to their overlap ratio (IoU) with the real boxes; for negative mask grid cells, the anchor boxes on them are ignored outright and do not participate in training. The number of negative anchor boxes participating in training is thus greatly reduced, and the severely unbalanced positive-negative ratio is greatly alleviated.

The overall loss function for training the anchor boxes can be expressed as follows:

$$L = L_{obj} + L_{mask} + L_{cls} + \alpha L_{loc}$$

where $L_{obj}$ is the object likelihood loss of the anchor boxes, $L_{mask}$ is the loss of the mask map, $L_{cls}$ is the classification loss, $L_{loc}$ is the bounding box regression loss, and α is a weighting factor, which can be set to 2, indicating that the model prioritizes localization accuracy.
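As a small worked example of the weighted sum above, the sketch below combines four already-computed loss values with α = 2. The function name `total_loss` and the numeric values are illustrative assumptions.

```python
def total_loss(l_obj, l_mask, l_cls, l_loc, alpha=2.0):
    """Overall loss: localization is weighted by alpha, the rest by 1."""
    return l_obj + l_mask + l_cls + alpha * l_loc


# Made-up component values, chosen only to show the weighting.
loss = total_loss(l_obj=1.0, l_mask=0.5, l_cls=0.25, l_loc=0.125)
print(loss)  # 2.0
```

With α = 2, a given change in the regression loss moves the total twice as much as the same change in any other term.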

Step 3, in the testing stage, screening the anchor boxes with the trained target detection model to obtain the target detection result.

The preferred embodiment of this step is as follows:

setting two thresholds, wherein the first threshold is used for screening mask lattices, and the second threshold is used for screening anchor frames;

processing the input image through the trained target detection model to obtain a corresponding mask image;

screening out a corresponding mask lattice from the mask graph through a first threshold value; screening out a corresponding anchor frame from the screened mask grids through a second threshold value so as to obtain a preliminary prediction result;

following the usual flow of a target detection algorithm (which can be implemented by conventional techniques), a non-maximum suppression algorithm is applied to the preliminary prediction result to remove redundant anchor boxes, and the prediction boxes with the highest-ranked category scores are then selected as the target detection result.

The present invention will be described in detail with reference to examples below; it is noted that the network types, specific values of the thresholds, etc. referred to in the following examples are all examples and are not meant to be limiting.

This example is introduced in three parts: improving the structure of the backbone network; setting up the mask map in the model training stage; and obtaining the detection result in the model inference stage by combining the output of the mask map.

(1) Improving the structure of the backbone network:

First, the Darknet-53 network is modified to serve as the base feature network. As shown in FIG. 2, pooling layers are added to three layers at different scales, namely layers 74, 61, and 36, so that objects can be found using features from different scales. Taking an input size of 512 × 512 as an example, the resolutions at these three scales are 16 × 16, 32 × 32, and 64 × 64, respectively. Max-pooling layers with window 4 and stride 4, window 3 and stride 2, and window 2 and stride 1 are then attached to these three layers, reducing the resolution of all their feature maps to 16 × 16. The feature maps are then concatenated along the channel dimension into a single feature map that is low in resolution, wide in channels, and rich in semantic information;

it should be noted that the Darknet-53 network used in the present invention is only one of the backbone networks, and other backbone networks such as ResNet-101 and VGG-16 may also be used.

Second, as described in step 1, one 1 × 1 convolution and one 3 × 3 convolution are added after this feature map to reduce dimensionality and expand semantics. Finally, a 1 × 1 convolution is used to generate the mask map. The original network structure is not changed; only a branch for generating the mask is added.

It should be noted that more convolution layers may be added here to further improve the model; the invention only discusses the simplest implementation;

Third, a mask layer is added on top of the mask map from the second step; the mask map contains w × h grid cells, and each grid cell is trained with logistic regression. This completes the modification of the backbone network structure.
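The pooling arithmetic in this example can be checked with the usual output-size formula for a pooling layer. In the sketch below, the pairing of each window and stride with a resolution, and the total padding values, are assumptions chosen so that every branch reaches 16 × 16; the text gives only the window and stride values. The function name `pooled_size` is hypothetical.

```python
def pooled_size(size, window, stride, pad_total=0):
    """Output length along one axis of a pooling layer.

    pad_total is the combined padding on both sides of the axis.
    """
    return (size + pad_total - window) // stride + 1


# (input resolution, window, stride, assumed total padding)
branches = [
    (64, 4, 4, 0),  # 64 x 64 map, window 4, stride 4
    (32, 3, 2, 2),  # 32 x 32 map, window 3, stride 2, one pixel each side
    (16, 2, 1, 1),  # 16 x 16 map, window 2, stride 1, one pixel on one side
]
sizes = [pooled_size(*branch) for branch in branches]
print(sizes)  # [16, 16, 16]
```

Once all three branches share the same spatial size, they can be concatenated along the channel dimension and passed to the convolutions that produce the single-channel mask map.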

(2) Setting a mask graph in a model training stage:

As shown in FIG. 2, the mask map has a single channel; each pixel on it is called a grid cell, and the value of a grid cell indicates whether the center of an object falls inside it. The real boxes in the original image are therefore scaled down according to the size ratio to obtain their widths, heights, and coordinates at the resolution of the mask map. The center point of each box is computed; the grid cell in which the center point falls is assigned a positive label, and grid cells that do not contain objects are assigned negative labels. Each grid cell is trained with logistic regression.
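The label assignment just described can be sketched as follows. The box format `(x_min, y_min, x_max, y_max)` in pixels, the square image and grid, and the function name `assign_mask_labels` are assumptions made for illustration.

```python
def assign_mask_labels(boxes, img_size, grid_size):
    """Mark the grid cell containing each real-box center as positive (1)."""
    labels = [[0] * grid_size for _ in range(grid_size)]
    scale = grid_size / img_size  # shrink image coordinates to mask resolution
    for x_min, y_min, x_max, y_max in boxes:
        cx = (x_min + x_max) / 2 * scale
        cy = (y_min + y_max) / 2 * scale
        # Clamp so a center on the far image border stays inside the grid.
        col = min(int(cx), grid_size - 1)
        row = min(int(cy), grid_size - 1)
        labels[row][col] = 1
    return labels


# One real box in a 512 x 512 image, mapped onto a 16 x 16 mask map.
labels = assign_mask_labels([(100, 200, 180, 280)], img_size=512, grid_size=16)
positives = [(r, c) for r in range(16) for c in range(16) if labels[r][c] == 1]
print(positives)  # [(7, 4)]
```

The box center (140, 240) scales to (4.375, 7.5) on the mask map, so the cell in row 7, column 4 becomes the only positive cell.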

With upsampling, the lowest-resolution mask map can be applied to detection layers at all resolutions. For example, when a 16 × 16 mask map is upsampled to 32 × 32, the cell at the upper-left corner of the 16 × 16 mask map corresponds to a 2 × 2 block of cells in the 32 × 32 mask map, and its value is copied into each of them. With this simple upsampling, the mask labels can be extended to higher resolutions, giving the attribute of every mask grid cell on the higher-resolution mask maps.

Through upsampling, a mask map is assigned to the detection layer at each resolution. The following describes how the mask grid cells guide the training of the anchor boxes:

As shown in FIG. 3, for positive mask grid cells, the anchor boxes are still assigned labels according to their overlap ratio (IoU) with the real boxes; for negative mask grid cells, the anchor boxes on them are ignored outright and do not participate in training. The number of negative anchor boxes participating in training is thus greatly reduced, and the severely unbalanced positive-negative ratio is greatly alleviated. According to statistics, taking an input size of 512 × 512 as an example, the positive-negative ratio within one batch can originally reach 1:4000; after the object likelihood mask is used, the ratio within one batch stays at about 1:30, so the extreme imbalance is greatly alleviated.

(3) Combining mask graphs during model inference

When the model runs forward inference on test images, grid cells with very low scores on the mask map are filtered out first: if the mask score is below 0.1 (the first threshold), the grid cell is discarded. Next, the anchor boxes in the remaining mask grid cells are screened: if the object likelihood score is below 0.05 (the second threshold), the anchor box is discarded, which yields the preliminary prediction result. Then, following the flow of the target detection algorithm, non-maximum suppression is applied to the preliminary prediction result, the remaining boxes are sorted, and the top 100 prediction boxes are selected as the final detection result.
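The inference-time screening can be sketched as a small pipeline. This is an illustrative sketch only: the candidate format (a dict with `box`, `mask`, `obj`, and `cls` keys), the class-agnostic greedy suppression, and the function names are assumptions; the NMS threshold of 0.6 is the default value given later in this description.

```python
def iou(a, b):
    """Overlap ratio of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def detect(candidates, mask_thr=0.1, obj_thr=0.05, nms_thr=0.6, top_k=100):
    """Screen by mask score, then by object likelihood, then apply NMS."""
    kept = [c for c in candidates
            if c["mask"] >= mask_thr and c["obj"] >= obj_thr]
    kept.sort(key=lambda c: c["cls"], reverse=True)
    final = []
    for cand in kept:
        # Keep a box only if it does not overlap a better box too strongly.
        if all(iou(cand["box"], f["box"]) <= nms_thr for f in final):
            final.append(cand)
    return final[:top_k]


candidates = [
    {"box": (0, 0, 10, 10), "mask": 0.9, "obj": 0.8, "cls": 0.9},
    {"box": (1, 0, 11, 10), "mask": 0.9, "obj": 0.7, "cls": 0.8},     # redundant
    {"box": (50, 50, 60, 60), "mask": 0.05, "obj": 0.9, "cls": 0.95},  # low mask
    {"box": (20, 20, 30, 30), "mask": 0.5, "obj": 0.01, "cls": 0.9},   # low obj
]
result = detect(candidates)
print([c["box"] for c in result])  # [(0, 0, 10, 10)]
```

Because the mask threshold is applied first, anchors in discarded cells never reach the later stages, which is where the reduction in candidates comes from.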

It should be noted that the mask threshold and the anchor box threshold used for screening may be determined for each dataset with a small number of forward inference runs during the test phase.

Comparison experiments show that, after the mask mechanism is adopted to alleviate imbalance, the improved model obtains higher object likelihood scores than the original model and shows a higher recall rate, improving detection accuracy.

For ease of understanding, the following supplements the above process:

(1) Establishment of the base network

The Darknet-53 network is the base network used by YOLOv3, which contains 53 convolutional layers, with a large number of 3 × 3 and 1 × 1 convolutional layers being used in the network. Its accuracy on ImageNet dataset is very similar to that of ResNet-101, indicating that the feature expression capability of the underlying networks of Darknet-53 and ResNet-101 is essentially the same. When the backbone network is selected, a model with higher accuracy can be selected to establish the backbone network according to the accuracy on the ImageNet data set, and the corresponding detection result is possibly better.

(2) Training of model parameters

FIG. 3 visualizes how the object likelihood mask map is applied to the detection layers as supervision in the training phase. For a positive mask grid cell, only the anchor boxes within that cell are used for training. As described above, the values of the mask grid cells are used to assign their positive and negative labels and thereby select the positive mask grid cells; this can be understood as a preliminary screening of cells. After the mask grid cells have been screened, only the anchor boxes in the remaining cells need to be considered. The anchor boxes may be assigned as follows: when the overlap ratio between an anchor box and the corresponding object is greater than a threshold, a positive label is assigned; when it is less than the threshold, a negative label is assigned. This threshold can generally be set to 0.5.
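The two-stage assignment above, cells first and anchors second, can be sketched as follows. The anchor format `(row, col, box)`, the use of `None` for ignored anchors, the treatment of an overlap exactly equal to the threshold as negative, and the function names are assumptions for illustration.

```python
def iou(a, b):
    """Overlap ratio of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def label_anchors(anchors, gt_box, mask_labels, iou_thr=0.5):
    """Return 1 (positive), 0 (negative) or None (ignored) per anchor."""
    labels = []
    for row, col, box in anchors:
        if mask_labels[row][col] == 0:
            labels.append(None)  # negative mask cell: anchor is not trained
        else:
            labels.append(1 if iou(box, gt_box) > iou_thr else 0)
    return labels


mask_labels = [[1, 0],
               [0, 0]]
gt_box = (0, 0, 10, 10)
anchors = [
    (0, 0, (0, 0, 10, 10)),   # positive cell, IoU 1.0
    (0, 0, (0, 0, 4, 4)),     # positive cell, IoU 0.16
    (0, 1, (10, 0, 20, 10)),  # negative cell
]
anchor_labels = label_anchors(anchors, gt_box, mask_labels)
print(anchor_labels)  # [1, 0, None]
```

Only the first two anchors contribute to the loss; the third is dropped before any overlap is computed, which is how the number of negative samples is reduced.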

In addition, data augmentation methods commonly used in target detection, such as horizontal flipping, random cropping, color transformation, and multi-scale training, can be adopted to improve the robustness of the detector.

(3) Model forward reasoning

A mask threshold (i.e., the first threshold) of 0.1 and an object likelihood threshold (i.e., the second threshold) of 0.05 may be set to filter out useless prediction boxes. A non-maximum suppression threshold (i.e., NMS threshold) of 0.6 is used to remove redundant boxes. In practice, convolutional layers and batch normalization layers may also be fused to speed up inference.

According to the above description, the added model introduces a total of three tunable parameters. These parameters need to be adjusted only during testing, not during training, and are listed in the following table:

Parameter	Mask threshold	Anchor box threshold	NMS threshold
Default value	0.1	0.05	0.6

TABLE 1 parameter values

Each threshold can be varied in steps of 0.05, and the best combination is obtained through a few forward inference runs on the validation set. The parameters introduced in this implementation can thus be adjusted as appropriate for the dataset actually used.
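One way to organize such a search is sketched below. It is only an illustration: the exhaustive grid over all three thresholds, the 0.05 to 0.95 range, and the `evaluate` callback are assumptions, and `toy_evaluate` is a stand-in for a real validation run rather than a measured result.

```python
from itertools import product


def threshold_grid(start=0.05, stop=0.95, step=0.05):
    """Candidate threshold values spaced by the given step."""
    count = int(round((stop - start) / step)) + 1
    return [round(start + i * step, 2) for i in range(count)]


def search_thresholds(evaluate, grid):
    """Return the (mask, anchor, nms) triple with the best validation score.

    evaluate(mask_thr, obj_thr, nms_thr) is assumed to run forward inference
    on the validation set and return a score where higher is better.
    """
    return max(product(grid, repeat=3), key=lambda t: evaluate(*t))


def toy_evaluate(mask_thr, obj_thr, nms_thr):
    """Toy score peaking at the default values, for demonstration only."""
    return -((mask_thr - 0.1) ** 2 + (obj_thr - 0.05) ** 2 + (nms_thr - 0.6) ** 2)


best = search_thresholds(toy_evaluate, threshold_grid())
print(best)  # (0.1, 0.05, 0.6)
```

Since only test-time thresholds change, each candidate needs forward inference alone and no retraining, so a coarser grid or a per-parameter sweep would be a cheaper alternative.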

The improved scheme provided by the invention is suitable for difficult target detection scenarios such as dense or small objects. In experiments, the method improves average precision by about 8% compared with bias-weighting-based methods.

The scheme provided by the embodiment of the invention can better solve the problem of unbalance of the positive and negative samples in the depth target detector, thereby improving the accuracy of the target detection algorithm.

Through the above description of the embodiments, it is clear to those skilled in the art that the above embodiments can be implemented by software, and can also be implemented by software plus a necessary general hardware platform. With this understanding, the technical solutions of the embodiments can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.), and includes several instructions for enabling a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the methods according to the embodiments of the present invention.

The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A method for detecting an object with improved class imbalance, comprising:

adding a branch structure for generating a mask graph in the feature detection network to obtain the mask graph with the same size as the feature graph with the specified resolution;

training the mask image, distributing positive and negative labels for each mask lattice in the mask image according to a training result, training an anchor frame on the positive mask lattice, neglecting the anchor frame on the negative mask lattice, and not participating in training to finally obtain a trained target detection model;

in the testing stage, screening the anchor frame by using the trained target detection model to further obtain a target detection result;

wherein, adding a branch structure for generating a mask graph in the feature detection network to obtain the mask graph with the same size as the feature graph with the specified resolution comprises:

the feature detection network can extract a plurality of feature maps with different resolutions, down-sampling the feature map with each resolution by adding a pooling layer, reducing the feature map with each resolution to the same size as the feature map with the lowest resolution, and splicing all the feature maps to obtain a feature map A containing objects with different scales and semantic information;

converting the size of the feature map A into w × h × 1 by adding convolution layers, wherein w and h are the width and height of the feature map with the lowest resolution respectively;

adding a masking layer, and regarding the feature diagram A after size conversion as a masking diagram;

the training of the mask map and the allocating of positive and negative labels to each mask lattice in the mask map according to the training result comprise:

and training the mask graph by adopting a logistic regression mode, wherein the loss function of the logistic regression is expressed as follows:

$$L_{mask} = \sum \left\{ -y \ln \sigma(x) - (1 - y) \ln\left[1 - \sigma(x)\right] \right\}$$

allocating positive and negative labels for each mask lattice according to the numerical value obtained by conversion and the allocation principle of the mask lattice labels; the distribution principle of the mask grid label is as follows: if the center of the real object box is in the mask grid, distributing the corresponding mask grid as a positive mask grid; if the object is completely outside the mask lattice, the corresponding mask lattice is assigned as a negative mask lattice.

2. The method for detecting an object with improved class imbalance of claim 1, wherein the method further comprises: the mask map is upsampled to extend it to all detection layers of the feature detection network.

3. The method for detecting the target with the improved class imbalance according to claim 1, wherein anchor frame training is carried out according to the positive and negative label types of the mask lattices, and for a positive mask lattice, the corresponding anchor frames are assigned according to their overlap ratio with the real frames;

the overall loss function of the training anchor box can be expressed as follows:

$$L = L_{obj} + L_{mask} + L_{cls} + \alpha L_{loc}$$

wherein $L_{obj}$ is the object likelihood loss of the anchor frames, $L_{mask}$ is the loss of the mask map, $L_{cls}$ is the classification loss, $L_{loc}$ is the loss of the bounding box regression, and α is a weighting factor.

4. The method for detecting the target with improved class imbalance according to claim 1, wherein in the testing stage, the screening of the anchor frame by using the trained target detection model to obtain the target detection result includes:

processing the input image through a trained target detection model to obtain a corresponding mask image;

according to the flow of the target detection algorithm, a non-maximum suppression algorithm is applied to the preliminary prediction result to remove redundant anchor frames, and then the prediction frames with the top-ranked category scores are selected as the target detection result.