CN110210482A

CN110210482A - Object detection method for mitigating class imbalance

Info

Publication number: CN110210482A
Application number: CN201910494701.8A
Authority: CN
Inventors: 徐童; 陈卓; 刘�东; 陈恩红
Original assignee: University of Science and Technology of China USTC
Current assignee: University of Science and Technology of China USTC
Priority date: 2019-06-05
Filing date: 2019-06-05
Publication date: 2019-09-06
Anticipated expiration: 2039-06-05
Also published as: CN110210482B

Abstract

The invention discloses an object detection method that mitigates class imbalance, comprising: adding a branch structure for generating a mask map to a conventional feature detection network, so as to obtain a mask map of the same size as the feature map of a specified resolution; training the mask map and assigning a positive or negative label to each mask cell in the mask map according to the training result, where the anchor boxes on positive mask cells are trained and the anchor boxes on negative mask cells are ignored and take no part in training, finally obtaining a trained object detection model; and, in the test phase, screening anchor boxes with the trained object detection model to obtain the object detection result. The method applies well to deep-learning-based object detectors and better addresses the imbalance between positive and negative samples in deep object detectors, thereby improving the accuracy of the object detection algorithm.

Description

Object detection method for mitigating class imbalance

Technical field

The present invention relates to the technical field of object detection, and in particular to an object detection method that mitigates class imbalance.

Background art

Object detection is a fundamental task in computer vision: it locates all objects in an image and classifies each of them. Results on the major authoritative datasets show that existing deep-learning-based object detectors are the state-of-the-art detection methods, and most of them use anchor boxes to predict detection boxes. An anchor box (anchor) can be understood as a preset box; an object detector sets anchor boxes of various shapes to cover objects of different sizes. For a feature map of size W × H, setting 9 different anchor box shapes yields W × H × 9 anchor boxes, and, viewed from the input image, the centers of these anchor boxes are evenly spaced over the image. Training these detectors revolves around how the anchor boxes are trained. For example, an anchor box with a large overlap with a real object is regarded as a positive sample (positive anchor box): its coordinates are regressed so that it approaches the true object region, and the class of the real object supervises the classification of the anchor box. An anchor box with a small overlap with real objects is regarded as a negative sample (negative anchor box): its coordinate regression task is ignored, and the "background" class supervises its classification.

However, during detector training, the training of anchor boxes is severely affected by class imbalance. Because the objects in an image are limited, positive anchor boxes are few while negative anchor boxes far outnumber them, which makes the positive-to-negative ratio unbalanced. When few anchor box types are set, the ratio may reach about 1:1000; when many are set, it may reach 1:10000 or more. Such an unbalanced ratio greatly affects the reliability of the detector: its predictions are biased toward the "background" class and the recall rate drops.

Existing methods for addressing class imbalance are mainly variants of biased weighting. Examples include sampling during training so that the positive-to-negative ratio is fixed at 1:3; considering only "hard samples" with larger losses during backpropagation to balance the ratio; or applying an adaptive weight to every sample. These approaches bring two problems: (1) it is hard to say how the introduced parameters are best set, so they can only be chosen heuristically, and the effect depends on the dataset; (2) sampling-based methods merely ease model training and do not substantially change the positive-to-negative ratio of anchor boxes during training, which makes training inconsistent with testing.

Summary of the invention

The object of the present invention is to provide an object detection method that mitigates class imbalance and improves the accuracy of object detection.

The object of the present invention is achieved through the following technical solution:

An object detection method for mitigating class imbalance, comprising:

adding a branch structure for generating a mask map to a conventional feature detection network, so as to obtain a mask map of the same size as the feature map of a specified resolution;

training the mask map and assigning a positive or negative label to each mask cell in the mask map according to the training result, where the anchor boxes on positive mask cells are trained and the anchor boxes on negative mask cells are ignored and take no part in training, finally obtaining a trained object detection model;

in the test phase, screening anchor boxes with the trained object detection model to obtain the object detection result.

As can be seen from the above technical solution, compared with conventional methods for addressing anchor box class imbalance in object detectors, the present invention tends to solve the problem at its root by reducing the number of anchor boxes that need to be considered in training and testing. It provides a simple and efficient mask mechanism that applies well to deep-learning-based object detectors and better addresses the imbalance between positive and negative samples in deep object detectors, thereby improving the accuracy of the object detection algorithm.

Brief description of the drawings

To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flow chart of an object detection method for mitigating class imbalance according to an embodiment of the present invention;

Fig. 2 is a structural diagram of the mask mechanism in the training stage according to an embodiment of the present invention;

Fig. 3 is a schematic diagram of using the mask map to guide anchor box training according to an embodiment of the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.

To address the imbalance between positive and negative anchors in deep-learning-based object detection, and unlike common biased weighting methods, the present invention introduces a mask mechanism and proposes an object detection method that mitigates class imbalance. Its main principle is as follows. A 2D mask map is created for the feature map used for detection, with the same width and height as the feature map. The mask map consists of individual grid cells, called mask cells. Each mask cell predicts a score between 0 and 1 that indicates whether its region contains a positive anchor box; a cell that does is treated as a positive mask cell during training, and a higher score indicates a higher probability that the cell is positive. During training, only the anchor boxes in positive mask cells are selected for training. During testing, prediction boxes are obtained from the mask scores and the classification scores given by the anchor boxes, thereby completing object detection.

As shown in Fig. 1, an object detection method for mitigating class imbalance according to an embodiment of the present invention mainly comprises the following steps:

Step 1: add a branch structure for generating a mask map to a conventional feature detection network, so as to obtain a mask map of the same size as the feature map of a specified resolution.

In the embodiment of the present invention, the conventional backbone network structure is improved. The backbone network is the convolutional neural network in the object detection model and can be understood as the network that extracts detection features. The backbone network has different detection layers on feature maps of different resolutions, for detecting objects of different scales. Modifying the backbone structure accordingly makes it better suited to the method of the invention, which uses the mask map to mitigate the imbalance in object detection.

As shown in Fig. 2, a preferred way of improving the structure is as follows:

1) A conventional feature detection network extracts feature maps of several different resolutions. Pooling layers are added to down-sample the feature map of each resolution to the size of the lowest-resolution feature map, and all the feature maps are then concatenated. This pooling-concatenation operation yields a semantically rich feature map that covers objects of different scales; on it, object information at every scale has a corresponding representation. The feature map obtained by pooling and concatenation is denoted feature map A.

2) Convolutional layers are added so that the size of feature map A becomes w × h × 1, where w and h are the width and height of the lowest-resolution feature map. The added convolutional layers serve, on the one hand, to reduce the dimensionality and thus the amount of computation and, on the other hand, to add the desired learned mapping so that the map can express object information at each spatial position.

3) A mask layer is added, and the resized feature map A is taken as the mask map. The mask map has w × h grid cells, each trained using logistic regression (described in step 2 below). This completes the modification of the backbone network structure. It should be noted that the above modification may be adjusted as appropriate to actual conditions.

Step 2: train the mask map and assign a positive or negative label to each mask cell in the mask map according to the training result; the anchor boxes on positive mask cells are trained, and the anchor boxes on negative mask cells are ignored and take no part in training, finally obtaining a trained object detection model.

A preferred implementation of this step is as follows:

1) Mask map training.

In the embodiment of the present invention, the mask map obtained in the preceding step is trained by logistic regression, with the following loss function:

L_mask = -∑ { y ln σ(x) + (1 - y) ln[1 - σ(x)] }

where y is the label of a mask cell, x is the output of the mask cell, and σ(x) denotes the sigmoid function applied to x, which maps the output to between 0 and 1; the transformed value represents the probability that the center of an object's ground-truth box falls inside the corresponding mask cell.

Each mask cell is assigned a positive or negative label according to the transformed value and the label assignment rule for mask cells. The rule is: if the center of an object's ground-truth box falls within a mask cell, that cell is assigned as a positive mask cell; if the object lies entirely outside a mask cell, that cell is assigned as a negative mask cell.
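The logistic-regression loss for the mask map can be sketched in a few lines of plain Python. This is a minimal illustration, not the patented implementation: the function names `sigmoid` and `mask_loss`, the flattened list inputs, and the small `eps` guard against log(0) are assumptions made for the example.

```python
import math


def sigmoid(x):
    """Numerically stable sigmoid, mapping a raw cell output to (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def mask_loss(outputs, labels, eps=1e-12):
    """Summed binary cross-entropy over mask cells.

    outputs: raw (pre-sigmoid) outputs x, one per mask cell.
    labels:  1 for a positive mask cell, 0 for a negative one.
    eps is an illustrative guard against log(0), not part of the source formula.
    """
    total = 0.0
    for x, y in zip(outputs, labels):
        p = sigmoid(x)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1.0 - p + eps)
    return total
```

A cell whose output is 0 (probability 0.5) contributes ln 2 to the loss, and outputs that agree with the labels give a smaller loss than outputs that contradict them.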

The above mask map has the size of the lowest-resolution feature map. So that it can be applied to multiple detection layers, during training and in the test phase described later, the mask map is up-sampled so that it extends to all detection layers of the conventional feature detection network.

Those skilled in the art will understand that a detection layer uses features to generate the classification score of each anchor box (classification) and to adjust the coordinate transformation of the anchor box (coordinate regression); detection layers do not take part in the feature extraction stage.

In the embodiment of the present invention, the aim is to generate a mask map that quickly filters out "negative samples". Most existing object detection methods use a multi-layer "pyramid" prediction scheme, so the mask map is up-sampled here and applied to every layer, allowing it to take effect in both training and testing.

The up-sampling method is as follows: each mask cell of the enlarged mask map is filled by copying the value at the corresponding position of the mask map before enlargement. During training, the label assignment of the mask cells is up-sampled in the same way. The mask map can therefore be extended to all detection layers by up-sampling.
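The copy-based up-sampling described above amounts to nearest-neighbor replication. The following sketch assumes the mask map is a list of rows and that the enlargement factor is an integer; the name `upsample_mask` is hypothetical.

```python
def upsample_mask(mask, factor):
    """Enlarge a 2D mask map by copying each cell into a factor x factor block."""
    enlarged = []
    for row in mask:
        # Repeat every value `factor` times horizontally.
        wide_row = [value for value in row for _ in range(factor)]
        # Repeat the widened row `factor` times vertically.
        for _ in range(factor):
            enlarged.append(list(wide_row))
    return enlarged
```

The same function can be applied to the label map during training and to the score map during testing, so a cell and the block it expands into always carry the same value.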

2) Anchor box training.

With the mask map available, anchor box training also differs from the usual approach. In the embodiment of the present invention, anchor boxes are trained according to the positive or negative label of the mask cells. As shown in Fig. 3, for positive mask cells (for example, the three 2 × 2 regions marked with thicker lines in Fig. 3), the corresponding anchor boxes are assigned according to their overlap ratio (IoU) with the ground-truth boxes; for negative mask cells, the anchor boxes on them are simply ignored and take no part in training. As a result, the number of negative anchor boxes participating in training is greatly reduced, and the severely unbalanced positive-to-negative ratio is greatly alleviated.
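To make the reduction concrete, the sketch below collects only the anchor boxes that sit on positive mask cells. It assumes every cell carries the same number of anchor boxes and identifies an anchor by a hypothetical (row, column, anchor index) triple; none of these names come from the source.

```python
def anchors_to_train(cell_labels, anchors_per_cell):
    """Return (row, col, anchor_index) for anchors on positive mask cells only.

    cell_labels: 2D list with 1 for positive and 0 for negative mask cells.
    Anchors on negative cells are skipped and take no part in training.
    """
    kept = []
    for r, row in enumerate(cell_labels):
        for c, label in enumerate(row):
            if label == 1:
                kept.extend((r, c, a) for a in range(anchors_per_cell))
    return kept
```

For a 16 × 16 map with 9 anchors per cell there are 2304 anchors in total; if only three cells are positive, 27 anchors remain for IoU-based assignment.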

The overall loss function for training can be expressed as:

L = L_obj + L_mask + L_cls + α L_loc

where L_obj is the objectness loss of the anchor boxes, L_mask is the mask map loss, L_cls is the classification loss, L_loc is the bounding box regression loss, and α is a weighting factor that may be set to 2, indicating that the model favors localization accuracy first.

Step 3: in the test phase, screen anchor boxes with the trained object detection model to obtain the object detection result.

A preferred implementation of this step is as follows:

Two thresholds are set: the first is used to screen mask cells, and the second is used to screen anchor boxes.

The input image is processed by the trained object detection model to obtain the corresponding mask map.

Mask cells are screened from the mask map using the first threshold, and anchor boxes are then screened from the retained mask cells using the second threshold, which gives the preliminary prediction result.

Following the usual procedure of an object detection algorithm (which can be implemented by conventional techniques), the non-maximum suppression algorithm is applied to the preliminary prediction result to remove redundant anchor boxes, and the several prediction boxes with the highest class scores are then selected as the object detection result.
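The two-threshold screening can be sketched as follows. The data layout is an assumption made for illustration: each anchor is a dictionary with a hypothetical `cell` key (row, column on the mask map) and a hypothetical `score` key holding the value compared against the second threshold.

```python
def screen_anchors(mask_scores, anchors, mask_threshold, anchor_threshold):
    """Keep anchors that pass both screening stages.

    Stage 1 drops every anchor whose mask cell scores below mask_threshold.
    Stage 2 drops remaining anchors whose own score is below anchor_threshold.
    """
    kept = []
    for anchor in anchors:
        row, col = anchor["cell"]
        if mask_scores[row][col] < mask_threshold:
            continue  # the whole cell was filtered out
        if anchor["score"] < anchor_threshold:
            continue  # the anchor itself was filtered out
        kept.append(anchor)
    return kept
```

The surviving anchors form the preliminary prediction result that is passed on to non-maximum suppression.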

The present invention is described in detail below with reference to an example. It should be noted that the network type, the specific threshold values, and the like in the following example are illustrative and not limiting.

This example covers three aspects: improving the structure of the backbone network, setting the mask map in the model training stage, and combining the mask map to output detection results in the model inference stage.

(1) Improving the structure of the backbone network:

1. The feature network is improved on the basis of the Darknet-53 network. As shown in Fig. 2, pooling layers are added at layers 74, 61, and 36, three layers of different scales, so that objects can be found using features at different scales. For a 512 × 512 input, the resolutions at these three scales are 16 × 16, 32 × 32, and 64 × 64, respectively. Max pooling layers with window 4 and stride 4, window 3 and stride 2, and window 2 and stride 1 are then attached to these three layers, reducing the resolution of all their feature maps to 16 × 16. These feature maps are then concatenated along the channel dimension into a single feature map that has lower resolution but more channels and is rich in semantic information;

It should be noted that the Darknet-53 network used here is only one of many backbone networks; other backbone networks, such as ResNet-101 or VGG-16, may also be used and converted correspondingly following the above improvement idea of the invention.

2. After the feature map described in 1., one 1 × 1 convolution and one 3 × 3 convolution are added to reduce the dimensionality and enrich the semantics. Finally, one 1 × 1 convolution generates the mask map. The structure of the original network is not changed in any way; only a branch that generates the mask is added.

It should be noted that suitably adding more convolutional layers afterwards may further improve the model; the present invention discusses only the simplest implementation;

3. For the mask map obtained in 2., a mask layer is then added. The mask map has w × h grid cells, each trained using logistic regression. This completes the modification of the backbone network structure.
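The pooling settings in item 1 can be checked with the standard output-size formula for a pooling layer. The sketch below rests on two assumptions that the source does not state: that the (window, stride) pairs apply to the resolutions from highest to lowest, and that the last two pooling layers use padding. The helper name `pooled_size` and the `pad_total` parameter (total padding along one axis) are hypothetical.

```python
def pooled_size(size, window, stride, pad_total=0):
    """Output length of a pooling layer along one axis (floor rounding)."""
    return (size + pad_total - window) // stride + 1


# Assumed pairing: 64 -> (4, 4), 32 -> (3, 2), 16 -> (2, 1).
SIZE_64 = pooled_size(64, window=4, stride=4)               # no padding needed
SIZE_32 = pooled_size(32, window=3, stride=2, pad_total=2)  # one pixel per side
SIZE_16 = pooled_size(16, window=2, stride=1, pad_total=1)  # one pixel in total
```

Under these assumptions all three branches come out at 16, which matches the 16 × 16 target stated in item 1; without padding, the second and third settings would give 15 instead.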

(2) Setting the mask map in the model training stage:

As shown in Fig. 2, the mask map has a channel dimension of 1, and each pixel on it is called a grid cell; the value of a cell represents whether the center of an object falls inside it. Therefore, each ground-truth box in the original image is scaled down in proportion to the size ratio, giving the width, height, and coordinates of the ground-truth box at the resolution of the mask map. The center point of this box is computed, and the cell in which the center point falls is assigned a positive label; cells that contain no object are assigned a negative label. Every cell is trained using logistic regression.
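The label assignment just described can be sketched as below. The example assumes a square image and a square mask map, and ground-truth boxes given as (x_min, y_min, x_max, y_max) in pixels; the function name and this box format are illustrative choices, not taken from the source.

```python
def assign_mask_labels(boxes, image_size, grid_size):
    """Label the mask cell containing each ground-truth box center as positive."""
    scale = grid_size / image_size  # proportional scaling to mask resolution
    labels = [[0] * grid_size for _ in range(grid_size)]
    for x_min, y_min, x_max, y_max in boxes:
        center_x = (x_min + x_max) / 2 * scale
        center_y = (y_min + y_max) / 2 * scale
        # Clamp so a center on the right or bottom image edge stays in range.
        col = min(int(center_x), grid_size - 1)
        row = min(int(center_y), grid_size - 1)
        labels[row][col] = 1
    return labels
```

With a 512 × 512 image and a 16 × 16 mask map, each cell covers 32 × 32 pixels, so a box spanning (0, 0) to (64, 64) has its center at (32, 32) and marks the cell at row 1, column 1.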

By up-sampling, the lowest-resolution mask map can be applied to the detection layers of all resolutions. For example, when a 16 × 16 mask map is up-sampled to a 32 × 32 mask map, the cell in the top-left corner of the 16 × 16 map corresponds to a 2 × 2 block of cells in the 32 × 32 map, and its value is copied into that block. With this simple up-sampling, the mask labels can be extended to higher resolutions, which gives the attribute of each mask cell on the higher-resolution mask maps.

Through up-sampling, a mask map is assigned to the detection layer at each resolution. The following describes how the mask cells guide anchor box training:

As shown in Fig. 3, for positive mask cells, anchor boxes are still assigned according to their overlap ratio (IoU) with the ground-truth boxes; for negative mask cells, the anchor boxes on them are simply ignored and take no part in training. As a result, the number of negative anchor boxes participating in training is greatly reduced, and the severely unbalanced positive-to-negative ratio is greatly alleviated. According to statistics, for an input size of 512 × 512, the positive-to-negative ratio in a batch can originally reach 1:4000, whereas with the objectness mask it stays at about 1:30, which greatly alleviates the extremely unbalanced ratio.

(3) Combining the mask map in the model inference stage:

When the model runs forward inference on a test image, the cells with too small a score on the mask map are filtered out first: if the mask score of a cell is below 0.1 (the first threshold), the cell is discarded. Next, anchor boxes are screened within the remaining mask cells: if the objectness score of an anchor box is below 0.05 (the second threshold), the anchor box is discarded, which gives the preliminary prediction result. Then, following the usual procedure of an object detection algorithm, the non-maximum suppression algorithm is applied to the preliminary prediction result, the remaining boxes are sorted, and the top 100 prediction boxes are selected as the final detection result.

It should be noted that the thresholds for screening mask cells and anchor boxes may differ between datasets and can be determined in the test phase from a small number of forward inference runs.

Comparative experiments show that, after the imbalance is alleviated by the mask mechanism, the improved model obtains higher objectness scores than the original model, achieves a higher recall rate, and improves detection accuracy.

For ease of understanding, the following supplements the above process where necessary:

(1) Building the base network

Darknet-53 is the base network used in YOLOv3. It contains 53 convolutional layers and makes heavy use of 3 × 3 and 1 × 1 convolutional layers. Its accuracy on the ImageNet dataset is very close to that of ResNet-101, which indicates that the feature representation capabilities of Darknet-53 and ResNet-101 as base networks are essentially the same. When choosing a backbone network, the accuracy on the ImageNet dataset can serve as a basis: a model with higher accuracy can be chosen as the backbone, and the corresponding detection results may also be better.

(2) Training of model parameters

Fig. 3 visualizes how the objectness mask supervision map is applied to the detection layers in the training stage. For a positive mask cell, only the anchor boxes in that cell are used for training. As described above, the positive and negative labels of the mask cells are assigned from the values of the cells, which screens out the positive mask cells; this can be understood as a preliminary screening whose unit is the cell. After the mask cells are screened, only the anchor boxes in the remaining cells need to be considered. Anchor boxes can be assigned as follows: when the overlap ratio between an anchor box and the corresponding object is greater than a threshold, it is assigned a positive label; when the overlap ratio is less than the threshold, it is assigned a negative label. This threshold can generally be set to 0.5.
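The IoU-based assignment for anchors in the retained cells can be sketched as follows. Boxes are assumed to be (x_min, y_min, x_max, y_max) tuples, and the function names are hypothetical. The source does not say how an overlap exactly equal to the threshold is handled; this sketch treats it as negative.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes."""
    inter_w = min(box_a[2], box_b[2]) - max(box_a[0], box_b[0])
    inter_h = min(box_a[3], box_b[3]) - max(box_a[1], box_b[1])
    inter = max(0.0, inter_w) * max(0.0, inter_h)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def anchor_label(anchor_box, gt_box, threshold=0.5):
    """Return 1 (positive) if the overlap exceeds the threshold, else 0."""
    return 1 if iou(anchor_box, gt_box) > threshold else 0
```

Two 2 × 2 boxes offset by one unit in each direction share an area of 1 out of a union of 7, so the anchor is labeled negative at the 0.5 threshold.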

In addition, data augmentation methods commonly used in object detection, such as random horizontal flipping, color transformation, and multi-scale training, may be used to help improve the robustness of the detector.

(3) Model forward inference

The mask threshold (i.e., the first threshold) may be set to 0.1 and the objectness threshold (i.e., the second threshold) to 0.05, which helps filter out useless prediction boxes. A non-maximum suppression threshold (i.e., NMS threshold) of 0.6 is used to remove redundant boxes. In implementation, fusing convolutional layers with batch normalization layers can also help increase speed.
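A greedy non-maximum suppression pass with the 0.6 threshold can be sketched as below. This is the common textbook form of NMS rather than a procedure spelled out in the source; the box format (x_min, y_min, x_max, y_max), the function names, and the `top_k` cap (set to the 100 boxes used in the example above) are assumptions for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes."""
    inter_w = min(box_a[2], box_b[2]) - max(box_a[0], box_b[0])
    inter_h = min(box_a[3], box_b[3]) - max(box_a[1], box_b[1])
    inter = max(0.0, inter_w) * max(0.0, inter_h)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.6, top_k=100):
    """Return indices of kept boxes, highest score first.

    A box is dropped when its IoU with an already kept, higher-scoring
    box exceeds iou_threshold.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
            if len(kept) == top_k:
                break
    return kept
```

For example, two 10 × 10 boxes shifted by one pixel overlap with an IoU of about 0.82, so only the higher-scoring one survives, while a distant third box is kept.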

As described above, the added model introduces three parameters in total that need to be adjusted. These parameters only need to be adjusted at test time, not during training, and are listed below:

Parameter	Mask threshold	Anchor box threshold	NMS threshold
Default value	0.1	0.05	0.6

Table 1. Parameter values

Each threshold can be searched with a step of 0.05, and the best combination can be obtained with a small number of forward inference runs on the validation set. The parameters introduced in this implementation can therefore be adjusted as appropriate to the dataset used.
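One way to organize such a search is a plain grid over the three thresholds with a step of 0.05. The sketch below is an assumption about how the search might be arranged, not a procedure given in the source: `evaluate` is a hypothetical callback that runs forward inference on the validation set with the given thresholds and returns a quality score to maximize.

```python
from itertools import product


def threshold_grid(start, stop, step=0.05):
    """Evenly spaced candidate thresholds from start to stop inclusive."""
    count = int(round((stop - start) / step)) + 1
    return [round(start + i * step, 2) for i in range(count)]


def search_thresholds(evaluate, grid):
    """Try every (mask, anchor, NMS) combination and keep the best score.

    evaluate(mask_thr, anchor_thr, nms_thr) is a user-supplied function;
    the first combination reaching the best score wins ties.
    """
    best = None
    for mask_thr, anchor_thr, nms_thr in product(grid, repeat=3):
        score = evaluate(mask_thr, anchor_thr, nms_thr)
        if best is None or score > best[0]:
            best = (score, mask_thr, anchor_thr, nms_thr)
    return best
```

A full three-way grid grows cubically with the number of candidates, so in practice a narrow range around the default values in Table 1, or tuning one threshold at a time, keeps the number of inference runs small.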

The improved scheme provided by the present invention can adapt to more difficult object detection scenarios such as dense objects and small objects. In experiments, it improves average precision by about 8% compared with related biased weighting methods.

The scheme provided by the embodiments of the present invention better addresses the imbalance between positive and negative samples in deep object detectors, thereby improving the accuracy of the object detection algorithm.

From the above description of the embodiments, those skilled in the art can clearly understand that the above embodiments may be implemented in software, or in software together with a necessary general-purpose hardware platform. Based on this understanding, the technical solution of the above embodiments may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash drive, or a removable hard disk) and includes several instructions that cause a computer device (such as a personal computer, a server, or a network device) to execute the methods described in the embodiments of the present invention.

The above are only preferred embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any change or substitution that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. An object detection method for mitigating class imbalance, characterized by comprising:

adding a branch structure for generating a mask map to a conventional feature detection network, so as to obtain a mask map of the same size as the feature map of a specified resolution;

training the mask map and assigning a positive or negative label to each mask cell in the mask map according to the training result, where the anchor boxes on positive mask cells are trained and the anchor boxes on negative mask cells are ignored and take no part in training, finally obtaining a trained object detection model;

in the test phase, screening anchor boxes with the trained object detection model to obtain the object detection result.

2. The object detection method for mitigating class imbalance according to claim 1, characterized in that adding a branch structure for generating a mask map to a conventional feature detection network, so as to obtain a mask map of the same size as the feature map of a specified resolution, comprises:

the conventional feature detection network extracts feature maps of several different resolutions; pooling layers are added to down-sample the feature map of each resolution to the size of the lowest-resolution feature map, and all the feature maps are then concatenated to obtain a feature map A containing objects of different scales and semantic information;

convolutional layers are added so that the size of feature map A becomes w × h × 1, where w and h are the width and height of the lowest-resolution feature map;

a mask layer is added, and the resized feature map A is taken as the mask map.

3. The object detection method for mitigating class imbalance according to claim 1, characterized in that training the mask map and assigning a positive or negative label to each mask cell in the mask map according to the training result comprises:

training the mask map by logistic regression, with the following loss function:

L_mask = -∑ { y ln σ(x) + (1 - y) ln[1 - σ(x)] }

where y is the label of a mask cell, x is the output of the mask cell, and σ(x) denotes the sigmoid function applied to x, which maps the output to between 0 and 1; the transformed value represents the probability that the center of an object's ground-truth box falls inside the corresponding mask cell;

each mask cell is assigned a positive or negative label according to the transformed value and the label assignment rule for mask cells, the rule being: if the center of an object's ground-truth box falls within a mask cell, that cell is assigned as a positive mask cell; if the object lies entirely outside a mask cell, that cell is assigned as a negative mask cell.

4. The object detection method for mitigating class imbalance according to claim 1 or 3, characterized in that the method further comprises: up-sampling the mask map so that it extends to all detection layers of the conventional feature detection network.

5. The object detection method for mitigating class imbalance according to claim 1, characterized in that anchor boxes are trained according to the positive or negative label of the mask cells, and for positive mask cells the corresponding anchor boxes are assigned according to their overlap ratio with the ground-truth boxes;

the overall loss function for training can be expressed as:

L = L_obj + L_mask + L_cls + α L_loc

where L_obj is the objectness loss of the anchor boxes, L_mask is the mask map loss, L_cls is the classification loss, L_loc is the bounding box regression loss, and α is a weighting factor.

6. The object detection method for mitigating class imbalance according to claim 1, characterized in that, in the test phase, screening anchor boxes with the trained object detection model to obtain the object detection result comprises:

screening mask cells from the mask map using a first threshold, and screening anchor boxes from the retained mask cells using a second threshold, so as to obtain a preliminary prediction result;

following the procedure of an object detection algorithm, applying the non-maximum suppression algorithm to the preliminary prediction result to remove redundant anchor boxes, and then selecting the several prediction boxes with the highest class scores as the object detection result.