CN112115977A - Target detection algorithm based on scale invariance and feature fusion - Google Patents
- Publication number
- CN112115977A (application CN202010856245.XA)
- Authority
- CN
- China
- Prior art keywords
- feature
- candidate
- frame
- fusion
- feature maps
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/46—Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
- G06V10/462—Salient features, e.g. scale invariant feature transforms [SIFT]
Abstract
A target detection algorithm based on scale invariance and feature fusion comprises the following steps. Step one: input the image to be detected into DetNet59 for feature extraction to obtain several feature maps. Step two: apply selective feature fusion to the obtained feature maps to obtain several new feature maps with the same number of channels. Step three: generate candidate boxes from the feature maps and perform multiple rounds of selection, classification, and regression on the candidate boxes.
Description
Technical Field
The invention relates to the technical field of target detection, in particular to a target detection algorithm based on scale invariance and feature fusion.
Background
With the continuous development of deep-learning technology, more and more target detection methods have emerged. An image may contain a large number of targets, and classifying and detecting each one is difficult, especially for small targets; small-target detection is therefore a key topic in the field of target detection.
Target detection is a complex and important task with major applications in military, medical, and everyday settings. Existing target detection techniques fall into two main types: first, traditional methods based on hand-crafted features, such as Haar features, the AdaBoost algorithm, the SVM algorithm, and the DPM algorithm; second, methods based on deep-learning technology. Under deep learning, target detection mainly consists of two tasks: one is box prediction, marking the top, bottom, left, and right extent of each object; the other is class prediction, determining the category of each object. According to their pipelines, detectors are divided into two-stage and single-stage detection. Representative two-stage work is the R-CNN series, which generates object candidate regions (region proposals) and then refines them. Representative single-stage work is the YOLO and SSD series, which predict box positions directly with the network. In general, two-stage detection is more accurate than single-stage detection, while single-stage detection, though less accurate, is faster while still maintaining reasonable accuracy. However, both approaches suffer from a scale problem: they rely on large downsampling factors to obtain large receptive fields and rich semantic information, which benefits large-object recognition, but downsampling inevitably loses spatial resolution, and the larger the downsampling factor, the smaller the resolution and the harder small objects are to recognize.
To solve the scale-variation problem caused by downsampling, a common method is multi-scale feature fusion. FPN first used this method, fusing high-level features into low-level features through a top-down pathway so that the low-level features gain more semantic information. PANet then improved on FPN by adding a bottom-up pathway that progressively downsamples the low-level features to the resolution of the high-level features and fuses them, so that the high-level features also carry the spatial information of the low-level features. The drawback of this approach is that different layers are sensitive to different scales: even though the high-level features gain the spatial information of the low-level features, they also absorb low-level semantic information, which disturbs the trained high-level features and weakens their ability to classify and predict large objects.
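As a hedged illustration of the top-down idea described above (a minimal sketch of the generic FPN fusion step, not the patent's exact implementation), the core operation — upsample the semantically strong coarse map to the resolution of the spatially precise fine map, then add element-wise — can be written in a few lines of NumPy:

```python
import numpy as np

def upsample2x(feat):
    # Nearest-neighbor 2x upsampling along both spatial axes.
    return feat.repeat(2, axis=0).repeat(2, axis=1)

def top_down_fuse(fine, coarse):
    # FPN-style fusion: bring the coarse (high-level, semantic) map up to
    # the resolution of the fine (low-level, spatial) map, then add.
    return fine + upsample2x(coarse)

fine = np.zeros((8, 8))    # low-level map: high resolution, weak semantics
coarse = np.ones((4, 4))   # high-level map: low resolution, strong semantics
fused = top_down_fuse(fine, coarse)
```

A real FPN additionally applies a 1×1 lateral convolution before the addition and a 3×3 convolution after it; those are omitted here for brevity.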
Disclosure of Invention
To address the shortcomings of the prior art, the invention provides a target detection algorithm based on scale invariance and feature fusion that alleviates the scale-variation problem of existing target detection methods and improves the detection of both small and large targets. The specific technical scheme is as follows:
A target detection algorithm based on scale invariance and feature fusion adopts the following steps:
Step one: input the image to be detected into DetNet59 for feature extraction to obtain several feature maps;
Step two: apply selective feature fusion to the obtained feature maps to obtain several new feature maps with the same number of channels;
Step three: generate candidate boxes from the feature maps and perform multiple rounds of selection, classification, and regression on the candidate boxes.
As an optimization: the DetNet59 is an improved DetNet59. The improved DetNet59 shares steps one to five with the original DetNet59, generating feature maps 1-5. From step five onward, the 5th feature map is split into three branches that generate feature maps 6-8: the 6th feature map keeps the same resolution as the 5th while obtaining a different receptive field through dilated convolution; the 7th and 8th feature maps reduce the resolution to increase semantic information, and their receptive fields are then enlarged with dilated convolution.
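Why dilated convolution enlarges the receptive field without reducing resolution follows from the standard effective-kernel-size formula; the small calculation below is an assumption of that general formula, not a value taken from the patent:

```python
def effective_kernel(k, d):
    # Effective kernel size of a k x k convolution with dilation rate d:
    # the k taps span d*(k-1)+1 input positions.
    return d * (k - 1) + 1

# A 3x3 convolution with dilation 2 covers the same span as a 5x5 kernel,
# with no extra parameters and no loss of spatial resolution.
span_plain = effective_kernel(3, 1)    # ordinary 3x3 convolution
span_dilated = effective_kernel(3, 2)  # dilated 3x3 convolution
```

Stacking such layers, as the branches above do, keeps growing the receptive field while the feature-map size stays fixed.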
As an optimization: the second step, selective feature fusion, is specifically:
Step 2.1: convert the 2nd to 8th feature maps into 256-channel feature maps by convolution, where the converted 6th to 8th maps become P6-P8;
Step 2.2: upsample the 7th and 8th feature maps and fuse them, together with the 6th feature map, into feature map 5; after fusion, convolve the fusion result to generate P5;
Step 2.3: upsample P5 and fuse it into feature map 4; after fusion, convolve the fusion result to generate P4;
Step 2.4: repeat step 2.3 down to feature map 2, generating P3 and P2.
As an optimization: the third step is specifically:
Step 3.1: generate a large number of anchors for the P2, P3, P4, P5, P6, P7, and P8 layers;
Step 3.2: for the three layers P6, P7, and P8, screen the anchors and ground truths they generate by the condition l_i ≤ √(wh) ≤ u_i, where l_i is the minimum scale, u_i is the maximum scale, and w and h are the width and height of the box; P6 retains only small anchors, P7 only medium anchors, and P8 only large anchors. Then apply non-maximum suppression (NMS) with an IoU threshold of 0.5 to the anchors to generate the first set of candidate boxes, and classify and box-regress them. The IoU of two boxes is the area of their intersection divided by the area of their union. NMS compares all boxes pairwise: if the IoU of two boxes exceeds the set threshold, the box with the higher score is kept and the other is deleted. This yields the first set of candidate boxes. P6 regresses its loss only against small ground truths, P7 only against medium ground truths, and P8 only against large ground truths;
Step 3.3: after obtaining the regressed first set of candidate boxes, apply NMS with a threshold of 0.6 to generate the second set of candidate boxes, and classify and box-regress them;
Step 3.4: after obtaining the regressed second set of candidate boxes, apply NMS with a threshold of 0.7 to generate the final candidate boxes, and classify and box-regress them.
As an optimization: classifying the candidate boxes comprises:
mapping the features corresponding to the candidate boxes onto the (0, 1) interval with a softmax function over n categories, where n is an integer greater than 1, and taking the category with the highest probability as the predicted category:
S_i = e^{z_i} / Σ_j e^{z_j},
where S_i is the probability of the class, z_i is the prediction score of the class, and Σ_j e^{z_j} is the sum of the exponentiated scores of all classes.
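The softmax mapping can be sketched as below; subtracting the maximum score before exponentiating is a standard numerical-stability trick, not part of the patent text:

```python
import math

def softmax(scores):
    # Map raw class scores onto (0, 1) so the outputs sum to 1.
    m = max(scores)                              # numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
predicted = probs.index(max(probs))  # index of the highest-probability class
```

The shift by `m` leaves the result unchanged because it multiplies numerator and denominator by the same factor e^{-m}.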
As an optimization: the regression of the final candidate boxes comprises:
the regression uses the DIoU loss function, which accounts for the scale, overlap, and center distance between the candidate box and the target:
L_DIoU = 1 − IoU + ρ²(b, b^gt) / c²,
where IoU is the intersection-over-union of the target box and the candidate box, b is the center point of the candidate box, b^gt is the center point of the target box, ρ is the Euclidean distance between the two center points, and c is the diagonal length of the smallest enclosing region that contains both the candidate box and the target box.
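Under the formula above — the standard DIoU loss, assumed here to match the patent's intent — the computation can be sketched as:

```python
def diou_loss(p, g):
    # p = predicted box, g = ground-truth box, both as (x1, y1, x2, y2).
    # IoU term.
    ix1, iy1 = max(p[0], g[0]), max(p[1], g[1])
    ix2, iy2 = min(p[2], g[2]), min(p[3], g[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((p[2] - p[0]) * (p[3] - p[1])
             + (g[2] - g[0]) * (g[3] - g[1]) - inter)
    iou = inter / union
    # Squared distance between box centers (rho^2 term).
    rho2 = (((p[0] + p[2]) - (g[0] + g[2])) / 2) ** 2 \
         + (((p[1] + p[3]) - (g[1] + g[3])) / 2) ** 2
    # Squared diagonal of the smallest enclosing box (c^2 term).
    cw = max(p[2], g[2]) - min(p[0], g[0])
    ch = max(p[3], g[3]) - min(p[1], g[1])
    c2 = cw ** 2 + ch ** 2
    return 1.0 - iou + rho2 / c2

loss_same = diou_loss((0, 0, 2, 2), (0, 0, 2, 2))  # identical boxes -> 0
```

Unlike a plain IoU loss, the ρ²/c² penalty stays informative even when the boxes do not overlap, which is why DIoU converges faster in regression.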
The beneficial effects of the invention are: the image is fed into a deep neural network for feature extraction to obtain feature maps with scale invariance; after screening, candidate boxes are generated from the feature maps; non-maximum suppression with an IoU threshold of 0.5 selects the first set of candidate boxes, which are classified and regressed into new candidate boxes; NMS with a threshold of 0.6 then selects the second set of candidate boxes, which are again classified and regressed; finally, NMS with a threshold of 0.7 yields the final candidate boxes.
Drawings
FIG. 1 is a flow chart of a target detection algorithm based on scale invariance and feature fusion in accordance with the present invention;
FIG. 2 is a diagram of a multi-drop detnet network architecture in accordance with the present invention;
FIG. 3 is a diagram of an alternative fusion architecture in accordance with the present invention;
FIG. 4 is a graph of predicted object size for each branch in the present invention;
FIG. 5 is a diagram of multiple classification and regression of candidate frames in accordance with the present invention;
FIG. 6 is a diagram of a network architecture according to the present invention;
Detailed Description
The preferred embodiments of the invention are described in detail below with reference to the accompanying drawings, so that its advantages and features can be more easily understood by those skilled in the art and the scope of protection of the invention is clearly defined.
The hardware used by the invention comprises one PC and one NVIDIA 1080 Ti graphics card;
as shown in fig. 1: a target detection algorithm based on scale invariance and feature fusion comprises the following steps:
s1, inputting the image to be detected into an improved detnet59 for feature extraction to obtain a plurality of feature maps;
s2, carrying out a mode of selecting and fusing the characteristics on the obtained characteristic graphs to obtain a plurality of new characteristic graphs with the same channel;
and S3, generating a candidate frame by using the plurality of feature maps, and performing multiple selection classification and regression on the candidate frame.
Improved detnet59 network architecture:
referring to fig. 2, the improved detnet59 network structure used, using a convolution operation with step size 2 each time on the input pictures, produces 4 layers of feature maps C2, C2, C3, C4 of different sizes. 36 expansion convolutions are used, and a characteristic diagram is taken after every 9 expansion convolution layers and is divided into C5, C6, C7 and C8. And C5 uses the dilation convolution with dilation rate of 2, and C6 uses the dilation convolution with dilation rate of 2 on the basis of C5, so that a receptive field different from C5 is obtained. C7 is a method in which C5 is first convolved at a step size of 2 so that the image size is reduced in addition to C5, and the reduced image is then convolved with a dilation rate of 2 so that a receptive field different from C6 is obtained. C8 is also based on C5, C5 is first convolved with a step size of 2 to reduce the image size, and the reduced image is then convolved with a dilation rate of 2 to obtain a receptive field different from that of C6 or C7.
Selective fusion:
Referring to fig. 3, from the feature maps {C2, C3, C4, C5, C6, C7, C8} extracted in the first step, 1×1 convolutions with 256 output channels generate {C2_reduced, C3_reduced, C4_reduced, C5_reduced, P6, P7, P8}. {P7, P8} are bilinearly interpolated into {P7_upsampled, P8_upsampled}; C5_reduced is fused with {P7_upsampled, P8_upsampled, C6_reduced} by element-wise addition to generate P5_fused, and P5_fused is convolved with a 3×3, 256-channel convolution to obtain P5. P5 is bilinearly interpolated into P5_upsampled; C4_reduced and P5_upsampled are fused by addition to generate P4_fused, and P4_fused is convolved with a 3×3, 256-channel convolution to obtain P4.
In the same way, P4 is bilinearly interpolated into P4_upsampled; C3_reduced and P4_upsampled are fused by addition to generate P3_fused, which is convolved with a 3×3, 256-channel convolution to obtain P3. P3 is bilinearly interpolated into P3_upsampled; C2_reduced and P3_upsampled are fused by addition to generate P2_fused, which is convolved with a 3×3, 256-channel convolution to obtain P2.
prediction of anchor:
referring to FIG. 4, { P6, P7, P8} is fed into the RPN network, for which anchors and ground nodes are generated according toThe function is screened, and P6 is only retained in li,uiAt [0,90 ]]The range anchors, P7 remain only at li,uiAt [30,160 ]]The range anchors, P8 remain only at li,uiAt [90, ∞ ]]Anchors within the range. The anchors with corresponding sizes are predicted respectively. And { P2, P3, P4, P5} predicts the anchors of all scales
Multiple classification and regression of candidate frames:
referring to fig. 5, NMS non-maximum with threshold IoU of 0.5 is used to suppress the generation of candidate boxes for the first portion, and then the candidate boxes for the first portion are classified and bounding box regressed. And after the regressed candidate frames are obtained, inhibiting and generating the candidate frames of the second part by using the NMS non-maximum value with the threshold value of IoU being 0.6, and classifying and frame regressing the candidate frames of the second part. And after the regressed candidate frames are obtained, inhibiting and generating the candidate frames of the final part by using the NMS non-maximum value with the threshold value of IoU being 0.7, and classifying and performing border regression on the candidate frames of the final part. All classifications use the softmax function and all regressions are DIoU loss functions.
FIG. 6 is a block diagram of the overall network used in this patent.
Training the target detection network:
A pre-trained image model is loaded and the parameters of the feature-extraction part of the network are frozen; only the remaining part of the network is trained. Once the best result is reached, the next stage of training is carried out.
Claims (6)
1. A target detection algorithm based on scale invariance and feature fusion, characterized by comprising the following steps:
Step one: input the image to be detected into DetNet59 for feature extraction to obtain several feature maps;
Step two: apply selective feature fusion to the obtained feature maps to obtain several new feature maps with the same number of channels;
Step three: generate candidate boxes from the feature maps and perform multiple rounds of selection, classification, and regression on the candidate boxes.
2. The target detection algorithm based on scale invariance and feature fusion of claim 1, wherein: the DetNet59 is an improved DetNet59 that shares steps one to five with the original DetNet59, generating feature maps 1-5; from step five onward, the 5th feature map is split into three branches that generate feature maps 6-8, where the 6th feature map keeps the same resolution as the 5th while obtaining a different receptive field through dilated convolution, and the 7th and 8th feature maps reduce the resolution to increase semantic information and then have their receptive fields enlarged by dilated convolution.
3. The target detection algorithm based on scale invariance and feature fusion of claim 1, wherein the second step, selective feature fusion, is specifically:
Step 2.1: convert the 2nd to 8th feature maps into 256-channel feature maps by convolution, where the converted 6th to 8th maps become P6-P8;
Step 2.2: upsample the 7th and 8th feature maps and fuse them, together with the 6th feature map, into feature map 5; after fusion, convolve the fusion result to generate P5;
Step 2.3: upsample P5 and fuse it into feature map 4; after fusion, convolve the fusion result to generate P4;
Step 2.4: repeat step 2.3 down to feature map 2, generating P3 and P2.
4. The target detection algorithm based on scale invariance and feature fusion of claim 1, wherein the third step is specifically:
Step 3.1: generate a large number of anchors for the P2, P3, P4, P5, P6, P7, and P8 layers;
Step 3.2: for the three layers P6, P7, and P8, screen the anchors and ground truths they generate by the condition l_i ≤ √(wh) ≤ u_i, where l_i is the minimum scale, u_i is the maximum scale, and w and h are the width and height of the box; P6 retains only small anchors, P7 only medium anchors, and P8 only large anchors. Then apply non-maximum suppression (NMS) with an IoU threshold of 0.5 to the anchors to generate the first set of candidate boxes, and classify and box-regress them. The IoU of two boxes is the area of their intersection divided by the area of their union. NMS compares all boxes pairwise: if the IoU of two boxes exceeds the set threshold, the box with the higher score is kept and the other is deleted. This yields the first set of candidate boxes. P6 regresses its loss only against small ground truths, P7 only against medium ground truths, and P8 only against large ground truths;
Step 3.3: after obtaining the regressed first set of candidate boxes, apply NMS with a threshold of 0.6 to generate the second set of candidate boxes, and classify and box-regress them;
Step 3.4: after obtaining the regressed second set of candidate boxes, apply NMS with a threshold of 0.7 to generate the final candidate boxes, and classify and box-regress them.
5. The target detection algorithm based on scale invariance and feature fusion of claim 4, wherein classifying the candidate boxes comprises:
mapping the features corresponding to the candidate boxes onto the (0, 1) interval with a softmax function over n categories, where n is an integer greater than 1, and taking the category with the highest probability as the predicted category:
S_i = e^{z_i} / Σ_j e^{z_j},
where S_i is the probability of the class, z_i is the prediction score of the class, and Σ_j e^{z_j} is the sum of the exponentiated scores of all classes.
6. The target detection algorithm based on scale invariance and feature fusion of claim 4, wherein the regression of the final candidate boxes comprises:
the regression uses the DIoU loss function, which accounts for the scale, overlap, and center distance between the candidate box and the target:
L_DIoU = 1 − IoU + ρ²(b, b^gt) / c²,
where IoU is the intersection-over-union of the target box and the candidate box, b is the center point of the candidate box, b^gt is the center point of the target box, ρ is the Euclidean distance between the two center points, and c is the diagonal length of the smallest enclosing region containing both the candidate box and the target box.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010856245.XA CN112115977B (en) | 2020-08-24 | 2020-08-24 | Target detection algorithm based on scale invariance and feature fusion |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010856245.XA CN112115977B (en) | 2020-08-24 | 2020-08-24 | Target detection algorithm based on scale invariance and feature fusion |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112115977A true CN112115977A (en) | 2020-12-22 |
CN112115977B CN112115977B (en) | 2024-04-02 |
Family
ID=73805356
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010856245.XA Active CN112115977B (en) | 2020-08-24 | 2020-08-24 | Target detection algorithm based on scale invariance and feature fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112115977B (en) |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109345472A (en) * | 2018-09-11 | 2019-02-15 | 重庆大学 | A kind of infrared moving small target detection method of complex scene |
US20190057507A1 (en) * | 2017-08-18 | 2019-02-21 | Samsung Electronics Co., Ltd. | System and method for semantic segmentation of images |
CN109871806A (en) * | 2019-02-21 | 2019-06-11 | 山东大学 | Landform recognition methods and system based on depth residual texture network |
CN110689044A (en) * | 2019-08-22 | 2020-01-14 | 湖南四灵电子科技有限公司 | Target detection method and system combining relationship between targets |
CN110929578A (en) * | 2019-10-25 | 2020-03-27 | 南京航空航天大学 | Anti-blocking pedestrian detection method based on attention mechanism |
CN111027547A (en) * | 2019-12-06 | 2020-04-17 | 南京大学 | Automatic detection method for multi-scale polymorphic target in two-dimensional image |
CN111241905A (en) * | 2019-11-21 | 2020-06-05 | 南京工程学院 | Power transmission line nest detection method based on improved SSD algorithm |
CN111292305A (en) * | 2020-01-22 | 2020-06-16 | 重庆大学 | Improved YOLO-V3 metal processing surface defect detection method |
CN111310756A (en) * | 2020-01-20 | 2020-06-19 | 陕西师范大学 | Damaged corn particle detection and classification method based on deep learning |
- 2020-08-24: CN application CN202010856245.XA, patent CN112115977B, status Active
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190057507A1 (en) * | 2017-08-18 | 2019-02-21 | Samsung Electronics Co., Ltd. | System and method for semantic segmentation of images |
CN109345472A (en) * | 2018-09-11 | 2019-02-15 | 重庆大学 | A kind of infrared moving small target detection method of complex scene |
CN109871806A (en) * | 2019-02-21 | 2019-06-11 | 山东大学 | Landform recognition methods and system based on depth residual texture network |
CN110689044A (en) * | 2019-08-22 | 2020-01-14 | 湖南四灵电子科技有限公司 | Target detection method and system combining relationship between targets |
CN110929578A (en) * | 2019-10-25 | 2020-03-27 | 南京航空航天大学 | Anti-blocking pedestrian detection method based on attention mechanism |
CN111241905A (en) * | 2019-11-21 | 2020-06-05 | 南京工程学院 | Power transmission line nest detection method based on improved SSD algorithm |
CN111027547A (en) * | 2019-12-06 | 2020-04-17 | 南京大学 | Automatic detection method for multi-scale polymorphic target in two-dimensional image |
CN111310756A (en) * | 2020-01-20 | 2020-06-19 | 陕西师范大学 | Damaged corn particle detection and classification method based on deep learning |
CN111292305A (en) * | 2020-01-22 | 2020-06-16 | 重庆大学 | Improved YOLO-V3 metal processing surface defect detection method |
Non-Patent Citations (5)
Title |
---|
MIAOHUI ZHANG等: "Adaptive Anchor Networks for Multi-Scale Object Detection in Remote Sensing Images", 《IEEE ACCESS》, vol. 8, pages 57552 - 57565, XP011781249, DOI: 10.1109/ACCESS.2020.2982658 * |
ZEMING LI等: "DetNet: A Backbone network for Object Detection", 《ARXIV:1804.06215V2》, pages 1 - 17 * |
DING YAO: "Aerial Target Detection and Recognition Based on a Fusion Mechanism", China Masters' Theses Full-text Database (Information Science and Technology), no. 07, pages 138 - 572 *
LIU HUI et al.: "Infrared Target Tracking Algorithm Based on Multi-Feature Fusion and ROI Prediction", Acta Photonica Sinica, vol. 48, no. 07, pages 108 - 123 *
LI JI et al.: "Target Detection Algorithm Based on Scale Invariance and Feature Fusion", Journal of Nanjing University (Natural Science), vol. 57, no. 02, pages 237 - 244 *
Also Published As
Publication number | Publication date |
---|---|
CN112115977B (en) | 2024-04-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107341517B (en) | Multi-scale small object detection method based on deep learning inter-level feature fusion | |
CN111027493B (en) | Pedestrian detection method based on deep learning multi-network soft fusion | |
CN112507777A (en) | Optical remote sensing image ship detection and segmentation method based on deep learning | |
US20060165258A1 (en) | Tracking objects in videos with adaptive classifiers | |
CN111738055B (en) | Multi-category text detection system and bill form detection method based on same | |
CN113139543B (en) | Training method of target object detection model, target object detection method and equipment | |
Yang et al. | Real-time pedestrian and vehicle detection for autonomous driving | |
CN111368769A (en) | Ship multi-target detection method based on improved anchor point frame generation model | |
CN111274981B (en) | Target detection network construction method and device and target detection method | |
CN111461145B (en) | Method for detecting target based on convolutional neural network | |
CN111460980A (en) | Multi-scale detection method for small-target pedestrian based on multi-semantic feature fusion | |
CN113255837A (en) | Improved CenterNet network-based target detection method in industrial environment | |
CN112232371A (en) | American license plate recognition method based on YOLOv3 and text recognition | |
CN111340039A (en) | Target detection method based on feature selection | |
CN110084284A (en) | Target detection and secondary classification algorithm and device based on region convolutional neural networks | |
CN113313706A (en) | Power equipment defect image detection method based on detection reference point offset analysis | |
CN111462090B (en) | Multi-scale image target detection method | |
CN113297959A (en) | Target tracking method and system based on corner attention twin network | |
CN111368845A (en) | Feature dictionary construction and image segmentation method based on deep learning | |
CN114332921A (en) | Pedestrian detection method based on improved clustering algorithm for Faster R-CNN network | |
CN111931572B (en) | Target detection method for remote sensing image | |
CN114037839A (en) | Small target identification method, system, electronic equipment and medium | |
CN113963272A (en) | Unmanned aerial vehicle image target detection method based on improved yolov3 | |
CN116758340A (en) | Small target detection method based on super-resolution feature pyramid and attention mechanism | |
CN112115977A (en) | Target detection algorithm based on scale invariance and feature fusion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||