CN112348036A - Self-adaptive target detection method based on lightweight residual learning and deconvolution cascade

Self-adaptive target detection method based on lightweight residual learning and deconvolution cascade

Info

Publication number
CN112348036A
CN112348036A (application CN202011342607.XA)
Authority
CN
China
Prior art keywords
feature map
target
network
convolution
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011342607.XA
Other languages
Chinese (zh)
Inventor
刘芳
韩笑
孙亚楠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN202011342607.XA priority Critical patent/CN112348036A/en
Publication of CN112348036A publication Critical patent/CN112348036A/en
Pending legal-status Critical Current


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/44 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/048 - Activation functions


Abstract

The invention discloses a self-adaptive target detection method based on a lightweight residual network and deconvolution cascade, which comprises the following steps: acquiring an image training data set and a test data set; extracting deep-level features of the image to be detected through a lightweight residual network combining depthwise separable convolution and residual learning, to obtain a deep-level representation of the target; fixing the channel dimension of the feature maps extracted at different levels with 1x1 convolution; increasing the resolution of the deep-level feature maps with a deconvolution cascade structure so that their spatial size matches that of the previous-level feature map; using semantic features to guide the candidate-region generation network to adaptively generate target candidate boxes matching real targets on the multi-scale feature maps; and finally correcting the generated target candidate boxes (Anchors). The invention effectively improves the accuracy of target detection, can quickly and accurately detect targets under complex conditions, and effectively improves the real-time performance of target detection.

Description

Self-adaptive target detection method based on lightweight residual learning and deconvolution cascade
Technical Field
The invention relates to a target detection method, belongs to the fields of digital image processing, deep learning and artificial intelligence, and particularly relates to a self-adaptive target detection method based on lightweight residual learning and deconvolution cascade.
Background
With the rapid development of computer vision technology, target detection has become a research hotspot in artificial intelligence and computer vision and is widely applied in military and civil fields. Target detection identifies and localizes one or more specific targets in a video image sequence. In most cases the captured video images contain rich visual content; although this provides more comprehensive scene information, the targets to be detected usually undergo large scale changes, appear in dense clusters, or are occluded, and lack sufficient detection detail, so current target detection algorithms cannot effectively extract target features or accurately localize target positions. Accurately and efficiently detecting target objects is therefore one of the key problems of the target detection task.
In recent years, target detection based on deep learning has been highly successful, and many researchers now study target detection with deep learning methods. Mainstream target detection algorithms fall into two classes: single-stage detection based on regression, and two-stage detection based on region candidate boxes. The former is represented mainly by the YOLO series; its idea is to treat detection as a regression problem over target position and category and to output results directly from a convolutional neural network. The latter is represented mainly by the R-CNN series and divides detection into two stages: a candidate-region extraction module first extracts features through a mainstream backbone network to separate foreground from background, and a second stage then classifies the candidate regions and refines their coordinates to complete accurate detection. Single-stage methods are faster but less precise; two-stage methods require two network passes, which yields higher detection accuracy but reduces detection speed to some extent. However, with the development of convolutional neural networks, the appearance of various lightweight backbones (such as ShuffleNet and MobileNet), convolution variants (such as depthwise convolution, separable convolution and pointwise convolution) and connection patterns (such as Skip Connection) keeps lowering network and computational complexity, and hardware keeps advancing, which lays a foundation for improving target detection speed. In addition, with the wide application of convolutional neural networks, deconvolution has also drawn attention: as the inverse process of convolution, it can effectively alleviate the feature-map resolution loss and feature loss caused by deep convolution operations, and it is an important means of multi-scale feature fusion.
The existing methods have the following defects: on one hand, traditional classical target detection algorithms are limited by hand-crafted features and selective-search algorithms, so target detection precision is low, detection speed is slow, and robustness is poor; on the other hand, although deep-learning-based target detection improves precision, convolutional neural networks carry large numbers of parameters, highly complex structures and heavy computation, making real-time requirements hard to meet.
Disclosure of Invention
The invention aims to overcome the above defects and provides a self-adaptive target detection algorithm based on lightweight residual learning and deconvolution cascade. Drawing on the advantages of residual learning, ordinary convolution is split into a depthwise convolution layer and a pointwise convolution layer to compress the network parameters and improve the computational efficiency of the network. A multi-scale self-adaptive candidate-region generation network is then constructed on top of the lightweight residual network: high-level semantic features are fused into low-level feature maps through a deconvolution cascade structure to strengthen the features' ability to express the target, feature maps of multiple levels and scales are used for target prediction, and sparse candidate boxes of arbitrary shape are generated from the candidate-box positions and shapes predicted from image features, achieving better target detection performance.
In order to achieve the above object, the present invention provides a self-adaptive target detection method based on lightweight residual learning and deconvolution cascade, comprising the following steps:
s1: acquiring data through image acquisition equipment to obtain an image training data set and a test data set;
s2: constructing a lightweight depth residual network, inputting the image training data set and test data set of S1, and extracting features;
s3: selecting the feature maps extracted at the last four levels of the lightweight residual network and fixing the channel dimension of the output feature maps with 1x1 convolution;
s4: constructing a multi-scale self-adaptive candidate-region generation network; feature maps at different levels differ in size, the previous layer's map being larger than the current layer's, so to fuse them the feature maps of different sizes extracted in S3 are enlarged with a deconvolution cascade structure until their spatial size matches the previous-level feature map, the maps are weight-fused along the channel dimension, and a candidate-region generation network generates predicted target boxes and category information;
s5: position correction and category regression are carried out by adopting a multitask loss function of the following formula through the position and category information of a prediction target frame generated by the self-adaptive candidate region;
L = L_cls + L_reg + β1·L_loc + β2·L_shape (1)

wherein L is the overall loss function of the algorithm, L_cls is the classification loss used when classifying features, L_reg is the regression loss used for position regression, L_loc is the localization loss used when localizing the target, L_shape is the shape loss of the target detection box, and β1 and β2 are the weighting coefficients of the multitask loss, set to 1 and 0.1 respectively.
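For concreteness, the following is a minimal PyTorch sketch of how the four terms of formula (1) combine. The component losses are passed in as stand-in tensors, since their definitions are given only in the detailed steps below; all names here are illustrative, not part of the patent.

```python
import torch

# Weighting coefficients of the multitask loss in formula (1),
# as given in the description: beta1 = 1, beta2 = 0.1.
BETA1, BETA2 = 1.0, 0.1

def multitask_loss(l_cls: torch.Tensor, l_reg: torch.Tensor,
                   l_loc: torch.Tensor, l_shape: torch.Tensor) -> torch.Tensor:
    """L = L_cls + L_reg + beta1 * L_loc + beta2 * L_shape."""
    return l_cls + l_reg + BETA1 * l_loc + BETA2 * l_shape
```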
Advantageous effects
In the self-adaptive target detection method based on lightweight residual learning and deconvolution cascade, feature extraction uses a lightweight residual network built on the advantages of depthwise separable convolution and residual learning. Features are then fused through the deconvolution cascade to build a multi-scale self-adaptive candidate-region generation network: feature maps of different levels are brought to a consistent spatial size and weight-fused, and finally semantic features guide the network to adaptively generate target candidate boxes that better match real targets. Simulation experiments show that the method effectively extracts target features from video image sequences, strengthens the features' representation of the target, and can quickly and accurately identify and localize targets under occlusion, scale change and small-target conditions, with high precision and robustness; at the same time, the lightweight network greatly reduces computation and meets real-time detection requirements.
Drawings
The foregoing and/or additional aspects and advantages of the present invention will become more readily appreciated from the following description of the embodiments taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a flowchart of the adaptive target detection method based on lightweight residual learning and deconvolution cascade according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the lightweight residual network according to an embodiment of the present invention; and
FIG. 3 is a diagram illustrating the multi-scale feature fusion of the deconvolution cascade structure according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the drawings are illustrative only and should not be construed as limiting the invention.
As shown in fig. 1, the adaptive target detection method based on lightweight residual learning and deconvolution cascade according to the present invention includes the following steps:
s1: acquiring data through image acquisition equipment to obtain an image training data set and a test data set;
s1.1: preprocessing the samples in the data set by cropping, flipping, rotation, scaling and the like to expand the data set;
s1.2: extracting positive and negative samples in each image, labeling the positive samples to be detected, and labeling the position and the category of each target by using a rectangular frame;
s2: constructing a lightweight depth residual network, inputting the training data set, and performing feature extraction;
the method comprises the following specific steps:
s2.1: inputting the training data set into the lightweight residual network and applying depthwise separable convolution to the image (a code sketch follows S2.2 below);
1) performing depthwise convolution on the input image: each of the N channels of the input image feature F is assigned its own convolution kernel, and each kernel convolves only the features of its own channel; the kernel size matches that of a standard convolution, the number of kernels is N, the stride is 1, and padding is applied;
2) performing pointwise convolution on the feature map obtained from the depthwise convolution, with kernel size 1x1 and L kernels, to obtain a feature map of the specified channel dimension;
s2.2: connecting the shallow and deep networks by Skip Connection, fusing the feature information of the different-level feature maps after convolution, i.e., fusing bottom-layer feature information into higher layers;
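As referenced in S2.1, here is a minimal PyTorch sketch of one lightweight residual building block combining depthwise separable convolution (steps 1 and 2) with a Skip Connection (S2.2). The BatchNorm/ReLU placement and the channel widths in the example are assumptions, not specified by the invention.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Step 1): depthwise convolution, one kernel per input channel
    (groups=in_channels), stride 1, with padding; step 2): 1x1 pointwise
    convolution projecting to the desired channel dimension."""
    def __init__(self, in_channels: int, out_channels: int, kernel_size: int = 3):
        super().__init__()
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size,
                                   stride=1, padding=kernel_size // 2,
                                   groups=in_channels, bias=False)
        self.pointwise = nn.Conv2d(in_channels, out_channels, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)   # placement assumed
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

class LightweightResidualBlock(nn.Module):
    """S2.2: a Skip Connection adds the input back to the convolved
    output, fusing shallow (bottom-layer) features into deeper layers."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = DepthwiseSeparableConv(channels, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.conv(x)

# Illustrative usage: a 64-channel feature map passes through unchanged in size.
x = torch.randn(1, 64, 56, 56)
print(LightweightResidualBlock(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```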
s3: selecting the feature maps of the last four levels of the lightweight residual network and fixing the channel dimension of the output feature maps with 1x1 convolution;
s4: constructing a multi-scale self-adaptive candidate-region generation network: increasing the resolution of deep-level feature maps with a deconvolution cascade structure until their spatial size matches the previous-level feature map, weight-fusing the spatially consistent feature maps along the channel dimension, and generating predicted target boxes and category information with the candidate-region generation network;
the method comprises the following steps:
s4.1: selecting the multi-level feature maps {C2, C3, C4, C5} of the lightweight residual network, corresponding to the output of the last layer of each network stage;
s4.2: applying a deconvolution operation to the high-level feature map P5 (obtained from C5 by 1x1 convolution) so that its size matches C4, then weight-fusing it with the corresponding previous-level feature map C4 to obtain a new feature map P4;
s4.3: repeating the process of S4.2 until a feature map P2 is generated whose size matches C2 and which carries more detailed feature information for small targets. Rather than simply adding maps with equal weight, weights are assigned to the 6 feature maps involved, and the weighted fusion formula is:

P4 = α1·C4 + α2·D(P5)
P3 = α3·C3 + α4·D(P4)    (2)
P2 = α5·C2 + α6·D(P3)

where D(·) is the deconvolution transfer function and α1, α2, α3, α4, α5, α6 are weight coefficients with values 0.7, 0.3, 0.6, 0.4, 0.45 and 0.55 respectively; to avoid redundant feature information, the weight coefficients fused at each layer sum to 1.
S4.4: inputting the feature map subjected to deconvolution cascade feature fusion into a self-adaptive candidate area generation network to obtain the center position and the shape of the Anchor, wherein the specific steps are as follows;
1) the Anchor shape generated adaptively from image features varies with position; an anchor feature adaptation branch network N_T transforms the features so that the features of a larger Anchor encode the content of a larger region, while the features of a smaller Anchor extract the content of a smaller region; this branch is implemented with a 3 x 3 deformable convolution layer:

f_i' = N_T(f_i, w_i, h_i) (3)

where f_i is the feature at the i-th position and (w_i, h_i) is the corresponding Anchor shape. That is, an offset is predicted from the output of the shape prediction branch, and a deformable convolution applied to the original feature map then yields f_i'.
2) Anchor's center prediction branch network N_L generates a probability map of the same size as the input feature map F_I, where P(i, j | F_I) denotes the probability that a target object appears at position (i, j) of the feature map, corresponding to coordinates [(i+1/2)s, (j+1/2)s] in image I, s being the stride of the feature map. The N_L branch uses a 1x1 convolution to obtain a confidence map of the target, which a sigmoid function then converts into probability values. From the generated probability map, regions where a target may exist are determined by selecting positions whose probability exceeds a predefined threshold (0.05 in the comparative experiments);
3) after the likely target locations are determined, the shape prediction branch N_S, which contains a 1x1 convolution layer, generates a two-channel map holding the values dw and dh. Given the input feature map F_I, the shape prediction branch predicts the optimal shape (w, h) at each position; because w and h may span a very large range, the branch outputs dw and dh under the transformation of formula (4), from which w and h are recovered, where s is the stride and λ is an empirical scale factor (8 in this experiment). This nonlinear transformation maps, for example, [0, 1000] to [-1, 1], making the shape prediction branch simpler and more stable to compute:

w = λ·s·e^dw, h = λ·s·e^dh (4)
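A minimal PyTorch sketch of the three S4.4 branches follows, using torchvision's DeformConv2d as the 3 x 3 deformable convolution. How the offsets are derived from the shape branch output is not fully specified above, so the 1x1 offset-prediction convolution, the channel width, and the threshold handling are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class AdaptiveAnchorHead(nn.Module):
    def __init__(self, dim: int = 256, lam: float = 8.0, thresh: float = 0.05):
        super().__init__()
        self.loc = nn.Conv2d(dim, 1, kernel_size=1)    # N_L: center prediction
        self.shape = nn.Conv2d(dim, 2, kernel_size=1)  # N_S: (dw, dh) map
        # Offsets for a 3x3 deformable kernel (2 * 3 * 3 = 18 channels),
        # predicted from the shape branch output (assumed design).
        self.offset = nn.Conv2d(2, 18, kernel_size=1)
        self.adapt = DeformConv2d(dim, dim, kernel_size=3, padding=1)  # N_T
        self.lam, self.thresh = lam, thresh

    def forward(self, feat: torch.Tensor, stride: int):
        prob = torch.sigmoid(self.loc(feat))       # probability map P(i, j | F_I)
        keep = prob > self.thresh                  # likely target positions
        dwdh = self.shape(feat)
        # Formula (4): recover anchor width/height from (dw, dh).
        w = self.lam * stride * torch.exp(dwdh[:, 0:1])
        h = self.lam * stride * torch.exp(dwdh[:, 1:2])
        adapted = self.adapt(feat, self.offset(dwdh))  # f_i' = N_T(f_i, w_i, h_i)
        return prob, keep, w, h, adapted

# Illustrative usage on one fused level with stride 4 (values assumed):
head = AdaptiveAnchorHead()
prob, keep, w, h, adapted = head(torch.randn(1, 256, 112, 112), stride=4)
```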
S5: performing position correction and category regression by adopting a multitask loss function of the following formula according to the position and the category of a prediction target frame generated by the self-adaptive candidate region;
L = L_cls + L_reg + β1·L_loc + β2·L_shape (5)

wherein L_cls and L_reg denote the classification and regression losses of a conventional network, and L_loc and L_shape are the newly added anchor localization loss and anchor shape loss, respectively.
The method comprises the following steps:
s5.1: mapping the ground-truth target box (x_g, y_g, w_g, h_g) onto the feature map as (x'_g, y'_g, w'_g, h'_g); the classification loss and regression loss adopt the cross-entropy loss function and the mean-square-error function, respectively. Two regions are then defined within the target feature-mapping region, (x'_g, y'_g, δ1·w'_g, δ1·h'_g) and (x'_g, y'_g, δ2·w'_g, δ2·h'_g), with δ1 = 0.2 and δ2 = 0.5. (x'_g, y'_g, δ1·w'_g, δ1·h'_g) is the central region; the part of (x'_g, y'_g, δ2·w'_g, δ2·h'_g) outside the central region is the ignored region, and the remainder is the peripheral region.
s5.2: taking the central region as positive samples and the peripheral region as negative samples, and training the localization branch loss L_loc with Focal Loss.
s5.3: training the shape prediction branch with the following IOU computation:

vIOU(a_wh, G) = max_(w,h) IOU(a_wh, G) (6)

where IOU(·) is the conventional IOU, G denotes the ground-truth target box, and a_wh denotes an anchor of variable shape; as is common, 9 anchors of different aspect ratios and sizes are enumerated as a_wh, and the maximum value is taken as the final vIOU(a_wh, G). The shape loss L_shape of the target-box anchor is determined as in formula (7), where l1 is the smooth-L1 loss function, and (w, h) and (w_g, h_g) denote the predicted anchor shape and the corresponding ground-truth target shape, respectively:

L_shape = l1(1 - min(w/w_g, w_g/w)) + l1(1 - min(h/h_g, h_g/h)) (7)
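A minimal PyTorch sketch of the S5.1 region partition and the shape loss of formula (7) as reconstructed above. Coordinates are taken in feature-map units with the box given by its center and size, which is an assumption about the exact parameterization; the numbers in the usage example are illustrative.

```python
import torch
import torch.nn.functional as F

def region_masks(cx, cy, w, h, H, W, d1=0.2, d2=0.5):
    """S5.1: split an HxW feature map into central / ignored / peripheral
    regions around the mapped ground-truth box (cx, cy, w, h)."""
    ys = torch.arange(H).float().view(H, 1).expand(H, W)
    xs = torch.arange(W).float().view(1, W).expand(H, W)
    def inside(frac):
        return ((xs - cx).abs() <= frac * w / 2) & ((ys - cy).abs() <= frac * h / 2)
    central = inside(d1)                 # positive samples
    ignored = inside(d2) & ~central      # neither positive nor negative
    peripheral = ~inside(d2)             # negative samples
    return central, ignored, peripheral

def shape_loss(w, h, wg, hg):
    """Formula (7): smooth-L1 penalty on the ratio mismatch between the
    predicted anchor shape (w, h) and the matched ground truth (wg, hg)."""
    lw = F.smooth_l1_loss(torch.min(w / wg, wg / w), torch.ones_like(w))
    lh = F.smooth_l1_loss(torch.min(h / hg, hg / h), torch.ones_like(h))
    return lw + lh

# Illustrative usage:
c, i, p = region_masks(cx=28.0, cy=28.0, w=20.0, h=12.0, H=56, W=56)
print(shape_loss(torch.tensor([18.0]), torch.tensor([10.0]),
                 torch.tensor([20.0]), torch.tensor([12.0])))
```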
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes may be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (5)

1. The self-adaptive target detection method based on lightweight residual learning and deconvolution cascade is characterized by comprising the following steps of:
s1: acquiring data through image acquisition equipment to obtain an image training data set and a test data set;
s2: constructing a lightweight depth residual network, inputting the image training data set and test data set of S1, and extracting features;
s3: selecting the feature maps extracted at the last four levels of the lightweight residual network and fixing the channel dimension of the output feature maps with 1x1 convolution;
s4: constructing a multi-scale self-adaptive candidate-region generation network; feature maps at different levels differ in size, the previous layer's map being larger than the current layer's, so to fuse them the feature maps of different sizes extracted in S3 are enlarged with a deconvolution cascade structure until their spatial size matches the previous-level feature map, the maps are weight-fused along the channel dimension, and a candidate-region generation network generates predicted target boxes and category information;
s5: position correction and category regression are carried out by adopting a multitask loss function of the following formula through the position and category information of a prediction target frame generated by the self-adaptive candidate region;
L = L_cls + L_reg + β1·L_loc + β2·L_shape (1)

wherein L is the overall loss function of the algorithm, L_cls is the classification loss used when classifying features, L_reg is the regression loss used for position regression, L_loc is the localization loss used when localizing the target, L_shape is the shape loss of the target detection box, and β1 and β2 are the weighting coefficients of the multitask loss, set to 1 and 0.1 respectively.
2. The adaptive target detection method based on lightweight residual learning and deconvolution cascade of claim 1, characterized in that:
in S1, acquiring data through an image acquisition device to obtain an image training data set and a test data set;
s1.1: preprocessing the samples in the data set through cropping, flipping, rotation and scale transformation to expand the data set;
s1.2: and extracting positive and negative samples in each image, labeling the positive samples to be detected, and labeling the position and the category of each target by using a rectangular frame.
3. The adaptive target detection method based on lightweight residual learning and deconvolution cascade of claim 1, characterized in that: in S2, constructing a lightweight depth residual network, inputting the training data set, and performing feature extraction;
the method comprises the following steps:
s2.1: inputting the training data set into the lightweight residual network and applying depthwise separable convolution to the image;
1) performing depthwise convolution on the input image: each of the N channels of the input image feature F is assigned its own convolution kernel, and each kernel convolves only the features of its own channel; the kernel size matches that of a standard convolution, the number of kernels is N, the stride is 1, and padding is applied;
2) performing pointwise convolution on the feature map obtained from the depthwise convolution, with kernel size 1x1 and L kernels, to obtain a feature map of the specified channel dimension;
s2.2: connecting the shallow and deep networks by skip connection and fusing the feature information of the different-level feature maps after convolution, which is equivalent to fusing bottom-layer feature information into higher layers.
4. The adaptive target detection method based on lightweight residual learning and deconvolution cascade of claim 1, characterized in that:
in S4, a multi-scale self-adaptive candidate-region generation network is constructed: the resolution of deep-level feature maps is increased with a deconvolution cascade structure until their spatial size matches the previous-level feature map, the spatially consistent feature maps are weight-fused along the channel dimension, and a candidate-region generation network generates predicted target boxes and category information;
the method comprises the following steps:
s4.1: selecting the multi-level feature maps {C2, C3, C4, C5} of the lightweight residual network, corresponding to the output of the last layer of each network stage;
s4.2: applying a deconvolution operation to the high-level feature map P5 (obtained from C5 by 1x1 convolution) so that its size matches C4, then weight-fusing it with the corresponding previous-level feature map C4 to obtain a new feature map P4;
s4.3: repeating the process of S4.2 until a feature map P2 is generated whose size matches C2 and which carries more detailed feature information for small targets; rather than simply adding maps with equal weight, weights are assigned to the 6 feature maps involved, and the weighted fusion formula is:

P4 = α1·C4 + α2·D(P5)
P3 = α3·C3 + α4·D(P4)    (2)
P2 = α5·C2 + α6·D(P3)

where D(·) is the deconvolution transfer function and α1, α2, α3, α4, α5, α6 represent weight coefficients with values 0.7, 0.3, 0.6, 0.4, 0.45 and 0.55 respectively, the weight coefficients fused at each layer summing to 1 to avoid redundant feature information;
s4.4: inputting the feature maps after deconvolution-cascade feature fusion into the self-adaptive candidate-region generation network to obtain the center position and shape of each Anchor, with the following specific steps:
1) the Anchor shape generated adaptively from image features varies with position; an anchor feature adaptation branch network N_T transforms the features, this branch being implemented with a 3 x 3 deformable convolution layer:

f_i' = N_T(f_i, w_i, h_i)

where f_i is the feature at the i-th position and (w_i, h_i) is the corresponding Anchor shape; that is, an offset is predicted from the output of the shape prediction branch, and a deformable convolution applied to the original feature map then yields f_i';
2) Anchor's center prediction branch network N_L generates a probability map of the same size as the input feature map F_I, where P(i, j | F_I) denotes the probability that a target object appears at position (i, j) of the feature map, corresponding to coordinates [(i+1/2)s, (j+1/2)s] in image I, s being the stride of the feature map; the N_L branch uses a 1x1 convolution to obtain a confidence map of the target and then converts it into probability values with a sigmoid function; from the generated probability map, regions where a target may exist are determined by selecting positions whose probability exceeds a predefined threshold;
3) after the likely target locations are determined, the shape prediction branch N_S, which contains a 1x1 convolution layer, generates a two-channel map holding the values dw and dh; given the input feature map F_I, the shape prediction branch predicts the best shape (w, h) at each position, and since w and h may span a large range, the branch outputs dw and dh under the following transformation, from which w and h are recovered, where s is the stride and λ is an empirical scale factor:

w = λ·s·e^dw, h = λ·s·e^dh
5. The adaptive target detection method based on lightweight residual learning and deconvolution cascade of claim 1, characterized in that: in S5, position correction and category regression are performed, using the position and category of the predicted target boxes generated by the adaptive candidate region, with the multitask loss function of the following formula;
L = L_cls + L_reg + β1·L_loc + β2·L_shape

wherein L_cls and L_reg denote the classification and regression losses of a conventional network, and L_loc and L_shape are the newly added anchor localization loss and anchor shape loss, respectively;
the method comprises the following steps:
s5.1: mapping the ground-truth target box (x_g, y_g, w_g, h_g) onto the feature map as (x'_g, y'_g, w'_g, h'_g); the classification loss and regression loss adopt the cross-entropy loss function and the mean-square-error function, respectively; two regions, (x'_g, y'_g, δ1·w'_g, δ1·h'_g) and (x'_g, y'_g, δ2·w'_g, δ2·h'_g), are then defined within the target feature-mapping region, with δ1 = 0.2 and δ2 = 0.5; (x'_g, y'_g, δ1·w'_g, δ1·h'_g) is the central region, the part of (x'_g, y'_g, δ2·w'_g, δ2·h'_g) outside the central region is the ignored region, and the remainder is the peripheral region;
s5.2: taking the central region as positive samples and the peripheral region as negative samples, and training the localization branch loss L_loc with Focal Loss;
s5.3: training the shape prediction branch with the following IOU computation:

vIOU(a_wh, G) = max_(w,h) IOU(a_wh, G)

where IOU(·) is the conventional IOU, G denotes the ground-truth target box, and a_wh denotes an anchor of variable shape; as is common, 9 anchors of different aspect ratios and sizes are enumerated as a_wh, and the maximum value is taken as the final vIOU(a_wh, G); the shape loss L_shape of the target-box anchor is determined as below, where l1 is the smooth-L1 loss function, and (w, h) and (w_g, h_g) denote the predicted anchor shape and the corresponding ground-truth target shape, respectively:

L_shape = l1(1 - min(w/w_g, w_g/w)) + l1(1 - min(h/h_g, h_g/h))
CN202011342607.XA 2020-11-26 2020-11-26 Self-adaptive target detection method based on lightweight residual learning and deconvolution cascade Pending CN112348036A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011342607.XA CN112348036A (en) 2020-11-26 2020-11-26 Self-adaptive target detection method based on lightweight residual learning and deconvolution cascade

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011342607.XA CN112348036A (en) 2020-11-26 2020-11-26 Self-adaptive target detection method based on lightweight residual learning and deconvolution cascade

Publications (1)

Publication Number Publication Date
CN112348036A true CN112348036A (en) 2021-02-09

Family

ID=74365696

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011342607.XA Pending CN112348036A (en) 2020-11-26 2020-11-26 Self-adaptive target detection method based on lightweight residual learning and deconvolution cascade

Country Status (1)

Country Link
CN (1) CN112348036A (en)



Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019144575A1 (en) * 2018-01-24 2019-08-01 中山大学 Fast pedestrian detection method and device
CN109344821A (en) * 2018-08-30 2019-02-15 西安电子科技大学 Small target detecting method based on Fusion Features and deep learning
CN109948607A (en) * 2019-02-21 2019-06-28 电子科技大学 Candidate frame based on deep learning deconvolution network generates and object detection method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Liu Fang, et al., "Adaptive UAV target detection based on multi-scale feature fusion", Acta Optica Sinica, vol. 40, no. 10, 25 May 2020 (2020-05-25), pages 1-10 *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112926605B (en) * 2021-04-01 2022-07-08 天津商业大学 Multi-stage strawberry fruit rapid detection method in natural scene
CN112926605A (en) * 2021-04-01 2021-06-08 天津商业大学 Multi-stage strawberry fruit rapid detection method in natural scene
CN113486718B (en) * 2021-06-08 2023-04-07 天津大学 Fingertip detection method based on deep multitask learning
CN113486718A (en) * 2021-06-08 2021-10-08 天津大学 Fingertip detection method based on deep multitask learning
CN113486979A (en) * 2021-07-28 2021-10-08 佛山市南海区广工大数控装备协同创新研究院 Lightweight target detection method based on key points
CN113554131A (en) * 2021-09-22 2021-10-26 四川大学华西医院 Medical image processing and analyzing method, computer device, system and storage medium
CN114022705A (en) * 2021-10-29 2022-02-08 电子科技大学 Adaptive target detection method based on scene complexity pre-classification
CN114022705B (en) * 2021-10-29 2023-08-04 电子科技大学 Self-adaptive target detection method based on scene complexity pre-classification
CN114842189A (en) * 2021-11-10 2022-08-02 北京中电兴发科技有限公司 Adaptive Anchor generation method for target detection
CN114842189B (en) * 2021-11-10 2022-11-04 北京中电兴发科技有限公司 Adaptive Anchor generation method for target detection
CN114936983A (en) * 2022-06-16 2022-08-23 福州大学 Underwater image enhancement method and system based on depth cascade residual error network
CN115393682A (en) * 2022-08-17 2022-11-25 龙芯中科(南京)技术有限公司 Target detection method, target detection device, electronic device, and medium
CN115526886A (en) * 2022-10-26 2022-12-27 中国铁路设计集团有限公司 Optical satellite image pixel level change detection method based on multi-scale feature fusion
CN115526886B (en) * 2022-10-26 2023-05-26 中国铁路设计集团有限公司 Optical satellite image pixel level change detection method based on multi-scale feature fusion

Similar Documents

Publication Publication Date Title
CN112348036A (en) Self-adaptive target detection method based on lightweight residual learning and deconvolution cascade
CN110443143B (en) Multi-branch convolutional neural network fused remote sensing image scene classification method
CN111126472B (en) SSD (solid State disk) -based improved target detection method
CN111210443B (en) Deformable convolution mixing task cascading semantic segmentation method based on embedding balance
CN109241982B (en) Target detection method based on deep and shallow layer convolutional neural network
WO2021042828A1 (en) Neural network model compression method and apparatus, and storage medium and chip
CN109145979B (en) Sensitive image identification method and terminal system
CN109949255B (en) Image reconstruction method and device
WO2021022521A1 (en) Method for processing data, and method and device for training neural network model
CN114022432B (en) Insulator defect detection method based on improved yolov5
CN110059586B (en) Iris positioning and segmenting system based on cavity residual error attention structure
CN112070044B (en) Video object classification method and device
CN112651438A (en) Multi-class image classification method and device, terminal equipment and storage medium
CN112329760B (en) Method for recognizing and translating Mongolian in printed form from end to end based on space transformation network
CN110633708A (en) Deep network significance detection method based on global model and local optimization
CN111612017A (en) Target detection method based on information enhancement
CN111783819B (en) Improved target detection method based on region of interest training on small-scale data set
CN113609896A (en) Object-level remote sensing change detection method and system based on dual-correlation attention
CN112150493A (en) Semantic guidance-based screen area detection method in natural scene
CN110796199A (en) Image processing method and device and electronic medical equipment
CN113569881A (en) Self-adaptive semantic segmentation method based on chain residual error and attention mechanism
CN111899203A (en) Real image generation method based on label graph under unsupervised training and storage medium
CN115311502A (en) Remote sensing image small sample scene classification method based on multi-scale double-flow architecture
CN113297959A (en) Target tracking method and system based on corner attention twin network
CN116596966A (en) Segmentation and tracking method based on attention and feature fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination