CN110569782A - Target detection method based on deep learning - Google Patents

Target detection method based on deep learning

Info

Publication number
CN110569782A
CN110569782A
Authority
CN
China
Prior art keywords
layer
target
network
convolution
detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910836094.9A
Other languages
Chinese (zh)
Inventor
赵骥
于海龙
吴晓翎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology Liaoning USTL
Original Assignee
University of Science and Technology Liaoning USTL
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology Liaoning USTL
Priority to CN201910836094.9A
Publication of CN110569782A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/20 - Scenes; Scene-specific elements in augmented reality scenes
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 - Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 - Target detection

Abstract

A target detection method based on deep learning replaces the VGG16 network used for extracting image features in the Faster RCNN method with a 101-layer residual network that is deeper and has stronger expressive capability; the structure of the residual unit is changed to a pre-activation form, so that information flows more smoothly through the network during forward and backward propagation; and, taking the basic convolution operation as the entry point for improvement, automatic deformable convolution is introduced, dynamically adjusting the size and position of the convolution kernel according to the image content currently being recognized. Addressing characteristics such as diverse target forms, small inter-class differences, unclear targets, very small targets, occlusion between targets, and complex backgrounds, the candidate-region-based deep learning algorithm Faster RCNN is improved and a new target detection method is established. The target detection algorithm is highly robust: occlusion, varying illumination, similar backgrounds, and unclear targets do not compromise the detection result, and missed and false detections are greatly reduced.

Description

Target detection method based on deep learning
Technical Field
The invention relates to the technical field of computer vision, and in particular to a target detection method based on deep learning.
Background
Vision is the main way in which human beings perceive external information and provides a vital basis for distinguishing the things in sight. Target detection is one of the most classical research topics in computer vision, with important application value in scenarios such as new retail, intelligent traffic control, intelligent highway-intersection management, community security, and even the national military field. Target detection refers to detecting, extracting, and segmenting a target from background information, and quickly and accurately representing and localizing the target in an input image, laying the foundation for reading and understanding target behavior; the accuracy and efficiency of target detection therefore directly affect the quality of post-processing such as target recognition.
Currently, the following methods are mainly used for target detection:
1. The inter-frame difference method computes the difference between the pixels of two or more adjacent frames in an image sequence and converts the difference into a binary image by thresholding to determine the moving target. This method performs well only when the background is static, and the target is difficult to detect when the texture and color distributions of the background and the detected object are too uniform.
2. The background subtraction method obtains a difference region by subtracting the pixels of a background image from the pixels of the current input image. Its robustness is sensitive to environmental changes, and it is suitable only for target detection against a relatively stable background.
3. Sliding-window methods traverse the whole image with windows of different scales, extract HOG, Haar, or SIFT features of the target, and then classify the content of each window with SVM and AdaBoost classifiers; this exhaustive search consumes a large amount of time.
4. The multi-scale deformable part model (DPM) detection algorithm builds on improved HOG features, an SVM classifier, and the sliding-window idea, adopting a multi-component strategy for the multi-view problem of the target; it is mainly suited to face and pedestrian detection tasks, is relatively complex, and has low detection efficiency.
5. The deep-learning-based SSD algorithm introduces a multi-scale concept, but its effect in detecting small target objects is unsatisfactory.
6. The candidate-region-based deep learning method Faster RCNN still shows weak detection performance when detecting small targets and targets with a large degree of mutual overlap.
In recent years, target detection based on deep learning has received much attention from researchers. Compared with traditional detection methods, the performance of deep-learning-based target detection is greatly improved, but several disadvantages remain: 1. the learned target features are incomplete, and excessively small targets cannot be detected; 2. because the Faster RCNN method removes candidate boxes with non-maximum suppression, missed detections occur when detecting targets that cross, overlap, or occlude one another; 3. the Faster RCNN algorithm extracts target features with a VGG16 network, and since the geometric shape of the convolution kernel used in the convolution operation is fixed, the geometric structure of the network built by stacking such layers is also fixed, which limits feature extraction to a certain degree; the network therefore cannot cope well with geometric deformation, and the detection algorithm performs poorly on targets appearing in different states under different viewing angles.
Disclosure of Invention
In order to solve the technical problems noted in the background art, the invention provides a target detection method based on deep learning, which improves the candidate-region-based deep learning algorithm Faster RCNN and establishes a new target detection method, addressing characteristics such as diverse target forms, small inter-class differences, unclear targets, very small targets, occlusion between targets, and complex backgrounds, so as to detect targets more accurately.
In order to achieve the purpose, the invention adopts the following technical scheme:
A target detection method based on deep learning comprises the following steps:
1) Replacing the VGG16 network used for extracting image features in the Faster RCNN method with a 101-layer residual network having stronger expressive capability and deeper layers;
The forward propagation of the residual network is linear: the input of a later layer is the sum of the input of an earlier layer and the outputs of the intervening residual units. After multiple iterations, the expression for a deep unit L is obtained:

$$x_L = x_l + \sum_{i=l}^{L-1} F(x_i, w_i)$$

where $x_L$ represents the output vector of the L-th (deep) layer, $x_l$ the input vector of the l-th (shallow) layer, and $F(x_i, w_i)$ the residual of the i-th layer.
2) The structure of the residual unit is changed to a pre-activation form, so that information flows more smoothly through the network during forward and backward propagation; the pre-activation form is specifically as follows:
To make network training easier, the conventional "post-activation" structure of the residual unit is changed to "pre-activation". In the modified residual unit structure, $x_L$ can be regarded as $x_l$ plus an accumulation of residuals, so the gradient can be propagated back in full when the network performs back-propagation and information is transmitted smoothly. The improved residual unit structure is represented by the following formula:

$$\frac{\partial \varepsilon}{\partial x_l} = \frac{\partial \varepsilon}{\partial x_L}\left(1 + \frac{\partial}{\partial x_l}\sum_{i=l}^{L-1} F(x_i, W_i)\right)$$

where $\varepsilon$ represents the loss error, expressed as $\varepsilon = \frac{1}{2}(x_{label} - x_L)^2$; $x_L$ denotes the layer-L prediction, $x_{label}$ the label vector corresponding to layer L, and $F(x_i, W_i)$ represents the residual.
3) Taking the basic convolution operation as the entry point for improvement, automatic deformable convolution is introduced, and the size and position of the convolution kernel are dynamically adjusted according to the image content currently being recognized;
A regular grid $\mathcal{R}$ over an n × n convolution kernel defines the size of the receptive field. For each pixel $p_0$ on the output feature map y of a conventional convolutional neural network:

$$y(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n) \cdot x(p_0 + p_n)$$

where $p_n$ enumerates the positions in $\mathcal{R}$, $x(p_0 + p_n)$ is a sampled point, and $w(p_n)$ represents the corresponding weight.
In order to adapt better to target deformation, automatic deformable convolution is introduced: an offset variable $\Delta p_n$ is added to each sampling point of the convolution kernel, giving the kernel the ability to deform. For automatic deformable convolution the formula becomes:

$$y(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n) \cdot x(p_0 + p_n + \Delta p_n)$$

where $\Delta p_n$ is the offset corresponding to each sampling position $p_n$, and the elements of $\mathcal{R}$ are augmented by the offsets $\{\Delta p_n \mid n = 1, \ldots, N\}$ with $N = |\mathcal{R}|$. Sampling therefore occurs at the irregular, offset positions $p_n + \Delta p_n$.
4) For the common situation in which targets cross, overlap, and occlude one another, a new soft non-maximum suppression strategy is adopted;
The softNMS algorithm is introduced to post-process the candidate boxes, solving the missed detections caused by largely overlapping targets, by resetting the scores of the candidate boxes. In the score resetting function, $\mathrm{iou}(M, b_i)$ is the intersection-over-union of the currently highest-scoring candidate box M and a remaining candidate box $b_i$: the intersection is computed first, the union is obtained by subtracting the intersection from the sum of the areas of the two bounding boxes, and the overlap ratio is then transformed to produce the new score $s_i$. When the overlap of neighboring detection boxes exceeds a threshold, the scores of the neighboring candidate boxes are reduced by this function rather than eliminated outright; the function strongly attenuates the scores of detection boxes very close to M, while boxes far from M are unaffected and remain in the object detection sequence.
Compared with the prior art, the invention has the beneficial effects that:
1) The target detection algorithm of the invention is highly robust: occlusion, varying illumination, similar backgrounds, and unclear targets do not compromise the detection result, and missed and false detections are greatly reduced.
2) In the Faster RCNN method, the invention replaces the VGG16 network used for extracting image features with ResNeXt, a 101-layer aggregated residual transformation network that is deeper and has stronger expressive capability, so that the target features are learned completely, solving the detection problems caused by external factors such as unclear targets, very small targets, and high similarity between target and background.
3) As the number of network layers increases, detection accuracy improves, but more training time is consumed. To ease training, the structure of the residual unit is changed from the conventional "post-activation" form to a "pre-activation" form, so that information flows more smoothly through the network during forward and backward propagation.
4) Taking the basic convolution operation as the entry point for improvement, automatic deformable convolution is introduced. The size and position of the convolution kernel can be dynamically adjusted according to the image content currently being recognized, so the method adapts better to detection tasks involving geometric deformation of an object's shape and size, and can detect deformed targets appearing in different states under different viewing angles.
5) For the common situation in which targets cross, overlap, and occlude one another, the traditional Faster RCNN algorithm removes candidate boxes with non-maximum suppression, which causes missed detections. To address this problem, a new soft non-maximum suppression strategy is adopted to eliminate the missed detections.
Drawings
FIG. 1 is the modified residual unit structure of the present invention;
FIG. 2 is the overall network framework diagram of the present invention;
FIG. 3 is a flow chart of the softNMS algorithm.
Detailed Description
The following detailed description of the present invention will be made with reference to the accompanying drawings.
A target detection method based on deep learning comprises the following steps:
1) Replacing the VGG16 network used for extracting image features in the Faster RCNN method with a 101-layer residual network having stronger expressive capability and deeper layers;
A ResNeXt network with a 101-layer structure is used to learn the target features. ResNeXt is an upgraded version of the ResNet network: it retains ResNet's basic stacking strategy and is built from parallel blocks sharing the same topology, but splits each ResNet path into 32 independent paths (the number of paths is called the cardinality). The 32 paths perform convolution operations on the input simultaneously, and the outputs of the different paths are finally summed to obtain the result. This makes the division of labor within the network more explicit and strengthens its local adaptability. The front-end network is replaced with the 101-layer ResNeXt network, whose basic building unit also provides identity mapping and shortcut connection mechanisms, so the learning capability of the model is far superior to that of other deep learning models.
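For illustration, the 32 parallel paths can be expressed equivalently as a single grouped convolution. The following is a minimal PyTorch-style sketch of one such block; the channel widths are plausible assumptions for illustration, not values taken from the patent:

```python
import torch
import torch.nn as nn

class ResNeXtBottleneck(nn.Module):
    """ResNeXt block: 32 parallel paths expressed as one grouped 3x3 conv."""
    def __init__(self, in_ch=256, bottleneck_ch=128, out_ch=256, cardinality=32):
        super().__init__()
        self.transform = nn.Sequential(
            nn.Conv2d(in_ch, bottleneck_ch, 1, bias=False),   # reduce dimension
            nn.BatchNorm2d(bottleneck_ch), nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck_ch, bottleneck_ch, 3, padding=1,
                      groups=cardinality, bias=False),        # 32 parallel paths
            nn.BatchNorm2d(bottleneck_ch), nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck_ch, out_ch, 1, bias=False),  # restore dimension
            nn.BatchNorm2d(out_ch),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # identity shortcut plus aggregated residual transformations
        return self.relu(x + self.transform(x))

y = ResNeXtBottleneck()(torch.randn(1, 256, 14, 14))  # -> (1, 256, 14, 14)
```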
The forward propagation of the residual network is linear: the input of a later layer is the sum of the input of an earlier layer and the outputs of the intervening residual units. After multiple iterations, the expression for a deep unit L is obtained:

$$x_L = x_l + \sum_{i=l}^{L-1} F(x_i, w_i)$$

where $x_L$ represents the output vector of the L-th (deep) layer, $x_l$ the input vector of the l-th (shallow) layer, and $F(x_i, w_i)$ the residual of the i-th layer.
2) To make network training easier, the conventional "post-activation" structure of the residual unit is changed to "pre-activation". The conventional residual unit structure has two characteristics: 1. the BN layer and the ReLU layer come after the Conv layer, i.e., a Conv-BN-ReLU structure; 2. a second ReLU layer follows the addition. The output of a conventional residual unit is:

$$X_{l+1} = f(X_l + F(X_l, w_l))$$

where $F(X_l, w_l)$ is the residual, f is the ReLU activation function, and $X_{l+1}$ is the output of the current layer, i.e., the input of the next layer.
In the conventional residual unit structure, a ReLU activation function follows the weighted summation; when its input signal is negative, propagation is truncated, both branches of the residual unit are affected, and information can only propagate directly between two adjacent residual units. Therefore, both BN and ReLU are moved in front of the weight layer, the ReLU activation function is moved onto the residual-function branch, and an identity mapping is constructed, forming the "pre-activation" arrangement; the shortcut-connection branch is then no longer affected. The improved residual unit structure is shown in fig. 1.
In the modified residual unit structure, $x_L$ can be regarded as $x_l$ plus an accumulation of residuals. In this way, the gradient can be propagated back in full when the network performs back-propagation, and information is transmitted smoothly. The improved residual unit structure can be represented by the following formula:

$$\frac{\partial \varepsilon}{\partial x_l} = \frac{\partial \varepsilon}{\partial x_L}\left(1 + \frac{\partial}{\partial x_l}\sum_{i=l}^{L-1} F(x_i, W_i)\right)$$

where $\varepsilon$ represents the loss error, expressed as $\varepsilon = \frac{1}{2}(x_{label} - x_L)^2$; $x_L$ denotes the layer-L prediction, $x_{label}$ the label vector corresponding to layer L, and $F(x_i, W_i)$ represents the residual.
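The ordering change can be illustrated with a short, hedged PyTorch sketch of a pre-activation residual unit (BN and ReLU in front of each weight layer, no activation after the addition); the channel count is an illustrative assumption:

```python
import torch
import torch.nn as nn

class PreActResidualUnit(nn.Module):
    """Pre-activation residual unit: BN-ReLU-Conv instead of Conv-BN-ReLU,
    with no ReLU after the addition, so the shortcut stays an identity map."""
    def __init__(self, channels=256):
        super().__init__()
        self.residual = nn.Sequential(
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
        )

    def forward(self, x):
        # x_{l+1} = x_l + F(x_l, w_l): the identity term lets the gradient
        # flow back unattenuated during back-propagation.
        return x + self.residual(x)

y = PreActResidualUnit()(torch.randn(1, 256, 14, 14))  # -> (1, 256, 14, 14)
```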
3) In order to adapt better to target deformation, automatic deformable convolution is introduced. An offset variable $\Delta p_n$ is added to each sampling point of the convolution kernel, giving the kernel the ability to deform. With the offset variables added, the network automatically learns the offsets from the back-propagated error and adjusts the shape of the convolution kernel accordingly, so the size of the deformable convolution kernel and the positions of its sampling points are adjusted dynamically according to the image content, further strengthening the network's ability to adapt to spatial geometric deformation.
For example, a 3 × 3 convolution kernel first samples 9 positions from the input image or feature map x; the grid

$$\mathcal{R} = \{(-1,-1), (-1,0), \ldots, (0,1), (1,1)\}$$

defines the size of the receptive field, where (-1,-1) denotes the sampling point to the upper left of $x(p_0)$, (1,1) the sampling point to the lower right, and so on.
For each pixel $p_0$ on the output feature map y of a conventional convolutional neural network:

$$y(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n) \cdot x(p_0 + p_n)$$

where $p_n$ enumerates the positions in $\mathcal{R}$.
In order to adapt better to target deformation, automatic deformable convolution is introduced: an offset variable $\Delta p_n$ is added to each sampling point of the convolution kernel, giving the kernel the ability to deform. For automatic deformable convolution the formula becomes:

$$y(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n) \cdot x(p_0 + p_n + \Delta p_n)$$

where $\Delta p_n$ is the offset corresponding to each sampling position $p_n$, and the elements of $\mathcal{R}$ are augmented by the offsets $\{\Delta p_n \mid n = 1, \ldots, N\}$ with $N = |\mathcal{R}|$. Sampling therefore occurs at the irregular, offset positions $p_n + \Delta p_n$.
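As a hedged sketch of this sampling-offset idea, the following uses torchvision's deform_conv2d operator as a stand-in for the automatic deformable convolution described above; the offset field (one Δy, Δx pair per kernel position and output pixel) is predicted by an ordinary convolution and learned from the back-propagated error:

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class DeformableConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        # one (dy, dx) offset per kernel position p_n, per output pixel
        self.offset_pred = nn.Conv2d(in_ch, 2 * k * k, k, padding=k // 2)
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.01)
        self.k = k

    def forward(self, x):
        offsets = self.offset_pred(x)  # learned Δp_n for every pixel p_0
        return deform_conv2d(x, offsets, self.weight, padding=self.k // 2)

y = DeformableConv2d(64, 64)(torch.randn(1, 64, 32, 32))  # -> (1, 64, 32, 32)
```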
4) For the common situation in which targets cross, overlap, and occlude one another, a new soft non-maximum suppression strategy is adopted;
The softNMS algorithm is introduced to post-process the candidate boxes, solving the missed detections caused by largely overlapping targets. In the softNMS score resetting function, $\mathrm{iou}(M, b_i)$ is the intersection-over-union of the currently highest-scoring candidate box M and a remaining candidate box $b_i$: the intersection is computed first, the union is obtained by subtracting the intersection from the sum of the areas of the two bounding boxes, and the overlap ratio is then transformed to produce the new score $s_i$.
When the overlap of neighboring detection boxes exceeds a threshold, their scores are reduced by this function rather than eliminated outright. The function strongly attenuates the scores of detection boxes very close to M, while boxes far from M are unaffected and remain in the object detection sequence.
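For illustration, the following NumPy sketch implements soft-NMS with the linear score decay published by Bodla et al.; since the patent's exact reset function is not reproduced in this text, this particular decay is an assumption:

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and many; boxes are (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)  # union = areas - inter

def soft_nms(boxes, scores, iou_thresh=0.3, score_thresh=0.001):
    boxes, scores = boxes.astype(float), scores.astype(float)
    keep = []
    while scores.max() > score_thresh:
        m = scores.argmax()                 # highest-scoring box M
        keep.append(boxes[m])
        overlap = iou(boxes[m], boxes)
        decay = np.where(overlap >= iou_thresh, 1.0 - overlap, 1.0)
        scores = scores * decay             # down-weight neighbours of M ...
        scores[m] = 0.0                     # ... instead of deleting them outright
    return np.array(keep)
```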
Referring to fig. 2, the target detection method based on deep learning comprises: the 101-layer ResNeXt network extracts the target features, the RPN network generates 300 region proposal boxes, the candidate boxes are determined and classified, the boxes are refined by regression, and the detection result is obtained.
The method comprises the following concrete steps:
Step 1: download the ImageNet pre-training model, place it under the designated folder, and use it as the initialization parameters.
Step 2: prepare the Pascal_VOC data set and convert it into the lmdb file format accepted by the Caffe framework.
Step 3: fine-tune the parameters of the pre-trained model with the Pascal_VOC data set.
Step 4: place the finally generated network model under the specified folder for target detection.
Step 5: input an image of arbitrary size to be detected into the network.
Step 6: feature extraction. A 101-layer ResNeXt network is used for extraction; the convolution kernels use deformable convolution, the basic residual unit structure is "pre-activation", and the convolution layers of the residual blocks all adopt the "bottleneck" design: the dimension is first reduced by a 1 × 1 convolution layer to cut the computation; a 3 × 3 convolution layer then performs the convolution; finally another 1 × 1 convolution restores the dimension, reducing the computation without affecting accuracy.
Step 7: a region proposal network (RPN) directly generates 300 high-quality proposal boxes; the region proposal network and the detection network share the image convolution features, which greatly reduces the cost of computing region proposals. To generate the proposal windows, a sliding window is applied to the feature map; considering that the targets to be detected vary in size, sliding windows (anchors) with three aspect ratios of 1:1, 2:1, and 1:2 and three scales of 8, 16, and 32 are used, so that each pixel position yields 9 types of sliding windows.
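A small sketch of the 9 anchor shapes (3 aspect ratios × 3 scales) follows, assuming the base anchor size of 16 pixels used in the original Faster RCNN code (an assumption, not stated in the patent):

```python
import numpy as np

def make_anchors(base=16, ratios=(1.0, 2.0, 0.5), scales=(8, 16, 32)):
    """Return 9 (w, h) anchor shapes: 3 aspect ratios x 3 scales."""
    anchors = []
    for r in ratios:
        for s in scales:
            side = base * s            # scaled base size
            w = side / np.sqrt(r)      # aspect ratio r = h / w
            h = side * np.sqrt(r)
            anchors.append((round(w), round(h)))
    return anchors

print(make_anchors())  # 9 shapes, from (128, 128) up to (362, 724)
```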
The concrete implementation of the RPN: first, a small network is slid over the feature map output by the last convolution layer, fully connected to an n × n spatial window of the input convolution map. The features of each sliding window are mapped to a low-dimensional vector, which is then fed to two sibling fully connected layers: a box classification layer (cls layer) and a box regression layer (reg layer). The box classification layer generates the sliding windows, crops and filters them, and judges via softmax whether each window contains an object or background, outputting the object and non-object probabilities without identifying what the object specifically is. The box regression layer computes the bounding-box regression offsets of the sliding window to obtain an accurate proposal box; its output is the four parameters of the regressed box, namely the center coordinates x and y, the width w, and the height h.
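A hedged PyTorch sketch of such an RPN head (a 3 × 3 sliding convolution feeding sibling cls and reg branches for k = 9 anchors); the channel widths here are illustrative assumptions:

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    def __init__(self, in_ch=1024, mid_ch=256, num_anchors=9):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, mid_ch, 3, padding=1)  # sliding window
        self.cls = nn.Conv2d(mid_ch, 2 * num_anchors, 1)    # object / background
        self.reg = nn.Conv2d(mid_ch, 4 * num_anchors, 1)    # (x, y, w, h) offsets

    def forward(self, feat):
        h = torch.relu(self.conv(feat))
        return self.cls(h), self.reg(h)

scores, deltas = RPNHead()(torch.randn(1, 1024, 38, 50))
# scores: (1, 18, 38, 50), deltas: (1, 36, 38, 50)
```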
Step 8: the softNMS algorithm is used to cull overlapping predicted candidate boxes. First, all detection boxes are sorted by score and the box M with the highest score is selected; all other detection boxes that overlap M beyond a predetermined threshold are then suppressed. The function strongly attenuates the probability values of detection boxes very close to M, while boxes far from M are unaffected and remain in the object detection sequence, keeping the most valuable candidate boxes. This process is applied recursively to the remaining boxes; the concrete implementation flow is shown in fig. 3.
Step 9: RoI pooling. After the RPN network generates 300 candidate boxes, the target detection network maps the proposal windows onto the last layer's feature map, and the RoI pooling layer converts each proposal window to a fixed size. This layer uses the candidate boxes generated by the RPN together with the feature map obtained by the last layer of ResNeXt to perform the mapping, obtaining fixed-size feature maps and reducing the amount of data to be processed while retaining the useful information. Since the RPN generates more than one rectangular box, each candidate box is traversed and its coordinate values are divided by 16, so that a candidate box generated on the original image is mapped onto the feature map, determining a region there; according to the parameters pooled_w = 7 and pooled_h = 7, this region is divided into 49 (7 × 7) sections of equal size, max pooling is applied to each section, and a 7 × 7 feature map is finally produced.
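This mapping-and-pooling step can be sketched with torchvision's roi_pool, where spatial_scale = 1/16 performs the divide-by-16 coordinate mapping and output_size = (7, 7) produces the 49 equal max-pooled bins; this is a sketch under those assumptions, not the patent's implementation:

```python
import torch
from torchvision.ops import roi_pool

feat = torch.randn(1, 1024, 38, 50)                # stride-16 feature map
rois = torch.tensor([[0, 64., 64., 320., 320.]])   # (batch_idx, x1, y1, x2, y2) in image coords
pooled = roi_pool(feat, rois, output_size=(7, 7), spatial_scale=1.0 / 16)
print(pooled.shape)  # torch.Size([1, 1024, 7, 7])
```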
Step 10: fully connected operation. Each neuron is connected to every pixel of the input feature map, integrating the shallow features into different functional representations of the data.
Step 11: after the fully connected operation is performed in the fully connected layer, the concrete classification of the target is completed using Softmax Loss.
Step 12: bounding-box regression. A candidate window is generally defined by a four-dimensional vector (x, y, w, h), where x and y are the center coordinates of the candidate window, w the width, and h the height.
To localize the target more accurately, the original candidate box is dynamically adjusted so that, after adjustment, a regression window closer to the real window G is obtained. That is, given the original box coordinates $(P_x, P_y, P_w, P_h)$, translation and scaling operations yield the regressed window $\hat{G}$:

$$\hat{G}_x = P_w d_x(P) + P_x, \quad \hat{G}_y = P_h d_y(P) + P_y, \quad \hat{G}_w = P_w \exp(d_w(P)), \quad \hat{G}_h = P_h \exp(d_h(P))$$

where $d_x(P), d_y(P), d_w(P), d_h(P)$ are the learned regression offsets.
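A minimal NumPy sketch of this translate-and-scale adjustment, assuming the standard R-CNN parameterization written above; the delta values in the demonstration call are made up:

```python
import numpy as np

def apply_deltas(P, d):
    """P = (Px, Py, Pw, Ph) center-size box; d = (dx, dy, dw, dh)."""
    Px, Py, Pw, Ph = P
    dx, dy, dw, dh = d
    Gx = Pw * dx + Px        # translate the center
    Gy = Ph * dy + Py
    Gw = Pw * np.exp(dw)     # scale the width and height
    Gh = Ph * np.exp(dh)
    return Gx, Gy, Gw, Gh

print(apply_deltas((100, 100, 50, 80), (0.1, -0.05, 0.2, 0.0)))
# -> (105.0, 96.0, 61.07..., 80.0)
```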
Step 13: iterate until detection is complete.
In summary, the technical scheme improves the candidate-region-based Faster RCNN according to the characteristics of targets in real life, such as diverse forms, unclear targets, very small targets, mutual occlusion, partial occlusion, and complex backgrounds. A 101-layer ResNeXt network extracts the target features; the internal structure of the residual unit is changed to the pre-activation form, making network training easier; automatic deformable convolution is introduced so that the network can rely entirely on its internal mechanism to cope with various morphological changes of the target; and the softNMS algorithm is introduced to screen candidate boxes, avoiding missed detections caused by large-area overlap between targets. Extensive experiments show that the method detects targets with high accuracy and strong robustness.
The above embodiments are implemented on the premise of the technical solution of the present invention, and detailed embodiments and concrete operation procedures are given, but the protection scope of the present invention is not limited to the above embodiments. Unless otherwise specified, the methods used in the above examples are conventional methods.

Claims (2)

1. A target detection method based on deep learning is characterized by comprising the following steps:
1) replacing the VGG16 network used for extracting image features in the Faster RCNN method with a 101-layer residual network having stronger expressive capability and deeper layers;
The forward propagation of the residual network is linear: the input of a later layer is the sum of the input of an earlier layer and the outputs of the intervening residual units. After multiple iterations, the expression for a deep unit L is obtained:

$$x_L = x_l + \sum_{i=l}^{L-1} F(x_i, w_i)$$

where $x_L$ represents the output vector of the L-th layer, $x_l$ the input vector of the l-th layer, and $F(x_i, w_i)$ the residual in the i-th layer; the L-th layer is a deep layer and the l-th layer is a shallow layer;
2) Changing the structure of the residual unit to a pre-activation form, so that information flows more smoothly through the network during forward and backward propagation; the pre-activation form is specifically as follows:

to make network training easier, the conventional "post-activation" structure of the residual unit is changed to "pre-activation"; in the modified residual unit structure, $x_L$ can be regarded as $x_l$ plus an accumulation of residuals, so the gradient can be propagated back in full during back-propagation and information is transmitted smoothly; the improved residual unit structure is represented by the following formula:

$$\frac{\partial \varepsilon}{\partial x_l} = \frac{\partial \varepsilon}{\partial x_L}\left(1 + \frac{\partial}{\partial x_l}\sum_{i=l}^{L-1} F(x_i, W_i)\right)$$

where $\varepsilon$ represents the loss error, expressed as $\varepsilon = \frac{1}{2}(x_{label} - x_L)^2$; $x_L$ denotes the layer-L prediction, $x_{label}$ the label vector corresponding to layer L, and $F(x_i, W_i)$ represents the residual;
3) Taking the basic convolution operation as the entry point for improvement, introducing automatic deformable convolution, and dynamically adjusting the size and position of the convolution kernel according to the image content currently being recognized;

a regular grid $\mathcal{R}$ over an n × n convolution kernel defines the size of the receptive field; for each pixel $p_0$ on the output feature map y of a conventional convolutional neural network:

$$y(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n) \cdot x(p_0 + p_n)$$

where $p_n$ enumerates the positions in $\mathcal{R}$, $x(p_0 + p_n)$ is a sampled point, and $w(p_n)$ represents the corresponding weight;

in order to adapt better to target deformation, automatic deformable convolution is introduced: an offset variable $\Delta p_n$ is added to each sampling point of the convolution kernel, giving the kernel the ability to deform; for automatic deformable convolution the formula becomes:

$$y(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n) \cdot x(p_0 + p_n + \Delta p_n)$$

where $\Delta p_n$ is the offset corresponding to each sampling position $p_n$, the elements of $\mathcal{R}$ are augmented by the offsets $\{\Delta p_n \mid n = 1, \ldots, N\}$, and $N = |\mathcal{R}|$; sampling therefore occurs at the irregular, offset positions $p_n + \Delta p_n$.
2. The target detection method based on deep learning according to claim 1, further comprising:
for the common situation in which targets cross, overlap, and occlude one another, adopting a new soft non-maximum suppression strategy;
introducing the softNMS algorithm to post-process the candidate boxes, solving the missed detections caused by largely overlapping targets, by resetting candidate-box scores; in the score resetting function, $\mathrm{iou}(M, b_i)$ is the intersection-over-union of the currently highest-scoring candidate box M and a remaining candidate box $b_i$: the intersection is computed first, the union is obtained by subtracting the intersection from the sum of the areas of the two bounding boxes, and the overlap ratio is then transformed to produce the new score $s_i$;
when the overlap of neighboring detection boxes exceeds a threshold, their scores are reduced by this function rather than eliminated outright; the function strongly attenuates the scores of detection boxes very close to M, while boxes far from M are unaffected and remain in the object detection sequence.
CN201910836094.9A 2019-09-05 2019-09-05 Target detection method based on deep learning Pending CN110569782A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910836094.9A CN110569782A (en) 2019-09-05 2019-09-05 Target detection method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910836094.9A CN110569782A (en) 2019-09-05 2019-09-05 Target detection method based on deep learning

Publications (1)

Publication Number Publication Date
CN110569782A true CN110569782A (en) 2019-12-13

Family

ID=68777979

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910836094.9A Pending CN110569782A (en) 2019-09-05 2019-09-05 Target detection method based on deep learning

Country Status (1)

Country Link
CN (1) CN110569782A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108427920A (en) * 2018-02-26 2018-08-21 杭州电子科技大学 A kind of land and sea border defense object detection method based on deep learning
CN109325504A (en) * 2018-09-07 2019-02-12 中国农业大学 A kind of underwater sea cucumber recognition methods and system
CN109299688A (en) * 2018-09-19 2019-02-01 厦门大学 Ship Detection based on deformable fast convolution neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
KAIMING HE et al.: "Identity Mappings in Deep Residual Networks", arXiv:1603.05027v3 *
SAINING XIE et al.: "Aggregated Residual Transformations for Deep Neural Networks", arXiv:1611.05431v2 *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113052756A (en) * 2019-12-27 2021-06-29 武汉Tcl集团工业研究院有限公司 Image processing method, intelligent terminal and storage medium
CN111738069A (en) * 2020-05-13 2020-10-02 北京三快在线科技有限公司 Face detection method and device, electronic equipment and storage medium
CN111967399A (en) * 2020-08-19 2020-11-20 辽宁科技大学 Improved fast RCNN behavior identification method
CN112348187A (en) * 2020-11-11 2021-02-09 东软睿驰汽车技术(沈阳)有限公司 Training method and device of neural network model and electronic equipment
CN112651994A (en) * 2020-12-18 2021-04-13 零八一电子集团有限公司 Ground multi-target tracking method
CN112529095B (en) * 2020-12-22 2023-04-07 合肥市正茂科技有限公司 Single-stage target detection method based on convolution region re-registration
CN112529095A (en) * 2020-12-22 2021-03-19 合肥市正茂科技有限公司 Single-stage target detection method based on convolution region re-registration
CN112749644A (en) * 2020-12-30 2021-05-04 大连海事大学 Improved deformable convolution-based Faster RCNN fire smoke detection method
CN112749644B (en) * 2020-12-30 2024-02-27 大连海事大学 Faster RCNN fire smoke detection method based on improved deformable convolution
CN112893159A (en) * 2021-01-14 2021-06-04 陕西陕煤曹家滩矿业有限公司 Coal gangue sorting method based on image recognition
CN112893159B (en) * 2021-01-14 2023-01-06 陕西陕煤曹家滩矿业有限公司 Coal gangue sorting method based on image recognition
CN113569752B (en) * 2021-07-29 2023-07-25 清华大学苏州汽车研究院(吴江) Lane line structure identification method, device, equipment and medium
CN113569752A (en) * 2021-07-29 2021-10-29 清华大学苏州汽车研究院(吴江) Lane line structure identification method, device, equipment and medium
CN113794915A (en) * 2021-09-13 2021-12-14 海信电子科技(武汉)有限公司 Server, display equipment, poetry and song endowing generation method and media asset playing method
CN113794915B (en) * 2021-09-13 2023-05-05 海信电子科技(武汉)有限公司 Server, display device, poetry and singing generation method and medium play method

Similar Documents

Publication Publication Date Title
CN110569782A (en) Target detection method based on deep learning
CN109584248B (en) Infrared target instance segmentation method based on feature fusion and dense connection network
CN108304873B (en) Target detection method and system based on high-resolution optical satellite remote sensing image
CN111753828B (en) Natural scene horizontal character detection method based on deep convolutional neural network
CN108062525B (en) Deep learning hand detection method based on hand region prediction
CN109522908A (en) Image significance detection method based on area label fusion
CN103049763B (en) Context-constraint-based target identification method
CN110032925B (en) Gesture image segmentation and recognition method based on improved capsule network and algorithm
CN110633708A (en) Deep network significance detection method based on global model and local optimization
CN111160249A (en) Multi-class target detection method of optical remote sensing image based on cross-scale feature fusion
Asokan et al. Machine learning based image processing techniques for satellite image analysis-a survey
CN110689021A (en) Real-time target detection method in low-visibility environment based on deep learning
CN109035300B (en) Target tracking method based on depth feature and average peak correlation energy
CN112364865B (en) Method for detecting small moving target in complex scene
CN108230330B (en) Method for quickly segmenting highway pavement and positioning camera
CN114758288A (en) Power distribution network engineering safety control detection method and device
CN113592911B (en) Apparent enhanced depth target tracking method
CN104657980A (en) Improved multi-channel image partitioning algorithm based on Meanshift
CN109977834B (en) Method and device for segmenting human hand and interactive object from depth image
CN111898432A (en) Pedestrian detection system and method based on improved YOLOv3 algorithm
CN109165658B (en) Strong negative sample underwater target detection method based on fast-RCNN
CN105354547A (en) Pedestrian detection method in combination of texture and color features
Ju et al. A novel fully convolutional network based on marker-controlled watershed segmentation algorithm for industrial soot robot target segmentation
Yang et al. SiamMMF: multi-modal multi-level fusion object tracking based on Siamese networks
CN111797795A (en) Pedestrian detection algorithm based on YOLOv3 and SSR

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination