CN112507861A - Pedestrian detection method based on multilayer convolution feature fusion
- Publication number: CN112507861A (application number CN202011409937.6A)
- Authority: CN (China)
- Prior art keywords: convolution, feature, network, downsampling, scale
- Prior art date: 2020-12-04
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06V 40/10 — Human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands
- G06F 18/23213 — Non-hierarchical clustering techniques using statistics or function optimisation with a fixed number of clusters, e.g. K-means clustering
- G06F 18/253 — Fusion techniques of extracted features
- G06N 3/045 — Combinations of networks
- G06N 3/08 — Learning methods
Abstract
The invention discloses a pedestrian detection method based on multi-layer convolution feature fusion. By restructuring the residual network, a new feature extraction network, Darknet-61, is constructed that is capable of six downsampling operations; with this network, the output of the YOLO output layer of the YOLOv3 algorithm is increased from 3 layers to 5 layers. Target candidate boxes are obtained by the k-means algorithm on the basis of the YOLOv3 algorithm with 5-layer output, and the current optimal candidate box among the target candidate boxes is further processed by the NMS (non-maximum suppression) method. By improving the Darknet-53 feature extraction network, the invention introduces four residual modules and a convolution layer, increases the number of downsampling operations, outputs a 7 × 7 feature map, enhances the characterization capability of low-level features, and improves the accuracy of large-scale pedestrian detection.
Description
Technical Field
The invention relates to the field of pedestrian detection, in particular to a pedestrian detection method based on multilayer convolution feature fusion.
Background
Pedestrian detection is a core technology of intelligent equipment. It allows machines to acquire surrounding video or image information and, using the powerful analysis capability of computers, to process the acquired information visually, giving them the ability to observe and analyze complex objects such as people and helping people to complete various recognition and detection tasks.
Pedestrian detection is an important research direction in the field of computer vision. A computer identifies whether pedestrians are present in images or video clips and, if so, further detects their specific positions so that the position coordinates of the pedestrians are accurately calibrated. It is widely applied in fields such as manufacturing, the military, and medical treatment.
Pedestrians are among the most difficult objects to detect in target detection. Traditional pedestrian detection algorithms based on hand-crafted feature extraction cannot achieve a major breakthrough in detection accuracy: backgrounds in real environments are complex and varied and interfere with detection in different scenes, and the variability of pedestrian poses and differences in scale make it hard to characterize pedestrians with traditional features, so the miss rate increases in practical detection. The most serious issue is occlusion, which complicates sample collection and inevitably challenges detection accuracy.
In order to better solve these problems, the invention provides a pedestrian detection method based on multi-layer convolution feature fusion. After the second downsampling in the feature extraction network, the information obtained by connecting 2 residual modules is fused with the information of the lower-layer network, and a 112 × 112 feature map is output. At the same time, 4 residual modules and one downsampling convolution layer are added to perform the 6th downsampling; the downsampled features produce convolution features through the residual modules, and finally 7 × 7 feature maps are output through the convolution modules. The output of the YOLO layer is thus changed from the original three scale feature maps to five scale feature maps, which enhances the detection accuracy for both large and small pedestrian targets and improves the feature description capability of the network. The improved network achieves higher accuracy and a lower false alarm rate while maintaining the robustness of the original algorithm.
Disclosure of Invention
The invention provides a pedestrian detection method based on multi-layer convolution feature fusion, aiming to solve the problem of missed detections when pedestrians of different scales are detected in the prior art.
The pedestrian detection method based on multi-layer convolution feature fusion of the invention comprises the following steps:
Step 1: constructing the residual network of the feature extraction network Darknet, and merging the parameters of the BN layer in the residual network basic unit into its convolution layer; constructing a feature extraction network from the constructed residual network, denoted the feature extraction network Darknet-61;
Step 2: constructing a feature pyramid network, and fusing the 5 convolution features of the image obtained by the feature extraction network Darknet-61 through 6 downsampling operations with the 7 × 7, 14 × 14, 28 × 28 and 56 × 56 scale information output by YOLO, so that the YOLO output layer in the YOLOv3 algorithm outputs feature maps of 5 scales, the 5 scales comprising: 7 × 7, 14 × 14, 28 × 28, 56 × 56 and 112 × 112;
Step 3: optimizing the YOLOv3 algorithm according to the feature extraction network Darknet-61 and the YOLO output layer;
Step 4: obtaining a plurality of target candidate boxes on the feature maps of 5 scales output by the optimized YOLOv3 algorithm using the k-means algorithm;
Step 5: among the plurality of target candidate boxes on the feature maps, selecting the target candidate box with the largest IOU by applying the NMS (non-maximum suppression) method, and predicting the pedestrian target according to the selected target candidate box.
Further, in the step 1, merging the parameters of the BN layer in the residual network basic unit into the convolutional layer thereof, specifically:
W_merged = γ·W / √(σ² + ε)
B_merged = γ·(B − μ) / √(σ² + ε) + β
where W_merged is the merged convolution weight; W is the convolution weight; B_merged is the merged convolution bias; B is the convolution bias; μ is the mean of the BN layer; σ² is the variance of the BN layer; γ is the scaling factor; β is the offset; ε is a small constant.
Further, in step 2, the feature extraction network Darknet-61 obtains the 5 convolution features of the image through six downsampling operations, specifically including the following steps:
Step A21: using an image of size 448 × 448 as the network input of Darknet-61 and performing the first downsampling;
Step A22: performing the second downsampling, performing feature extraction on the second downsampling result using the 2 residual networks constructed in step 1, and outputting a first convolution feature of 112 × 128;
Step A23: performing the third downsampling, performing feature extraction on the third downsampling result using the 8 residual networks constructed in step 1, and outputting a second convolution feature of 56 × 256;
Step A24: performing the fourth downsampling, performing feature extraction on the fourth downsampling result using a convolution with 512 channels, and outputting a third convolution feature of 28 × 512;
Step A25: performing the fifth downsampling, performing feature extraction on the fifth downsampling result using the 4 residual networks constructed in step 1, and outputting a fourth convolution feature of 14 × 1024;
Step A26: performing the sixth downsampling, performing feature extraction on the sixth downsampling result using the 4 residual networks constructed in step 1, and outputting a fifth convolution feature of 7 × 2028.
Further, the specific steps of step 2 are as follows:
Step B21: the feature extraction network Darknet-61 obtains the 5 convolution features of the image through six downsampling operations, and a feature map of 7 × 7 scale is obtained by convolving the fifth convolution feature;
a feature pyramid network is constructed, and the feature map of 7 × 7 scale is fused with the fourth convolution feature through the feature pyramid network to obtain a feature map of 14 × 14 scale;
Step B22: the feature map of 14 × 14 scale is fused with the third convolution feature through the feature pyramid network to obtain a feature map of 28 × 28 scale;
Step B23: the feature map of 28 × 28 scale is fused with the second convolution feature through the feature pyramid network to obtain a feature map of 56 × 56 scale;
Step B24: the feature map of 56 × 56 scale is fused with the first convolution feature through the feature pyramid network to obtain a feature map of 112 × 112 scale.
Further, the number of target candidate boxes acquired in step 4 is 3.
The method has the following advantages:
1. By improving the Darknet-53 feature extraction network, four residual networks and convolution layers are introduced, the number of downsampling operations is increased, and a 7 × 7 feature map is output, which enhances the characterization capability of low-level features and improves the accuracy of large-scale pedestrian detection;
2. In the feature extraction network, the information obtained by the 2 residual modules used after the second downsampling is fused with the information of the lower network, and a 112 × 112 feature map is output, which improves the accuracy of small-scale pedestrian detection;
3. The FPN is used to fully fuse the deep and shallow feature information of the image, and the output of the YOLO layer is increased from the original three scale feature maps to five scale feature maps, which enhances the detection of large and small pedestrian targets and of mutually occluded pedestrian targets and improves the robustness of pedestrian detection;
4. The parameters of the BN layer in the residual network basic unit are merged into its convolution layer, which reduces the amount of computation and improves the detection speed.
With the detection speed kept sufficient for real-time operation, the detection accuracy is effectively improved, especially for small targets.
Drawings
The features and advantages of the present invention will be more clearly understood by reference to the accompanying drawings, which are illustrative and not to be construed as limiting the invention in any way, and in which:
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a structure of the feature extraction network Darknet-61 of the present invention;
FIG. 3 is a network structure of YOLOv3 according to the present invention;
fig. 4 is a schematic structural diagram of an FPN module according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a flowchart of the pedestrian detection method based on multi-layer convolution feature fusion according to the present invention, Fig. 2 shows the structure of the feature extraction network Darknet-61 of the present invention, Fig. 3 shows the network structure of YOLOv3 according to the present invention, and Fig. 4 is a schematic structural diagram of the FPN module according to an embodiment of the present invention. With reference to Figs. 1 to 4, the steps of a specific embodiment of the present invention are:
Step 1: constructing the residual network of the feature extraction network Darknet, and merging the parameters of the BN layer in the residual network basic unit into its convolution layer; constructing a feature extraction network from the constructed residual network, denoted the feature extraction network Darknet-61;
in step 1, parameters of a BN layer in a residual network basic unit are merged into a convolutional layer thereof, which specifically comprises:
W_merged = γ·W / √(σ² + ε)
B_merged = γ·(B − μ) / √(σ² + ε) + β
where W_merged is the merged convolution weight; W is the convolution weight; B_merged is the merged convolution bias; B is the convolution bias; μ is the mean of the BN layer; σ² is the variance of the BN layer; γ is the scaling factor; β is the offset; ε is a small constant.
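For illustration only, the above merge can be computed per convolution layer as in the following sketch; the function name, the array shapes and the value of ε are assumptions introduced for the example and are not part of the claimed method.

```python
import numpy as np

def fold_bn_into_conv(W, B, gamma, beta, mu, var, eps=1e-5):
    """Fold BN parameters into the preceding convolution.

    W: conv weights of shape (out_channels, in_channels, k, k)
    B: conv bias of shape (out_channels,)
    gamma, beta, mu, var: per-channel BN scale, offset, mean and variance
    Returns (W_merged, B_merged) such that
    conv(x, W_merged) + B_merged == BN(conv(x, W) + B).
    """
    scale = gamma / np.sqrt(var + eps)            # per-channel factor gamma / sqrt(sigma^2 + eps)
    W_merged = W * scale[:, None, None, None]     # scale every output filter
    B_merged = (B - mu) * scale + beta            # shift the bias accordingly
    return W_merged, B_merged
```

Folding the BN layer in this way removes one normalization pass per residual basic unit at inference time, which is the source of the speed gain noted in the advantages above.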
Step 2: constructing a feature pyramid network, and fusing the 5 convolution features of the image obtained by the feature extraction network Darknet-61 through 6 downsampling operations with the 7 × 7, 14 × 14, 28 × 28 and 56 × 56 scale information output by YOLO, so that the YOLO output layer in the YOLOv3 algorithm outputs feature maps of 5 scales, the 5 scales comprising: 7 × 7, 14 × 14, 28 × 28, 56 × 56 and 112 × 112;
further, in step 2, the feature extraction network Darknet-61 obtains 5 convolution features of the image through six times of downsampling, and the specific steps are as follows:
step A21: using the image of size 448 x 448 as the network input of Darknet-61, performing a first down-sampling;
step A22: performing second downsampling, performing feature extraction on the second downsampling result by using 2 residual error networks constructed in the step 1, and outputting a first convolution feature of 112 × 128;
step A23: performing third downsampling, performing feature extraction on the third downsampling result by using 8 residual error networks constructed in the step 1, and outputting a second convolution feature of 56 × 256;
step A24: performing fourth down-sampling, performing feature extraction on the fourth down-sampling result by using convolution with a channel of 512, and outputting a third convolution feature of 28 × 512;
step A25: performing fifth downsampling, performing feature extraction on the fifth downsampling result by using 4 residual error networks constructed in the step 1, and outputting a fourth convolution feature of 14 × 1024;
step A26: and performing sixth downsampling, performing feature extraction on the sixth downsampling result by using 4 residual error networks constructed in the step 1, and outputting a fifth convolution feature of 7 × 2028.
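The following PyTorch-style sketch mirrors the six downsampling stages of steps A21 to A26; it is only a schematic reading of the description, not the definitive Darknet-61. The stem, the number of residual modules in the first stage, the layer hyperparameters and the 2048-channel width of the last stage (the description writes 2028) are assumptions.

```python
import torch.nn as nn

def conv_bn(in_ch, out_ch, k=3, stride=1):
    # Darknet basic convolution unit: convolution + BN + LeakyReLU
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, stride=stride, padding=k // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.1, inplace=True),
    )

class Residual(nn.Module):
    # residual basic unit: 1x1 bottleneck, 3x3 convolution, skip connection
    def __init__(self, ch):
        super().__init__()
        self.block = nn.Sequential(conv_bn(ch, ch // 2, k=1), conv_bn(ch // 2, ch, k=3))

    def forward(self, x):
        return x + self.block(x)

def residuals(ch, n):
    return nn.Sequential(*[Residual(ch) for _ in range(n)])

class Darknet61Sketch(nn.Module):
    # six stride-2 downsamplings on a 448 x 448 input: 224 -> 112 -> 56 -> 28 -> 14 -> 7
    def __init__(self):
        super().__init__()
        self.stem = conv_bn(3, 32)                                                     # assumption
        self.down1 = nn.Sequential(conv_bn(32, 64, stride=2), residuals(64, 1))        # A21
        self.down2 = nn.Sequential(conv_bn(64, 128, stride=2), residuals(128, 2))      # A22 -> C1 (112)
        self.down3 = nn.Sequential(conv_bn(128, 256, stride=2), residuals(256, 8))     # A23 -> C2 (56)
        self.down4 = nn.Sequential(conv_bn(256, 512, stride=2), conv_bn(512, 512))     # A24 -> C3 (28)
        self.down5 = nn.Sequential(conv_bn(512, 1024, stride=2), residuals(1024, 4))   # A25 -> C4 (14)
        self.down6 = nn.Sequential(conv_bn(1024, 2048, stride=2), residuals(2048, 4))  # A26 -> C5 (7)

    def forward(self, x):                       # x: [N, 3, 448, 448]
        x = self.down1(self.stem(x))
        c1 = self.down2(x)
        c2 = self.down3(c1)
        c3 = self.down4(c2)
        c4 = self.down5(c3)
        c5 = self.down6(c4)
        return c1, c2, c3, c4, c5               # the five convolution features used for fusion
```

Feeding a 448 × 448 image through this sketch returns five maps of spatial sizes 112, 56, 28, 14 and 7, matching steps A22 to A26.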
Further, the specific steps of step 2 are as follows:
Step B21: the feature extraction network Darknet-61 obtains the 5 convolution features of the image through six downsampling operations, and a feature map of 7 × 7 scale is obtained by convolving the fifth convolution feature;
a feature pyramid network is constructed, and the feature map of 7 × 7 scale is fused with the fourth convolution feature through the feature pyramid network to obtain a feature map of 14 × 14 scale;
Step B22: the feature map of 14 × 14 scale is fused with the third convolution feature through the feature pyramid network to obtain a feature map of 28 × 28 scale;
Step B23: the feature map of 28 × 28 scale is fused with the second convolution feature through the feature pyramid network to obtain a feature map of 56 × 56 scale;
Step B24: the feature map of 56 × 56 scale is fused with the first convolution feature through the feature pyramid network to obtain a feature map of 112 × 112 scale.
The shallow information and the deep feature information are thus fused to enhance the representation capability of the image pyramid. The 7 × 7 and 14 × 14 feature maps are suitable for detecting large-size pedestrian targets in the image, the 28 × 28 and 56 × 56 feature maps are suitable for detecting medium-size pedestrian targets, and the 112 × 112 feature map is suitable for detecting small-size pedestrian targets, which reduces the miss rate for pedestrians.
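As a non-authoritative illustration of the fusion in steps B21 to B24, the sketch below runs a top-down feature-pyramid pass over the five backbone features; the 1 × 1 lateral convolutions, nearest-neighbour upsampling, element-wise addition and channel widths are assumptions about the FPN module rather than the exact structure of Fig. 4.

```python
import torch.nn as nn
import torch.nn.functional as F

class FPNFusionSketch(nn.Module):
    # top-down fusion of the five backbone features C1..C5 into five output maps
    def __init__(self, in_channels=(128, 256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        # 1x1 lateral convolutions bring every level to a common channel width
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in in_channels])
        # 3x3 convolutions smooth each fused map before it is passed to a YOLO output layer
        self.smooth = nn.ModuleList([nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                     for _ in in_channels])

    def forward(self, feats):
        c1, c2, c3, c4, c5 = feats                        # 112, 56, 28, 14 and 7 resolutions
        laterals = [lat(c) for lat, c in zip(self.lateral, (c1, c2, c3, c4, c5))]
        outs = [None] * 5
        outs[4] = laterals[4]                             # the 7 x 7 map comes from C5 (step B21)
        for i in range(3, -1, -1):                        # fuse downwards: 14, 28, 56, 112 (B21-B24)
            up = F.interpolate(outs[i + 1], scale_factor=2, mode="nearest")
            outs[i] = laterals[i] + up                    # fuse upsampled deep info with shallow info
        return [s(o) for s, o in zip(self.smooth, outs)]  # five maps for the five YOLO output scales
```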
Step 3: optimizing the YOLOv3 algorithm according to the feature extraction network Darknet-61 and the YOLO output layer;
further, after the second down-sampling of the feature extraction network Darknet-61, information obtained by connecting 2 residual modules is fused with the up-sampling information of the lower network, and a 112 × 112 feature map is output. And 4 residual modules and a downsampling convolution layer are added for downsampling for the 6 th time, the characteristic after downsampling outputs convolution characteristic through the residual modules, finally 7 x 7 characteristic graphs are output through the convolution modules, the output of the YOLO layer is changed from the original three scale characteristic graphs into five scale characteristic graphs, and the detection accuracy of pedestrians on large and small targets is enhanced.
Step 4: obtaining a plurality of target candidate boxes on the feature maps of 5 scales output by the optimized YOLOv3 algorithm using the k-means algorithm;
in order to find the size of the candidate frame, the best centroid point is found by using a k-means algorithm, and the best k value and the distance function are considered, wherein the method specifically comprises the following steps:
1) the K values are sequentially increased from three times to find the optimal K value;
2) taking the found K as the starting point of clustering, taking the best clustering result as the optimal centroid, and then iterating according to a clustering rule to obtain the optimal clustering result;
3) calculating the distances from all the data to the mass center, and classifying the data into corresponding sets nearby;
4) after all data are calculated, the mass center of each set is recalculated;
5) performing set classification on all data again according to the newly calculated centroid;
6) repeating steps (4) and (5) until the newly calculated centroid does not change any more or the distance of the two centroids reaches the threshold value that we expect, and the algorithm terminates.
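A minimal sketch of this clustering is given below, assuming the inputs are the width-height pairs of the ground-truth boxes and using d = 1 − IOU as the distance, a monotone variant of the (1 − IOU)² metric discussed in the next paragraph that yields the same assignments; the initialization, stopping tolerance and function names are illustrative assumptions.

```python
import numpy as np

def iou_wh(boxes, centroids):
    """IOU between boxes and centroids given only widths and heights (boxes aligned at the origin)."""
    w = np.minimum(boxes[:, None, 0], centroids[None, :, 0])
    h = np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    inter = w * h
    union = (boxes[:, None, 0] * boxes[:, None, 1]
             + centroids[None, :, 0] * centroids[None, :, 1] - inter)
    return inter / union

def kmeans_anchors(boxes, k, iters=1000, tol=1e-6, seed=0):
    """Cluster (w, h) pairs into k anchor boxes with the distance d = 1 - IOU."""
    boxes = np.asarray(boxes, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), k, replace=False)]   # initial centroids
    for _ in range(iters):
        d = 1.0 - iou_wh(boxes, centroids)              # distance of every box to every centroid
        assign = d.argmin(axis=1)                       # step 3): assign each box to its nearest set
        new = np.array([boxes[assign == j].mean(axis=0) if np.any(assign == j) else centroids[j]
                        for j in range(k)])             # steps 4)-5): recompute the centroids
        if np.abs(new - centroids).max() < tol:         # step 6): stop when centroids stop moving
            centroids = new
            break
        centroids = new
    return centroids

# example use: anchors = kmeans_anchors(gt_wh, k=15)   # e.g. 3 anchors for each of the 5 scales
```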
For the distance function, 15 boxes are first selected as candidate boxes at one detection point, and the IOU value is used as the distance measure in the clustering function to calculate the centroids; the distance function is taken as d = (1 − IOU)², under which the obtained IOU values are better and the difficulty of box regression in later training is reduced. Because only a single class of object is detected, the weight of the class prediction in the loss function is reduced so that the network converges better; that is, the class error term is given a smaller weight to prevent it from interfering with the overall error and to optimize the loss function. The loss function expression is as follows:
Loss = λ_coord · Σ_{i=0..S²} Σ_{j=0..B} 1_{ij}^{obj} [(x_i − x̂_i)² + (y_i − ŷ_i)²]
+ λ_coord · Σ_{i=0..S²} Σ_{j=0..B} 1_{ij}^{obj} [(√w_i − √ŵ_i)² + (√h_i − √ĥ_i)²]
+ Σ_{i=0..S²} Σ_{j=0..B} 1_{ij}^{obj} (C_i − Ĉ_i)²
+ λ_noobj · Σ_{i=0..S²} Σ_{j=0..B} 1_{ij}^{noobj} (C_i − Ĉ_i)²
+ Σ_{i=0..S²} 1_i^{obj} Σ_{c∈classes} (p_i(c) − p̂_i(c))²
wherein S² denotes the number of grid cells, B denotes the number of boxes generated per grid cell, classes denotes the set of detectable classes (C classes in total), C_i denotes the confidence of the i-th box, 1_{ij}^{obj} indicates whether the object falls into the j-th bounding box of grid cell i and takes the value 0 or 1 (1_{ij}^{noobj} is its complement, and 1_i^{obj} indicates whether any object falls into cell i), λ_coord is the weight of the coordinate error and is typically taken as 5, λ_noobj is the weight of the confidence error for boxes that contain no object, (x_i, y_i) denotes the center coordinates of the i-th preselected box, (w_i, h_i) denotes the width and height of the i-th preselected box, and hatted symbols denote the corresponding ground-truth values.
The coordinate prediction in the first part of the formula is the loss for the position and size of the bounding box; the square roots of the width and height in the second part are taken to reduce the influence of large size differences between bounding boxes. The third part of the formula means that if an object falls into a bounding box, the loss between the predicted confidence that the box contains an object and the IOU between the real object and the bounding box is computed; if no object center falls into a bounding box, the smaller the predicted confidence that it contains an object, the better. However, most bounding boxes contain no object, and weighting these terms equally would bias the loss, so the fourth part of the formula is given the weight λ_noobj = 0.5. In the fifth part of the formula, if an object is contained in the proposal box, the predicted probability of the correct class should be as close to 1 as possible and the probabilities of the wrong classes as close to 0 as possible.
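For illustration only, the five terms of the loss described above can be evaluated for one image as in the following numpy sketch; the tensor layout, the use of the IOU with the ground truth as the confidence target, and all variable names are assumptions made for the example rather than the actual training code of the method.

```python
import numpy as np

def yolo_loss(pred_boxes, pred_conf, pred_cls,
              tgt_boxes, tgt_iou, tgt_cls, obj_mask,
              lambda_coord=5.0, lambda_noobj=0.5):
    """Sum-squared YOLO-style loss for one image.

    pred_boxes, tgt_boxes : [S, S, B, 4] arrays of (x, y, w, h), w and h non-negative
    pred_conf, tgt_iou    : [S, S, B] predicted objectness and IOU with the ground truth
    pred_cls, tgt_cls     : [S, S, C] class probabilities (one-hot target)
    obj_mask              : [S, S, B], 1 where an object is assigned to box j of cell i, else 0
    """
    noobj_mask = 1.0 - obj_mask
    cell_mask = obj_mask.max(axis=-1)                      # 1 if any object falls into the cell

    # part 1: centre-coordinate error
    xy = ((pred_boxes[..., 0:2] - tgt_boxes[..., 0:2]) ** 2).sum(-1)
    # part 2: width/height error on square roots, damping the effect of box-size differences
    wh = ((np.sqrt(pred_boxes[..., 2:4]) - np.sqrt(tgt_boxes[..., 2:4])) ** 2).sum(-1)
    # part 3: confidence error for boxes that contain an object (target is the IOU)
    conf_obj = (pred_conf - tgt_iou) ** 2
    # part 4: confidence error for boxes with no object, down-weighted by lambda_noobj
    conf_noobj = pred_conf ** 2
    # part 5: class-probability error for cells that contain an object
    cls = ((pred_cls - tgt_cls) ** 2).sum(-1)

    return (lambda_coord * (obj_mask * xy).sum()
            + lambda_coord * (obj_mask * wh).sum()
            + (obj_mask * conf_obj).sum()
            + lambda_noobj * (noobj_mask * conf_noobj).sum()
            + (cell_mask * cls).sum())
```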
Step 5: among the plurality of target candidate boxes on the feature maps, selecting the target candidate box with the largest IOU by applying the NMS (non-maximum suppression) method, and predicting the pedestrian target according to the selected target candidate box. The specific steps are as follows:
1) the extracted feature maps of 5 scales are sent to the YOLO network for detection. The maximum number of iterations set by the method is 4000, batch_size is set to 64, subdivisions is set to 16, decay is 0.0005, momentum is 0.9, and the initial learning rate is 0.001; the learning rate can be adjusted appropriately according to the downward trend of the loss. Training is stopped when the loss function value output on the training data set is less than or equal to the threshold or the set maximum number of iterations is reached, so that the trained improved network is obtained.
2) The optimal target bounding box is selected by the non-maximum suppression method: the candidate boxes are sorted by their confidence values, the IOU values between the candidate boxes and the real target boxes are calculated to generate an IOU queue, the bounding box with the largest IOU value is selected to generate the prediction box, and finally the coordinates of the prediction box are mapped back to the original image to output the prediction result (an illustrative sketch of this suppression step is given below).
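The suppression in step 2) can be sketched as a standard confidence-sorted NMS over (x1, y1, x2, y2) boxes; it is offered only as an illustrative reference, and the IOU threshold of 0.45 is an assumption.

```python
import numpy as np

def iou_xyxy(box, boxes):
    """IOU between one box and an array of boxes, all given as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_thresh=0.45):
    """Keep the highest-confidence boxes, suppressing overlaps above iou_thresh."""
    order = scores.argsort()[::-1]                  # sort candidate boxes by confidence
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        if order.size == 1:
            break
        ious = iou_xyxy(boxes[best], boxes[order[1:]])
        order = order[1:][ious <= iou_thresh]       # drop boxes that overlap the kept box too much
    return keep
```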
In order to increase the detection speed of the pedestrian detection system, the feature extraction network Darknet-61 and the FPN module in this embodiment are run on a computer equipped with an NVIDIA GTX 1080Ti GPU and the Ubuntu 16.04 system, so that real-time detection can be achieved.
Although the embodiments of the present invention have been described in conjunction with the accompanying drawings, those skilled in the art may make various modifications and variations without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope defined by the appended claims.
Claims (5)
1. A pedestrian detection method based on multilayer convolution feature fusion is characterized by comprising the following steps:
step 1: constructing the residual network of the feature extraction network Darknet, and merging the parameters of the BN layer in the residual network basic unit into its convolution layer; constructing a feature extraction network from the constructed residual network, denoted the feature extraction network Darknet-61;
step 2: constructing a feature pyramid network, and fusing the 5 convolution features of the image obtained by the feature extraction network Darknet-61 through 6 downsampling operations with the 7 × 7, 14 × 14, 28 × 28 and 56 × 56 scale information output by YOLO, so that the YOLO output layer in the YOLOv3 algorithm outputs feature maps of 5 scales, the 5 scales comprising: 7 × 7, 14 × 14, 28 × 28, 56 × 56 and 112 × 112;
step 3: optimizing the YOLOv3 algorithm according to the feature extraction network Darknet-61 and the YOLO output layer;
step 4: obtaining a plurality of target candidate boxes on the feature maps of 5 scales output by the optimized YOLOv3 algorithm using the k-means algorithm;
step 5: among the plurality of target candidate boxes on the feature maps, selecting the target candidate box with the largest IOU by applying the NMS (non-maximum suppression) method, and predicting the pedestrian target according to the selected target candidate box.
2. The pedestrian detection method based on multilayer convolution feature fusion according to claim 1, wherein merging the parameters of the BN layer in the residual network basic unit into its convolution layer in step 1 specifically comprises:
W_merged = γ·W / √(σ² + ε)
B_merged = γ·(B − μ) / √(σ² + ε) + β
where W_merged is the merged convolution weight; W is the convolution weight; B_merged is the merged convolution bias; B is the convolution bias; μ is the mean of the BN layer; σ² is the variance of the BN layer; γ is the scaling factor; β is the offset; ε is a small constant.
3. The pedestrian detection method based on multilayer convolution feature fusion according to claim 1 or 2, wherein the feature extraction network Darknet-61 in step 2 obtains the 5 convolution features of the image through six downsampling operations, the specific steps being as follows:
step A21: using an image of size 448 × 448 as the network input of Darknet-61 and performing the first downsampling;
step A22: performing the second downsampling, performing feature extraction on the second downsampling result using the 2 residual networks constructed in step 1, and outputting a first convolution feature of 112 × 128;
step A23: performing the third downsampling, performing feature extraction on the third downsampling result using the 8 residual networks constructed in step 1, and outputting a second convolution feature of 56 × 256;
step A24: performing the fourth downsampling, performing feature extraction on the fourth downsampling result using a convolution with 512 channels, and outputting a third convolution feature of 28 × 512;
step A25: performing the fifth downsampling, performing feature extraction on the fifth downsampling result using the 4 residual networks constructed in step 1, and outputting a fourth convolution feature of 14 × 1024;
step A26: performing the sixth downsampling, performing feature extraction on the sixth downsampling result using the 4 residual networks constructed in step 1, and outputting a fifth convolution feature of 7 × 2028.
4. The pedestrian detection method based on multilayer convolution feature fusion according to claim 3, wherein the specific steps of step 2 are as follows:
step B21: the feature extraction network Darknet-61 obtains the 5 convolution features of the image through six downsampling operations, and a feature map of 7 × 7 scale is obtained by convolving the fifth convolution feature;
a feature pyramid network is constructed, and the feature map of 7 × 7 scale is fused with the fourth convolution feature through the feature pyramid network to obtain a feature map of 14 × 14 scale;
step B22: the feature map of 14 × 14 scale is fused with the third convolution feature through the feature pyramid network to obtain a feature map of 28 × 28 scale;
step B23: the feature map of 28 × 28 scale is fused with the second convolution feature through the feature pyramid network to obtain a feature map of 56 × 56 scale;
step B24: the feature map of 56 × 56 scale is fused with the first convolution feature through the feature pyramid network to obtain a feature map of 112 × 112 scale.
5. The pedestrian detection method based on multilayer convolution feature fusion according to claim 1, wherein the number of target candidate boxes acquired in step 4 is 3.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011409937.6A CN112507861A (en) | 2020-12-04 | 2020-12-04 | Pedestrian detection method based on multilayer convolution feature fusion |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011409937.6A CN112507861A (en) | 2020-12-04 | 2020-12-04 | Pedestrian detection method based on multilayer convolution feature fusion |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112507861A true CN112507861A (en) | 2021-03-16 |
Family
ID=74971783
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011409937.6A Pending CN112507861A (en) | 2020-12-04 | 2020-12-04 | Pedestrian detection method based on multilayer convolution feature fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112507861A (en) |
- 2020-12-04 CN CN202011409937.6A patent/CN112507861A/en active Pending
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111368850A (en) * | 2018-12-25 | 2020-07-03 | 展讯通信(天津)有限公司 | Image feature extraction method, image target detection method, image feature extraction device, image target detection device, convolution device, CNN network device and terminal |
CN110660052A (en) * | 2019-09-23 | 2020-01-07 | 武汉科技大学 | Hot-rolled strip steel surface defect detection method based on deep learning |
CN111563525A (en) * | 2020-03-25 | 2020-08-21 | 北京航空航天大学 | Moving target detection method based on YOLOv3-Tiny |
CN111461083A (en) * | 2020-05-26 | 2020-07-28 | 青岛大学 | Rapid vehicle detection method based on deep learning |
CN111898432A (en) * | 2020-06-24 | 2020-11-06 | 南京理工大学 | Pedestrian detection system and method based on improved YOLOv3 algorithm |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113516012A (en) * | 2021-04-09 | 2021-10-19 | 湖北工业大学 | Pedestrian re-identification method and system based on multi-level feature fusion |
CN113516012B (en) * | 2021-04-09 | 2022-04-15 | 湖北工业大学 | Pedestrian re-identification method and system based on multi-level feature fusion |
CN113139615A (en) * | 2021-05-08 | 2021-07-20 | 北京联合大学 | Unmanned environment target detection method based on embedded equipment |
CN113111979A (en) * | 2021-06-16 | 2021-07-13 | 上海齐感电子信息科技有限公司 | Model training method, image detection method and detection device |
CN113111979B (en) * | 2021-06-16 | 2021-09-07 | 上海齐感电子信息科技有限公司 | Model training method, image detection method and detection device |
CN113537014A (en) * | 2021-07-06 | 2021-10-22 | 北京观微科技有限公司 | Improved darknet network-based ground-to-air missile position target detection and identification method |
CN113837199A (en) * | 2021-08-30 | 2021-12-24 | 武汉理工大学 | Image feature extraction method based on cross-layer residual error double-path pyramid network |
CN113837199B (en) * | 2021-08-30 | 2024-01-09 | 武汉理工大学 | Image feature extraction method based on cross-layer residual double-path pyramid network |
CN113792660A (en) * | 2021-09-15 | 2021-12-14 | 江苏科技大学 | Pedestrian detection method, system, medium and equipment based on improved YOLOv3 network |
CN113792660B (en) * | 2021-09-15 | 2024-03-01 | 江苏科技大学 | Pedestrian detection method, system, medium and equipment based on improved YOLOv3 network |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |