CN111027493B - Pedestrian detection method based on deep learning multi-network soft fusion - Google Patents
- Publication number
- CN111027493B CN111027493B CN201911284456.4A CN201911284456A CN111027493B CN 111027493 B CN111027493 B CN 111027493B CN 201911284456 A CN201911284456 A CN 201911284456A CN 111027493 B CN111027493 B CN 111027493B
- Authority
- CN
- China
- Prior art keywords
- pedestrian
- pedestrian candidate
- image
- semantic segmentation
- candidate region
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
Abstract
The invention discloses a pedestrian detection method based on deep learning multi-network soft fusion, and relates to the technical fields of image processing, target detection and deep learning. It includes S1: inputting an image to be processed; S2: inputting the image to be processed into a YOLO v3 pedestrian candidate region generator based on the Darknet-53 network to generate pedestrian candidate regions; S3: inputting the image to be processed into a front-end prediction module and outputting C feature maps; S4: inputting the C feature maps into a semantic segmentation system and outputting C feature maps containing context information; S5: fusing the result of the semantic segmentation system with the pedestrian candidates generated by the pedestrian candidate region generator; S6: outputting the detection image. The invention combines a pedestrian candidate region generator and a semantic segmentation system through soft fusion, efficiently detects pedestrians in various challenging scenes, and at the same time improves the detection capability for small targets.
Description
Technical Field
The invention relates to the technical field of image processing, target detection and deep learning, in particular to a pedestrian detection method based on deep learning multi-network soft fusion.
Background
Object detection is an important problem in computer vision, which requires detecting the position of an object in a video or digital image. The target detection is widely applied to the fields of image detection, target recognition, video monitoring and the like. Pedestrian detection, a branch of the object detection problem, involves detecting specific human categories, and has wide application in the fields of automatic driving, person recognition, robotics, and the like.
The pedestrian detection algorithm aims to draw a bounding box in an image or video that accurately describes the position of a pedestrian in real time. However, this is difficult to achieve because of the tradeoff between accuracy and speed: low-resolution input enables rapid target detection but poor accuracy, while high-resolution input enables more accurate detection at a slower processing speed. When processing relatively simple image scenes with sharp foreground objects, a general pedestrian detection algorithm can achieve good results. But it is far more challenging to accurately describe a pedestrian's location in real time in certain circumstances, such as crowded scenes, occlusion by non-human objects, and varied pedestrian appearance (different poses or clothing styles).
Pedestrian detection can be divided into three main parts: generation of region proposals, feature extraction, and pedestrian confirmation. Conventional methods typically use sliding-window techniques to generate region proposals, Histogram of Oriented Gradients (HOG) or Scale-Invariant Feature Transform (SIFT) features as extractors, and Support Vector Machines (SVM) or adaptive boosting (AdaBoost) for pedestrian confirmation. With the development of deep learning, its application to pedestrian detection has grown, and mainstream methods fall into two types: object-proposal-based and regression-based. Object-proposal-based methods, also referred to as two-stage methods, first generate a set of candidate bounding boxes that may contain pedestrians using a Region Proposal module, then classify and regress the bounding boxes with a deep convolutional neural network; among these methods, improvements in detection performance are mainly built on the RCNN, Fast RCNN and Faster RCNN series. Regression-based target detection methods, also called one-stage methods, are much simpler by comparison: they need no candidate-region extraction or subsequent resampling and can achieve real-time detection to a certain extent, but with lower detection performance than two-stage methods. Regression-based pedestrian detection methods are mainly improved on the YOLO and SSD series to raise detection performance as far as possible while realizing real-time, efficient detection.
Disclosure of Invention
The invention aims to provide a pedestrian detection method based on deep learning multi-network soft fusion, which solves the problem that conventional methods, facing the tradeoff between pedestrian detection accuracy and speed, cannot accurately describe the position of a pedestrian in real time, and which improves detection capability while realizing real-time detection.
The technical scheme adopted by the invention is as follows:
a pedestrian detection method based on deep learning multi-network soft fusion comprises the following steps:
Step 1: input an image to be processed;
Step 2: input the image of step 1 into a YOLO v3 pedestrian candidate region generator based on the Darknet-53 network to generate pedestrian candidate regions;
Step 3: input the image of step 1 into a front-end prediction module and output C feature maps;
Step 4: input the C feature maps of step 3 into a semantic segmentation system and output C binary mask feature maps containing context information;
Step 5: soft-fuse the result of the semantic segmentation system with the pedestrian candidates generated by the pedestrian candidate region generator;
Step 6: output the detection image.
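Under the assumption that each branch is available as a callable, the six steps above can be sketched as follows (all function names are illustrative, not from the patent):

```python
def detect_pedestrians(image, candidate_generator, frontend, segmenter,
                       fuse, threshold=0.5):
    """Illustrative six-step pipeline: the YOLO v3 branch and the
    segmentation branch run independently and are combined at the end."""
    # Step 2: YOLO v3 branch -> list of (box, score) candidates
    candidates = candidate_generator(image)
    # Steps 3-4: front-end dense prediction, then context aggregation
    feature_maps = frontend(image)
    masks = segmenter(feature_maps)
    # Step 5: soft fusion of the two branch outputs
    detections = [(box, fuse(box, score, masks))
                  for box, score in candidates]
    # Step 6: keep only boxes whose fused score clears the threshold
    return [(box, s) for box, s in detections if s >= threshold]
```

The two branches are independent, so in practice they can run in parallel, with overall latency determined by the slower branch.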
Preferably, the step 2 comprises the steps of:
Step 2.1: divide the input picture into S × S cells, assign each cell 3 pedestrian candidate bounding boxes to predict, and train YOLO v3 to obtain the coordinate position information and confidence corresponding to each predicted pedestrian candidate bounding box;
Step 2.2: first, fuse 3 scales in the YOLO v3 network and detect pedestrians independently on the fused feature map of each scale to obtain the coordinate position information of the pedestrian candidate regions;
secondly, cluster the data set with the K-means clustering algorithm to generate initial anchor-box values, assigning 3 anchor boxes at each scale so that each cell predicts 3 pedestrian candidate bounding boxes corresponding to the 3 anchor boxes, for a total of 9 anchor boxes over the 3 scales;
each cell outputs (1 + 4 + C) × 3 values, where 4 is the number of predicted localization values, 1 the confidence score, 3 the number of anchor boxes, and C the number of conditional class probabilities; here C = 1 since only pedestrians are classified, so 18 values are output;
and predicting the coordinate position information of the boundary frame of each pedestrian candidate area by adopting logistic regression:
b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w · e^{t_w}
b_h = p_h · e^{t_h}
wherein: σ is the Sigmoid activation function; (t_x, t_y, t_w, t_h) are the 4 predicted localization values learned by the YOLO v3 network; p_w, p_h are the width and height of the preset prior box; c_x, c_y are the coordinate offsets of the cell; and (b_x, b_y, b_w, b_h) is the finally predicted coordinate position information of the pedestrian candidate bounding box;
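The decoding of the predicted localization values can be sketched directly; the exponential decoding of the box width and height from the prior box follows the standard YOLO v3 scheme (a minimal illustration, variable names ours):

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Decode YOLO v3 outputs (tx, ty, tw, th) into a bounding box:
    the center offsets are squashed by a sigmoid and added to the
    cell's top-left coordinate; width/height scale the prior box."""
    bx = sigmoid(tx) + cx
    by = sigmoid(ty) + cy
    bw = pw * math.exp(tw)
    bh = ph * math.exp(th)
    return bx, by, bw, bh
```

For example, all-zero predictions place the box center at the middle of the cell with the prior box's exact width and height.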
YOLO v3 trains (t_x, t_y, t_w, t_h) with a target loss function of the following form:

L = λ_coord Σ_i Σ_j 1_{ij}^{obj} [(t_x − t'_x)² + (t_y − t'_y)² + (t_w − t'_w)² + (t_h − t'_h)²] + Σ_i Σ_j 1_{ij}^{obj} (C_i − C'_i)² + λ_noobj Σ_i Σ_j 1_{ij}^{noobj} (C_i − C'_i)² + Σ_i 1_i^{obj} Σ_c (p_i(c) − p'_i(c))²

wherein: λ_coord and λ_noobj are constants used to balance the ratio of prediction boxes containing an object to prediction boxes without one; t'_x, t'_y, t'_w and t'_h are the label values; 1_{ij}^{obj} returns 1 if the corresponding real object (ground truth) lies in the j-th prediction box of the i-th grid cell, and 0 otherwise; 1_{ij}^{noobj} returns 0 if the j-th prediction box of the i-th grid cell has a corresponding ground truth, and 1 otherwise; p_i(c) is the probability of the object class, here pedestrian; C'_i is the product of the probability of containing an object and the intersection-over-union (IOU) of the predicted and label bounding boxes; and C_i is the predicted IOU value, namely the confidence;
Step 2.3: during YOLO v3 training, widen the confidence acceptance range of the original YOLO v3 network, i.e. lower the confidence threshold for detected pedestrian candidate regions, so that a large number of candidate regions are generated and the candidates cover all pedestrians in the image to be detected. The training parameters are set as follows: the initial learning rate is 0.001; after 40000 batches the learning rate is reduced to 1/10 of its value, i.e. 0.0001; after 45000 batches it is reduced again to 0.00001, for a total of 50000 batches.
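The piecewise learning-rate schedule in step 2.3 can be written as a simple step function (batch thresholds taken from the text):

```python
def learning_rate(batch):
    """Step decay used for YOLO v3 training: 1e-3 initially,
    divided by 10 after 40000 batches and again after 45000,
    for 50000 batches in total."""
    if batch < 40000:
        return 1e-3
    if batch < 45000:
        return 1e-4
    return 1e-5
```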
Preferably, the step 3 comprises the steps of:
Step 3.1: modify the VGG-16 network by converting the fully connected layers into convolutional layers and deleting the penultimate and antepenultimate max-pooling (Maxpool) and striding layers from the VGG-16 structure, thereby obtaining the front-end prediction module; initialize training with the parameters of the original classification network and output higher-resolution feature maps;
Step 3.2: use the front-end prediction module to perform dense prediction on the image to be detected, generating C preliminary 64 × 64 semantic feature maps.
Preferably, the step 4 comprises the steps of:
Step 4.1: construct a semantic segmentation system that aggregates multi-scale context information. Its input is the C preliminary 64 × 64 semantic feature maps generated by the front-end prediction module. The system comprises 8 network layers; the first 7 are basic multi-scale context aggregation modules, each applying 3 × 3 × C dilated convolution kernels with different dilation factors for feature extraction: layer 1 applies plain convolution, layers 2 to 6 apply dilated convolution with different dilation factors, and layer 7 again applies plain convolution. Each convolution is followed by a pointwise truncation max(·, 0) (i.e. a ReLU), and the feature-map size is kept the same before and after each convolution. The last layer, layer 8, performs a 1 × 1 convolution. Finally, the semantic segmentation system is trained so that it outputs C refined 64 × 64 semantic feature maps.
Step 4.2: dilated convolution aggregates multi-scale context information, supporting exponential expansion of the receptive field without loss of resolution or coverage. The dilated kernel covers an area of size (2^{i+2} − 1) × (2^{i+2} − 1) for dilation factor 2^i, and the receptive field grows to ((2^{i+2} − 1) − (2^{i+1} − 2)) × ((2^{i+2} − 1) − (2^{i+1} − 2)) = (2^{i+1} + 1) × (2^{i+1} + 1), where i = 0, 1, …, n − 2 indexes the dilation. Dilation stops once the receptive field roughly matches the input size, so the dilation factors of layers 2 to 6 are 1, 2, 4, 8 and 16 respectively, and the corresponding receptive fields are 5 × 5, 9 × 9, 17 × 17, 33 × 33 and 65 × 65.
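As a numerical check on step 4.2, the receptive field of a stack of 3 × 3 convolutions grows by twice the dilation factor at each layer; starting from the 3 × 3 field of the first plain convolution, the dilations 1, 2, 4, 8 and 16 reproduce the sizes quoted above (a sketch, not the patent's code):

```python
def receptive_fields(dilations, base=3):
    """Cumulative square receptive field of a stack of 3x3
    convolutions: a 3x3 convolution with dilation d widens the
    receptive field by 2*d."""
    fields = []
    rf = base  # layer 1: plain 3x3 convolution -> 3x3 field
    for d in dilations:
        rf += 2 * d
        fields.append(rf)
    return fields
```

Applied to dilations [1, 2, 4, 8, 16], this yields side lengths 5, 9, 17, 33 and 65, matching the receptive fields of layers 2 to 6.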
Step 4.3: jointly train the semantic segmentation system with the front-end prediction module of step 3 on the Cityscapes data set, with the 'person' and 'rider' classes set as pedestrian and all remaining classes as background. The training parameters are set as follows: stochastic gradient descent (SGD) with mini-batch size 14; the initial learning rate is 0.001; after 40000 batches the learning rate is reduced to 1/10 of its value, i.e. 0.0001, and after 45000 batches it continues to decay to 0.00001, for a total of 60000 batches.
Preferably, the specific steps of step 5 are:
Step 5.1: generate a binary mask feature map from the semantic feature maps of step 4, where foreground pixels are set to 1 to represent the category of interest (e.g. pedestrians) and background pixels are set to 0.
Step 5.2, generating the pedestrian candidate area bounding box (b) generated by the pedestrian candidate area generator in the step 2x,by,bw,bh) Mapping the coordinate position information to a binary mask feature map to obtain a pedestrian candidate region boundary box on the binary mask feature map; scaling the pedestrian candidate region bounding boxes on all the binary mask feature maps to have the same size as the pedestrian kernel;
step 5.3, the soft fusion scale factor is used for weighting and calculating the pixels and the pedestrian kernels in the boundary box of the pedestrian candidate area on the binary mask feature map, and the calculation mode is as follows:
S_Result = S_YOLOv3 × S_ss, where S_ss = (1 / A_BB) Σ_{(i,j)∈BB} Mask(i, j) × Kernel(i, j)
wherein: S_ss is the score that the semantic segmentation feature map output by the semantic segmentation system is a pedestrian; S_YOLOv3 is the score that the pedestrian candidate region output by the pedestrian candidate region generator is a pedestrian; S_Result is the score that the final output is a pedestrian; A_BB is the area of the bounding box; Mask(i, j) is the binary mask pixel value at (i, j) in the image; Kernel(i, j) is the pedestrian kernel value at (i, j) in the image. The kernel tends to have higher values at the center than at the boundary, consistent with the object of interest lying at the center of a bounding box that fits it (e.g. a pedestrian), which has the effect of enhancing detection.
Step 5.4: according to S_Result, remove the bounding boxes of falsely detected pedestrians from the pedestrian candidate regions of step 2, finally obtaining the real pedestrian detection boxes.
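A minimal sketch of the fusion in step 5.3, assuming S_ss is the kernel-weighted average of mask pixels over the (already rescaled) candidate box, consistent with the definitions of A_BB, Mask and Kernel above; all names are illustrative:

```python
def soft_fusion_score(s_yolo, mask_patch, kernel_patch):
    """Compute S_Result = S_YOLOv3 * S_ss, where S_ss is the
    kernel-weighted fraction of foreground pixels inside the
    candidate box (patches given as lists of pixel rows)."""
    area = len(mask_patch) * len(mask_patch[0])  # A_BB
    weighted = sum(m * k
                   for m_row, k_row in zip(mask_patch, kernel_patch)
                   for m, k in zip(m_row, k_row))
    s_ss = weighted / area
    return s_yolo * s_ss
```

A box whose mask patch is entirely background thus receives a fused score of 0 and is discarded in step 5.4, however confident the candidate generator was.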
In summary, due to the adoption of the technical scheme, the invention has the beneficial effects that:
1. the invention effectively improves the pedestrian detection precision by using YOLOv3 as a pedestrian candidate region generator to generate a large number of pedestrian candidate frames;
2. the invention uses the front-end prediction module and semantic segmentation to classify the input images at the pixel level, avoiding the coarse detection of regression-box networks such as YOLOv3, improving target detection capability, and effectively addressing the insufficient detection precision of a single network;
3. the method uses soft fusion to combine the pedestrian candidate boxes with the semantic segmentation binary masks, so that the result is output with fine detail; their combination also improves small-target detection, broadening the range of application;
4. the pedestrian detection framework is formed by running the pedestrian candidate region generator and the semantic segmentation system as two networks in parallel, realizing rapid detection; the system can accurately, efficiently and robustly detect pedestrians and other target classes in various challenging scenes.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.
FIG. 1 is a flow chart of a pedestrian detection system of the present invention;
FIG. 2 is a network structure of the candidate pedestrian generator YOLOv3 of FIG. 1 according to the present invention;
FIG. 3 is a coordinate transformation formula diagram of the Bounding Box in FIG. 2 according to the present invention;
FIG. 4 is a schematic diagram of the architecture of the front-end prediction module base network VGG-16 of FIG. 1 in accordance with the present invention;
FIG. 5 is a convolution structure of the 0 th dilation in the semantic segmentation system of FIG. 1 according to the present invention;
FIG. 6 is a convolution structure of the 1 st dilation in the semantic segmentation system of FIG. 1 according to the present invention;
FIG. 7 is a graph showing the results of the soft fusion of FIG. 1 according to the present invention;
FIG. 8 is a diagram illustrating a context network architecture of the semantic segmentation system of FIG. 1 according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the detailed description and specific examples, while indicating the preferred embodiment of the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
It is noted that relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The features and properties of the present invention are described in further detail below with reference to examples.
Example one
A pedestrian detection method based on deep learning multi-network soft fusion: a flow chart of one implementation is shown in FIG. 1. It comprises two parts run in parallel, pedestrian candidate region extraction and pedestrian semantic segmentation; the final pedestrian detection result of the whole system is refined through semantic segmentation, the operation speed of the system depends on the slower branch, and finally the two results are fused and output by soft fusion. The method specifically comprises the following steps:
step 1: and inputting an image to be processed.
Step 2: inputting the image in step 1 into a YOLOv3 pedestrian candidate area generator of the Darknet-53-based network in FIG. 2, and generating a pedestrian candidate area.
Further, the implementation steps of YOLOv3 in step 2 are as follows:
Step 2.1: first, 3 scales (13 × 13, 26 × 26 and 52 × 52) are fused in the YOLOv3 network, and detection is performed independently on each fused feature map, enhancing the detection of small targets. Secondly, the data set is clustered with the K-means algorithm to generate initial Anchor Box values; 3 Anchor Boxes are assigned at each scale, each cell predicts 3 Bounding Boxes corresponding to the 3 Anchor Boxes, and each cell outputs (1 + 4 + C) × 3 values (4 localization values, 1 confidence score and C conditional class probabilities). Finally, the 4-dimensional position values (t_x, t_y, t_w, t_h) are decoded by the following formulas, as shown in FIG. 3, to obtain the center coordinates (x, y) and the width and height (w, h) of the prediction box:
b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w · e^{t_w}
b_h = p_h · e^{t_h}
wherein: σ(t_x), σ(t_y) are the offsets of the box center relative to the coordinate of the top-left grid point of its cell; σ is the Sigmoid activation function; p_w, p_h are the width and height of the prior box, from which b_w and b_h are computed.
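The anchor initialization described in step 2.1 clusters ground-truth box dimensions; below is a simplified sketch using plain Euclidean k-means on (width, height) pairs. YOLO-style anchor clustering typically uses an IoU-based distance instead, so this is an illustrative stand-in, not the patent's procedure:

```python
import random

def kmeans_anchors(boxes, k, iters=50, seed=0):
    """Cluster (w, h) pairs into k anchor boxes by plain k-means."""
    rng = random.Random(seed)
    centers = rng.sample(boxes, k)
    for _ in range(iters):
        # assign every box to its nearest center
        clusters = [[] for _ in range(k)]
        for w, h in boxes:
            j = min(range(k),
                    key=lambda c: (w - centers[c][0]) ** 2
                                  + (h - centers[c][1]) ** 2)
            clusters[j].append((w, h))
        # recompute each center as the mean of its cluster
        centers = [
            (sum(w for w, _ in cl) / len(cl),
             sum(h for _, h in cl) / len(cl)) if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
    return centers
```

Running this over a data set's box sizes with k = 9 and splitting the resulting anchors across the 3 scales would mirror the 3-anchors-per-scale assignment described above.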
The multitask training target loss function of YOLO v3 is of the following form:

L = λ_coord Σ_i Σ_j 1_{ij}^{obj} [(t_x − t'_x)² + (t_y − t'_y)² + (t_w − t'_w)² + (t_h − t'_h)²] + Σ_i Σ_j 1_{ij}^{obj} (C_i − C'_i)² + λ_noobj Σ_i Σ_j 1_{ij}^{noobj} (C_i − C'_i)² + Σ_i 1_i^{obj} Σ_c (p_i(c) − p'_i(c))²

wherein: λ_coord and λ_noobj are used to balance the ratio of prediction boxes containing an object to prediction boxes without one; t'_x, t'_y, t'_w and t'_h are the label values; 1_{ij}^{obj} returns 1 if the corresponding ground truth lies in the j-th prediction box of the i-th grid cell, and 0 otherwise; 1_{ij}^{noobj} returns 0 if the j-th prediction box of the i-th grid cell has a corresponding ground truth, and 1 otherwise.
Step 2.2: using the fact that each pedestrian candidate region is associated with its localization-box coordinates and confidence score, first lower the confidence threshold of YOLO v3 candidate detection, then generate a large number of candidate regions, and finally detect all real pedestrians.
Step 2.3: load the pre-trained Darknet-53 model trained on ImageNet, delete the original classifier, then fine-tune on the Cityscapes data set. Training uses the Adam optimizer, and the training samples are expanded with data augmentation such as horizontal flipping and adjustments to angle, exposure, hue and saturation, enhancing the generalization of the model and reducing overfitting. The initial learning rate is set to 0.001; after 40000 batches the learning rate is reduced to 1/10 of its value, i.e. 0.0001, and after 45000 batches it continues to decrease to 0.00001.
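Among the augmentations in step 2.3, horizontal flipping is the simplest to sketch; note that a flip must also remap the x-coordinates of the label boxes (a hypothetical helper, not the patent's implementation):

```python
def hflip_with_boxes(image, boxes):
    """Horizontally flip an image (given as rows of pixels) and remap
    each (x, y, w, h) box, where (x, y) is the box's top-left corner."""
    width = len(image[0])
    flipped = [row[::-1] for row in image]
    # the new left edge is the mirror of the old right edge
    new_boxes = [(width - x - w, y, w, h) for x, y, w, h in boxes]
    return flipped, new_boxes
```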
Step 3: input the image of step 1 into the front-end prediction module and output C feature maps.
Further, the specific steps of step 3 are as follows:
Step 3.1: convert the fully connected layers in VGG-16 into convolutional layers, and delete the penultimate and antepenultimate Maxpool and striding layers from the VGG-16 network structure. Specifically, each of these pooling and striding layers is removed, and for each deleted layer the convolutions in all subsequent layers are dilated by a factor of 2; the convolutions in the final layers, following both deleted layers, are therefore dilated by a factor of 4 and initialized using the parameters of the original classification network, resulting in a higher-resolution output. Finally, feature maps are generated at a resolution of 64 × 64.
Step 3.2: adjust the VGG-16 network structure of FIG. 4 to obtain the front-end prediction module for dense prediction.
Step 4: input the C feature maps of step 3 into the semantic segmentation system and output C feature maps containing context information.
Further, the specific steps of step 4 are as follows:
Step 4.1: construct a semantic segmentation system that aggregates multi-scale context information. Its input is the C preliminary 64 × 64 semantic feature maps generated by the front-end prediction module, and its network structure is shown in FIG. 8. The system comprises 8 network layers; the first 7 are basic multi-scale context aggregation modules, each applying 3 × 3 × C dilated convolution kernels with different dilation factors for feature extraction: layer 1 applies plain convolution, layers 2 to 6 apply dilated convolution with different dilation factors, and layer 7 again applies plain convolution. Each convolution is followed by a pointwise truncation max(·, 0) (i.e. a ReLU), and the feature-map size is kept the same before and after each convolution. The last layer, layer 8, performs a 1 × 1 convolution. Finally, the semantic segmentation system is trained so that it outputs C refined 64 × 64 semantic feature maps.
Step 4.2: dilated convolution aggregates the multi-scale context information, as shown in FIGS. 5 and 6, supporting exponential expansion of the receptive field without loss of resolution or coverage. The dilated kernel covers an area of size (2^{i+2} − 1) × (2^{i+2} − 1) for dilation factor 2^i, and the receptive field grows to ((2^{i+2} − 1) − (2^{i+1} − 2)) × ((2^{i+2} − 1) − (2^{i+1} − 2)) = (2^{i+1} + 1) × (2^{i+1} + 1), where i = 0, 1, …, n − 2 indexes the dilation. Dilation stops once the receptive field roughly matches the input size, so the dilation factors of layers 2 to 6 are 1, 2, 4, 8 and 16 respectively, and the corresponding receptive fields are 5 × 5, 9 × 9, 17 × 17, 33 × 33 and 65 × 65.
Step 4.3: jointly train the semantic segmentation system with the front-end prediction module of step 3 on the Cityscapes data set, with the 'person' and 'rider' classes set as pedestrian and the remaining classes as background. The training parameters are set as follows: stochastic gradient descent (SGD) with mini-batch size 14; the initial learning rate is 0.001; after 40000 batches the learning rate is reduced to 1/10 of its value, i.e. 0.0001, and after 45000 batches it continues to decay to 0.00001, for a total of 60000 batches.
Step 5: fuse the result of the semantic segmentation system with the pedestrian candidates generated by the pedestrian candidate region generator.
Further, the specific steps of step 5 are as follows:
Step 5.1: generate a binary mask feature map from the semantic feature maps of step 4, where foreground pixels are set to 1 to represent the category of interest (e.g. pedestrians) and background pixels are set to 0.
Step 5.2: map the coordinate position information of the pedestrian candidate bounding boxes (b_x, b_y, b_w, b_h) generated by the pedestrian candidate region generator in step 2 onto the binary mask feature map, obtaining the pedestrian candidate bounding boxes on the binary mask feature map; rescale the candidate bounding boxes on all binary mask feature maps to the same size as the pedestrian kernel;
step 5.3, the soft fusion scale factor is used for weighting and calculating the pixels and the pedestrian kernels in the boundary box of the pedestrian candidate area on the binary mask feature map, and the calculation mode is as follows:
S_Result = S_YOLOv3 × S_ss, where S_ss = (1 / A_BB) Σ_{(i,j)∈BB} Mask(i, j) × Kernel(i, j)
wherein: S_ss is the score that the semantic segmentation feature map output by the semantic segmentation system is a pedestrian; S_YOLOv3 is the score that the pedestrian candidate region output by the pedestrian candidate region generator is a pedestrian; S_Result is the score that the final output is a pedestrian; A_BB is the area of the bounding box; Mask(i, j) is the binary mask pixel value at (i, j) in the image; Kernel(i, j) is the pedestrian kernel value at (i, j) in the image. The kernel tends to have higher values at the center than at the boundary, consistent with the object of interest lying at the center of a bounding box that fits it (e.g. a pedestrian), which has the effect of enhancing detection.
Step 5.4: according to S_Result, remove the bounding boxes of falsely detected pedestrians from the pedestrian candidate regions of step 2, finally obtaining the real pedestrian detection boxes.
Step 6: and outputting the detection image.
In the invention, the pedestrian detection system uses YOLOv3 as a pedestrian candidate region generator to generate a large number of pedestrian candidate boxes, effectively improving pedestrian detection precision. The front-end prediction module and semantic segmentation classify the input images at the pixel level, avoiding the coarse detection of regression-box networks such as YOLOv3, improving target detection capability, and effectively addressing the insufficient detection precision of a single network. Soft fusion combines the pedestrian candidate boxes with the semantic segmentation binary masks so that the result is output with fine detail; their combination also improves small-target detection and widens the range of application. The pedestrian candidate region generator and the semantic segmentation system are run in parallel as two networks to form the pedestrian detection framework, realizing rapid detection; the system can accurately, efficiently and robustly detect pedestrians and other target classes in various challenging scenes.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.
Claims (4)
1. A pedestrian detection method based on deep learning multi-network soft fusion is characterized in that: the method comprises the following steps:
step 1: inputting an image to be processed;
step 2: inputting the image in the step 1 into a YOLO v3 pedestrian candidate area generator of a network based on Darknet-53 to generate a pedestrian candidate area;
Step 3: inputting the image of step 1 into a front-end prediction module for dense prediction, and outputting C feature maps with higher resolution;
Step 4: inputting the C feature maps of step 3 into a semantic segmentation system, and outputting C binary mask feature maps containing context information;
Step 5: performing soft fusion on the result of the semantic segmentation system and the pedestrian candidate result generated by the pedestrian candidate region generator;
Step 6: outputting a detection image;
the step 3 comprises the following steps:
step 3.1, modifying the VGG-16 network, converting a complete connection layer into a convolutional layer, deleting the last but one and the last but one maximum pooling and cross-row layers in the VGG-16 network structure so as to obtain a front-end prediction module, performing initialization training by using parameters of an original classification network, and outputting a feature map with higher resolution;
Step 3.2, performing dense prediction on the image to be detected with the front-end prediction module, generating C preliminary 64 × 64 semantic feature maps.
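The resolution arithmetic behind steps 3.1–3.2 can be sanity-checked with a small sketch; the 512 × 512 input size is an assumption chosen to reproduce the stated 64 × 64 output:

```python
def frontend_output_size(input_hw, n_pools_kept=3):
    # VGG-16 contains 5 stride-2 max-pooling layers; deleting the last
    # two (step 3.1) leaves 3, i.e. an overall stride of 2**3 = 8, so
    # the output feature map is 1/8 the input resolution.
    stride = 2 ** n_pools_kept
    return input_hw[0] // stride, input_hw[1] // stride
```

Keeping the stride at 8 rather than 32 is what yields the "higher resolution" feature maps the claim refers to.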
2. The pedestrian detection method based on deep learning multi-network soft fusion as claimed in claim 1, wherein: the generation of the pedestrian candidate region by the YOLO v3 pedestrian candidate region generator in the step 2 comprises the following steps:
step 2.1, dividing an input picture into S multiplied by S cells, distributing 3 pedestrian candidate region boundary frames to be predicted for each cell, and training YOLO v3 to obtain coordinate position information and confidence corresponding to each predicted pedestrian candidate region boundary frame in the picture;
Step 2.2, fusing 3 scales in the YOLO v3 network, namely strides of 32, 16 and 8 with respect to the input layer, and detecting pedestrians independently on the fused feature map of each scale to obtain the coordinate position information of pedestrian candidate regions;
Next, clustering the data set with the K-means clustering algorithm to generate initial anchor box values, and assigning 3 anchor boxes at each scale, so that each cell predicts 3 pedestrian candidate bounding boxes corresponding to the 3 anchor boxes; 9 anchor boxes in total are thus assigned over the 3 scales;
Each cell outputs (1 + 4 + C) × 3 values: for each of the 3 anchor boxes, 4 predicted positioning values, 1 confidence score and C conditional class probabilities, where C = 1 since only pedestrians are classified, so 18 values are output in total; the coordinate position information of each pedestrian candidate bounding box is predicted by logistic regression:
bx = σ(tx) + cx, by = σ(ty) + cy, bw = pw·e^tw, bh = ph·e^th
wherein: σ(·) is the Sigmoid activation function; tx, ty, tw, th are the 4 predicted positioning values learned by the YOLO v3 network; pw, ph are the preset width and height of the prior (anchor) box; cx, cy are the coordinate offsets of the cell; bx, by, bw, bh are the finally predicted coordinate position information of the pedestrian candidate bounding box;
Step 2.3, during YOLO v3 training, enlarging the confidence acceptance range of the original YOLO v3 network, namely lowering the confidence threshold for detected pedestrian candidate regions, and generating a large number of pedestrian candidate regions, so that the candidate regions cover all pedestrians in the image to be detected.
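The per-cell output count and the logistic-regression decode of claim 2 can be sketched as follows; this is an illustrative reading of the standard YOLO v3 formulation, not an excerpt of the patented implementation:

```python
import math

def values_per_cell(num_classes=1, num_anchors=3):
    # (1 confidence + 4 box values + C class probabilities) per anchor
    return (1 + 4 + num_classes) * num_anchors

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    # YOLO v3 logistic-regression decode of one candidate bounding box:
    # bx = sigmoid(tx) + cx, by = sigmoid(ty) + cy,
    # bw = pw * exp(tw),     bh = ph * exp(th)
    sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))
    return (sigmoid(tx) + cx, sigmoid(ty) + cy,
            pw * math.exp(tw), ph * math.exp(th))
```

The sigmoid bounds the center offset inside the predicting cell, while the exponential lets the anchor dimensions (pw, ph) scale smoothly without sign constraints.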
3. The pedestrian detection method based on deep learning multi-network soft fusion as claimed in claim 1, wherein: the semantic segmentation system in the step 4 comprises the following steps:
Step 4.1, constructing a semantic segmentation system that aggregates multi-scale context information, its input being the C preliminary 64 × 64 semantic feature maps generated by the front-end prediction module; the semantic segmentation system comprises an 8-layer network whose first 7 layers form the basic multi-scale context aggregation module, extracting features with 3 × 3 × C dilated convolution kernels of different dilation factors: layer 1 applies ordinary convolution, layers 2 to 6 apply dilated convolution with different dilation factors, and layer 7 again applies ordinary convolution; pointwise truncation max(·, 0) follows each convolution, and the feature map size is kept the same before and after each convolution; the last layer, layer 8, applies a 1 × 1 convolution; finally, training the semantic segmentation system so that it outputs C refined 64 × 64 semantic feature maps;
Step 4.2, dilated convolution aggregating multi-scale context information supports expanding the receptive field exponentially without losing resolution or coverage: the i-th dilated layer uses dilation factor 2^(i−1) and yields a receptive field of size (2^(i+1) + 1) × (2^(i+1) + 1), where i denotes the i-th dilation; the dilation stops when the receptive field size substantially matches the input size, so the dilation factors of layers 2 to 6 are 1, 2, 4, 8 and 16, and the expanded receptive field sizes are 5 × 5, 9 × 9, 17 × 17, 33 × 33 and 65 × 65, respectively;
Step 4.3, jointly training the semantic segmentation system and the front-end prediction module of step 3 on the Cityscapes data set, setting the 'person' and 'rider' classes of Cityscapes as pedestrian and the remaining classes as background, and outputting C binary mask feature maps containing context information.
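The receptive-field progression stated in step 4.2 can be verified with a few lines; this is a sketch of the standard dilated-convolution arithmetic, assuming 3 × 3 kernels throughout:

```python
def receptive_fields(n_dilated=5, kernel=3):
    # Layer 1 is an ordinary 3x3 convolution (receptive field 3); the
    # i-th dilated layer uses dilation 2**(i-1), each enlarging the
    # receptive field by (kernel - 1) * dilation.
    rf, sizes = 3, []
    for i in range(1, n_dilated + 1):
        rf += (kernel - 1) * 2 ** (i - 1)
        sizes.append(rf)
    return sizes
```

Doubling the dilation at each layer is what makes the receptive field grow exponentially in depth while the feature map resolution stays fixed at 64 × 64.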
4. The pedestrian detection method based on deep learning multi-network soft fusion as claimed in any one of claims 1 to 3, wherein: the soft fusion of the step 5 comprises the following specific steps:
Step 5.1, generating a binary mask feature map from the semantic feature maps of step 4, with foreground pixels set to 1 to represent the categories of interest and background pixels set to 0;
Step 5.2, mapping the coordinate position information of the pedestrian candidate bounding boxes generated by the pedestrian candidate region generator in step 2 onto the binary mask feature map, obtaining pedestrian candidate bounding boxes on the binary mask feature map; scaling the pedestrian candidate bounding boxes on all binary mask feature maps to the same size as the pedestrian kernel;
Step 5.3, using the soft fusion scale factor to compute a weighted combination of the pixels inside the pedestrian candidate bounding box on the binary mask feature map and the pedestrian kernel, in the following manner:
S_SS = (1 / A_BB) · Σ_(i,j) Mask(i, j) · Kernel(i, j), summed over the bounding box; S_Result = S_YOLOv3 · S_SS
wherein: S_SS is the score that the semantic segmentation feature map output by the semantic segmentation system is a pedestrian; S_YOLOv3 is the score that the pedestrian candidate region result output by the pedestrian candidate region generator is a pedestrian; S_Result is the score that the final output result is a pedestrian; A_BB is the area of the bounding box; Mask(i, j) is the binary mask pixel value at (i, j) in the image; Kernel(i, j) is the pedestrian kernel value at (i, j) in the image; the kernel has higher pixel values at the center than at the boundary, which matches the object of interest lying at the center of the bounding box and enhances detections whose bounding boxes fit the object of interest;
Step 5.4, removing, according to the S_Result score, the bounding boxes of falsely detected pedestrians from the pedestrian candidate regions of step 2, finally obtaining the true pedestrian detection boxes.
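The box-to-mask mapping and rescaling of step 5.2 can be sketched as follows; the nearest-neighbor resize and the (x0, y0, x1, y1) box convention are assumptions for illustration:

```python
import numpy as np

def box_patch_resized(mask, box, kernel_hw):
    # Map a candidate box (x0, y0, x1, y1) onto the binary mask and
    # nearest-neighbor resize the patch to the pedestrian-kernel size,
    # so the patch and the kernel can be multiplied element-wise.
    x0, y0, x1, y1 = box
    patch = mask[y0:y1, x0:x1]
    kh, kw = kernel_hw
    rows = (np.arange(kh) * patch.shape[0] / kh).astype(int)
    cols = (np.arange(kw) * patch.shape[1] / kw).astype(int)
    return patch[rows][:, cols]
```

Resizing every candidate patch to one common kernel size is what lets a single pedestrian kernel be reused for all candidate boxes regardless of their original dimensions.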
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911284456.4A CN111027493B (en) | 2019-12-13 | 2019-12-13 | Pedestrian detection method based on deep learning multi-network soft fusion |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111027493A CN111027493A (en) | 2020-04-17 |
CN111027493B true CN111027493B (en) | 2022-05-20 |
Family
ID=70208997
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911284456.4A Active CN111027493B (en) | 2019-12-13 | 2019-12-13 | Pedestrian detection method based on deep learning multi-network soft fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111027493B (en) |
Families Citing this family (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111626156B (en) * | 2020-05-14 | 2023-05-09 | 电子科技大学 | Pedestrian generation method based on pedestrian mask and multi-scale discrimination |
CN111860160B (en) * | 2020-06-16 | 2023-12-12 | 国能信控互联技术有限公司 | Method for detecting wearing of mask indoors |
CN111783784A (en) * | 2020-06-30 | 2020-10-16 | 创新奇智(合肥)科技有限公司 | Method and device for detecting building cavity, electronic equipment and storage medium |
CN111931729B (en) * | 2020-09-23 | 2021-01-08 | 平安国际智慧城市科技股份有限公司 | Pedestrian detection method, device, equipment and medium based on artificial intelligence |
CN112329660A (en) * | 2020-11-10 | 2021-02-05 | 浙江商汤科技开发有限公司 | Scene recognition method and device, intelligent equipment and storage medium |
CN112633086B (en) * | 2020-12-09 | 2024-01-26 | 西安电子科技大学 | Near-infrared pedestrian monitoring method, system, medium and equipment based on multitasking EfficientDet |
CN112507904B (en) * | 2020-12-15 | 2022-06-03 | 重庆邮电大学 | Real-time classroom human body posture detection method based on multi-scale features |
CN112668560B (en) * | 2021-03-16 | 2021-07-30 | 中国矿业大学(北京) | Pedestrian detection method and system for pedestrian flow dense area |
CN112966697B (en) * | 2021-03-17 | 2022-03-11 | 西安电子科技大学广州研究院 | Target detection method, device and equipment based on scene semantics and storage medium |
CN113011389B (en) * | 2021-04-23 | 2022-07-26 | 电子科技大学 | Road pedestrian small target detection method based on clustering idea |
CN113536985A (en) * | 2021-06-29 | 2021-10-22 | 中国铁道科学研究院集团有限公司电子计算技术研究所 | Depth-of-field attention network-based passenger flow distribution statistical method and device |
CN114005268A (en) * | 2021-10-21 | 2022-02-01 | 广州通达汽车电气股份有限公司 | Bus interval scheduling method, device, equipment and storage medium |
CN116602663B (en) * | 2023-06-02 | 2023-12-15 | 深圳市震有智联科技有限公司 | Intelligent monitoring method and system based on millimeter wave radar |
CN117475389B (en) * | 2023-12-27 | 2024-03-15 | 山东海润数聚科技有限公司 | Pedestrian crossing signal lamp control method, system, equipment and storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106709568A (en) * | 2016-12-16 | 2017-05-24 | 北京工业大学 | RGB-D image object detection and semantic segmentation method based on deep convolution network |
CN108288075A (en) * | 2018-02-02 | 2018-07-17 | 沈阳工业大学 | A kind of lightweight small target detecting method improving SSD |
CN108875595A (en) * | 2018-05-29 | 2018-11-23 | 重庆大学 | A kind of Driving Scene object detection method merged based on deep learning and multilayer feature |
CN109063559A (en) * | 2018-06-28 | 2018-12-21 | 东南大学 | A kind of pedestrian detection method returned based on improvement region |
CN109508710A (en) * | 2018-10-23 | 2019-03-22 | 东华大学 | Based on the unmanned vehicle night-environment cognitive method for improving YOLOv3 network |
CN109543754A (en) * | 2018-11-23 | 2019-03-29 | 中山大学 | The parallel method of target detection and semantic segmentation based on end-to-end deep learning |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108416327B (en) * | 2018-03-28 | 2022-04-29 | 京东方科技集团股份有限公司 | Target detection method and device, computer equipment and readable storage medium |
CN108960340B (en) * | 2018-07-23 | 2021-08-31 | 电子科技大学 | Convolutional neural network compression method and face detection method |
CN109816100B (en) * | 2019-01-30 | 2020-09-01 | 中科人工智能创新技术研究院(青岛)有限公司 | Salient object detection method and device based on bidirectional fusion network |
Non-Patent Citations (1)
Title |
---|
Kou Dalei; Quan Jichuan; Zhang Zhongwei. Research progress of object detection frameworks based on deep learning. Computer Engineering and Applications. 2019, *
Also Published As
Publication number | Publication date |
---|---|
CN111027493A (en) | 2020-04-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111027493B (en) | Pedestrian detection method based on deep learning multi-network soft fusion | |
CN108416266B (en) | Method for rapidly identifying video behaviors by extracting moving object through optical flow | |
CN112396002A (en) | Lightweight remote sensing target detection method based on SE-YOLOv3 | |
CN111179217A (en) | Attention mechanism-based remote sensing image multi-scale target detection method | |
CN112966691B (en) | Multi-scale text detection method and device based on semantic segmentation and electronic equipment | |
JP2006209755A (en) | Method for tracing moving object inside frame sequence acquired from scene | |
CN111160249A (en) | Multi-class target detection method of optical remote sensing image based on cross-scale feature fusion | |
CN111274981B (en) | Target detection network construction method and device and target detection method | |
CN111401293B (en) | Gesture recognition method based on Head lightweight Mask scanning R-CNN | |
WO2020077940A1 (en) | Method and device for automatic identification of labels of image | |
CN109165658B (en) | Strong negative sample underwater target detection method based on fast-RCNN | |
CN114049381A (en) | Twin cross target tracking method fusing multilayer semantic information | |
CN111553414A (en) | In-vehicle lost object detection method based on improved Faster R-CNN | |
CN112381030B (en) | Satellite optical remote sensing image target detection method based on feature fusion | |
CN114998595B (en) | Weak supervision semantic segmentation method, semantic segmentation method and readable storage medium | |
CN114861842B (en) | Few-sample target detection method and device and electronic equipment | |
CN114882423A (en) | Truck warehousing goods identification method based on improved Yolov5m model and Deepsort | |
Ren et al. | Research on infrared small target segmentation algorithm based on improved mask R-CNN | |
CN111931572B (en) | Target detection method for remote sensing image | |
CN111738069A (en) | Face detection method and device, electronic equipment and storage medium | |
CN116245843A (en) | Vehicle paint defect detection and segmentation integrated method based on YOLOv5 frame | |
CN116091946A (en) | Yolov 5-based unmanned aerial vehicle aerial image target detection method | |
CN115953743A (en) | Parking space state identification method based on improved YOLO model | |
CN114693997A (en) | Image description generation method, device, equipment and medium based on transfer learning | |
CN114332754A (en) | Cascade R-CNN pedestrian detection method based on multi-metric detector |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |