CN111027493B - Pedestrian detection method based on deep learning multi-network soft fusion - Google Patents
- Publication number
- CN111027493B CN111027493B CN201911284456.4A CN201911284456A CN111027493B CN 111027493 B CN111027493 B CN 111027493B CN 201911284456 A CN201911284456 A CN 201911284456A CN 111027493 B CN111027493 B CN 111027493B
- Authority
- CN
- China
- Prior art keywords
- pedestrian
- pedestrian candidate
- image
- semantic segmentation
- candidate region
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
Abstract
The invention discloses a pedestrian detection method based on deep learning multi-network soft fusion, and relates to the technical fields of image processing, target detection and deep learning. It includes S1: inputting an image to be processed; S2: inputting the image to be processed into a YOLO v3 pedestrian candidate region generator based on the Darknet-53 network to generate pedestrian candidate regions; S3: inputting the image to be processed into a front-end prediction module and outputting C feature maps; S4: inputting the C feature maps into a semantic segmentation system and outputting C feature maps containing context information; S5: fusing the result of the semantic segmentation system with the pedestrian candidates generated by the pedestrian candidate region generator; S6: outputting the detection image. The invention combines a pedestrian candidate region generator and a semantic segmentation system through soft fusion, efficiently detects pedestrians in various challenging scenes, and at the same time improves the detection capability for small targets.
Description
Technical Field
The invention relates to the technical field of image processing, target detection and deep learning, in particular to a pedestrian detection method based on deep learning multi-network soft fusion.
Background
Object detection is an important problem in computer vision, which requires detecting the position of an object in a video or digital image. The target detection is widely applied to the fields of image detection, target recognition, video monitoring and the like. Pedestrian detection, a branch of the object detection problem, involves detecting specific human categories, and has wide application in the fields of automatic driving, person recognition, robotics, and the like.
The pedestrian detection algorithm aims to draw a bounding box in an image or video that accurately describes the position of a pedestrian in real time. However, this is difficult to achieve because of the tradeoff between accuracy and speed: low-resolution input enables rapid target detection but poor accuracy, while high-resolution input enables more accurate detection at a slower processing speed. When processing relatively simple image scenes with sharp foreground objects, a general pedestrian detection algorithm can achieve good results. But it is far more challenging to accurately describe a pedestrian's location in real time in certain circumstances, such as crowded scenes, occlusion by non-human objects, and varied pedestrian appearance (different poses or clothing styles).
Pedestrian detection can be divided into three main parts: generation of region proposals, feature extraction, and pedestrian confirmation. Conventional methods typically use sliding-window techniques to generate region proposals, Histogram of Oriented Gradients (HOG) or Scale-Invariant Feature Transform (SIFT) features as extractors, and Support Vector Machines (SVM) or adaptive boosting (AdaBoost) for pedestrian confirmation. With the development of deep learning, its application to pedestrian detection has grown, and mainstream methods fall into two types: object-proposal-based and regression-based. Object-proposal-based methods, also referred to as two-stage methods, first generate a set of candidate bounding boxes that may contain pedestrians using a Region Proposal module, then classify and regress the bounding boxes with a deep convolutional neural network; among these methods, improvements in detection performance are mainly built on the RCNN, Fast RCNN and Faster RCNN series. Regression-based target detection methods, also called one-stage methods, are much simpler by comparison: they need no candidate-region extraction or subsequent resampling and can achieve real-time detection to a certain extent, but with lower detection performance than two-stage methods. Regression-based pedestrian detection methods are mainly improved on the YOLO and SSD series to raise detection performance as far as possible while realizing real-time, efficient detection.
Disclosure of Invention
The invention aims to provide a pedestrian detection method based on deep learning multi-network soft fusion, which solves the problem that conventional methods, facing the tradeoff between pedestrian detection accuracy and speed, cannot accurately describe the position of a pedestrian in real time, and which improves detection capability while realizing real-time detection.
The technical scheme adopted by the invention is as follows:
a pedestrian detection method based on deep learning multi-network soft fusion comprises the following steps:
Step 1: input an image to be processed;
Step 2: input the image of step 1 into a YOLO v3 pedestrian candidate region generator based on the Darknet-53 network to generate pedestrian candidate regions;
Step 3: input the image of step 1 into a front-end prediction module and output C feature maps;
Step 4: input the C feature maps of step 3 into a semantic segmentation system and output C binary mask feature maps containing context information;
Step 5: soft-fuse the result of the semantic segmentation system with the pedestrian candidates generated by the pedestrian candidate region generator;
Step 6: output the detection image.
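Under the assumption that each branch is available as a callable, the six steps above can be sketched as follows (all function names are illustrative, not from the patent):

```python
def detect_pedestrians(image, candidate_generator, frontend, segmenter,
                       fuse, threshold=0.5):
    """Illustrative six-step pipeline: the YOLO v3 branch and the
    segmentation branch run independently and are combined at the end."""
    # Step 2: YOLO v3 branch -> list of (box, score) candidates
    candidates = candidate_generator(image)
    # Steps 3-4: front-end dense prediction, then context aggregation
    feature_maps = frontend(image)
    masks = segmenter(feature_maps)
    # Step 5: soft fusion of the two branch outputs
    detections = [(box, fuse(box, score, masks))
                  for box, score in candidates]
    # Step 6: keep only boxes whose fused score clears the threshold
    return [(box, s) for box, s in detections if s >= threshold]
```

The two branches are independent, so in practice they can run in parallel, with overall latency determined by the slower branch.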
Preferably, the step 2 comprises the steps of:
Step 2.1: divide the input picture into S × S cells, assign each cell 3 pedestrian candidate bounding boxes to predict, and train YOLO v3 to obtain the coordinate position information and confidence corresponding to each predicted pedestrian candidate bounding box;
Step 2.2: first, fuse 3 scales in the YOLO v3 network and detect pedestrians independently on the fused feature map of each scale to obtain the coordinate position information of the pedestrian candidate regions;
secondly, cluster the data set with the K-means clustering algorithm to generate initial anchor-box values, assigning 3 anchor boxes at each scale so that each cell predicts 3 pedestrian candidate bounding boxes corresponding to the 3 anchor boxes, for a total of 9 anchor boxes over the 3 scales;
each cell outputs (1 + 4 + C) × 3 values, where 4 is the number of predicted localization values, 1 the confidence score, 3 the number of anchor boxes, and C the number of conditional class probabilities; here C = 1 since only pedestrians are classified, so 18 values are output;
and predicting the coordinate position information of the boundary frame of each pedestrian candidate area by adopting logistic regression:
b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w · e^{t_w}
b_h = p_h · e^{t_h}
wherein: σ is the Sigmoid activation function; (t_x, t_y, t_w, t_h) are the 4 predicted localization values learned by the YOLO v3 network; p_w, p_h are the width and height of the preset prior box; c_x, c_y are the coordinate offsets of the cell; and (b_x, b_y, b_w, b_h) is the finally predicted coordinate position information of the pedestrian candidate bounding box;
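The decoding of the predicted localization values can be sketched directly; the exponential decoding of the box width and height from the prior box follows the standard YOLO v3 scheme (a minimal illustration, variable names ours):

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Decode YOLO v3 outputs (tx, ty, tw, th) into a bounding box:
    the center offsets are squashed by a sigmoid and added to the
    cell's top-left coordinate; width/height scale the prior box."""
    bx = sigmoid(tx) + cx
    by = sigmoid(ty) + cy
    bw = pw * math.exp(tw)
    bh = ph * math.exp(th)
    return bx, by, bw, bh
```

For example, all-zero predictions place the box center at the middle of the cell with the prior box's exact width and height.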
YOLO v3 trains (t_x, t_y, t_w, t_h) with a target loss function of the following form:

L = λ_coord Σ_i Σ_j 1_{ij}^{obj} [(t_x − t'_x)² + (t_y − t'_y)² + (t_w − t'_w)² + (t_h − t'_h)²] + Σ_i Σ_j 1_{ij}^{obj} (C_i − C'_i)² + λ_noobj Σ_i Σ_j 1_{ij}^{noobj} (C_i − C'_i)² + Σ_i 1_i^{obj} Σ_c (p_i(c) − p'_i(c))²

wherein: λ_coord and λ_noobj are constants used to balance the ratio of prediction boxes containing an object to prediction boxes without one; t'_x, t'_y, t'_w and t'_h are the label values; 1_{ij}^{obj} returns 1 if the corresponding real object (ground truth) lies in the j-th prediction box of the i-th grid cell, and 0 otherwise; 1_{ij}^{noobj} returns 0 if the j-th prediction box of the i-th grid cell has a corresponding ground truth, and 1 otherwise; p_i(c) is the probability of the object class, here pedestrian; C'_i is the product of the probability of containing an object and the intersection-over-union (IOU) of the predicted and label bounding boxes; and C_i is the predicted IOU value, namely the confidence;
Step 2.3: during YOLO v3 training, widen the confidence acceptance range of the original YOLO v3 network, i.e. lower the confidence threshold for detected pedestrian candidate regions, so that a large number of candidate regions are generated and the candidates cover all pedestrians in the image to be detected. The training parameters are set as follows: the initial learning rate is 0.001; after 40000 batches the learning rate is reduced to 1/10 of its value, i.e. 0.0001; after 45000 batches it is reduced again to 0.00001, for a total of 50000 batches.
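The piecewise learning-rate schedule in step 2.3 can be written as a simple step function (batch thresholds taken from the text):

```python
def learning_rate(batch):
    """Step decay used for YOLO v3 training: 1e-3 initially,
    divided by 10 after 40000 batches and again after 45000,
    for 50000 batches in total."""
    if batch < 40000:
        return 1e-3
    if batch < 45000:
        return 1e-4
    return 1e-5
```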
Preferably, the step 3 comprises the steps of:
Step 3.1: modify the VGG-16 network by converting the fully connected layers into convolutional layers and deleting the penultimate and antepenultimate max-pooling (Maxpool) and striding layers from the VGG-16 structure, thereby obtaining the front-end prediction module; initialize training with the parameters of the original classification network and output higher-resolution feature maps;
Step 3.2: use the front-end prediction module to perform dense prediction on the image to be detected, generating C preliminary 64 × 64 semantic feature maps.
Preferably, the step 4 comprises the steps of:
Step 4.1: construct a semantic segmentation system that aggregates multi-scale context information. Its input is the C preliminary 64 × 64 semantic feature maps generated by the front-end prediction module. The system comprises 8 network layers; the first 7 are basic multi-scale context aggregation modules, each applying 3 × 3 × C dilated convolution kernels with different dilation factors for feature extraction: layer 1 applies plain convolution, layers 2 to 6 apply dilated convolution with different dilation factors, and layer 7 again applies plain convolution. Each convolution is followed by a pointwise truncation max(·, 0) (i.e. a ReLU), and the feature-map size is kept the same before and after each convolution. The last layer, layer 8, performs a 1 × 1 convolution. Finally, the semantic segmentation system is trained so that it outputs C refined 64 × 64 semantic feature maps.
Step 4.2: dilated convolution aggregates multi-scale context information, supporting exponential expansion of the receptive field without loss of resolution or coverage. The dilated kernel covers an area of size (2^{i+2} − 1) × (2^{i+2} − 1) for dilation factor 2^i, and the receptive field grows to ((2^{i+2} − 1) − (2^{i+1} − 2)) × ((2^{i+2} − 1) − (2^{i+1} − 2)) = (2^{i+1} + 1) × (2^{i+1} + 1), where i = 0, 1, …, n − 2 indexes the dilation. Dilation stops once the receptive field roughly matches the input size, so the dilation factors of layers 2 to 6 are 1, 2, 4, 8 and 16 respectively, and the corresponding receptive fields are 5 × 5, 9 × 9, 17 × 17, 33 × 33 and 65 × 65.
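As a numerical check on step 4.2, the receptive field of a stack of 3 × 3 convolutions grows by twice the dilation factor at each layer; starting from the 3 × 3 field of the first plain convolution, the dilations 1, 2, 4, 8 and 16 reproduce the sizes quoted above (a sketch, not the patent's code):

```python
def receptive_fields(dilations, base=3):
    """Cumulative square receptive field of a stack of 3x3
    convolutions: a 3x3 convolution with dilation d widens the
    receptive field by 2*d."""
    fields = []
    rf = base  # layer 1: plain 3x3 convolution -> 3x3 field
    for d in dilations:
        rf += 2 * d
        fields.append(rf)
    return fields
```

Applied to dilations [1, 2, 4, 8, 16], this yields side lengths 5, 9, 17, 33 and 65, matching the receptive fields of layers 2 to 6.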
Step 4.3: jointly train the semantic segmentation system with the front-end prediction module of step 3 on the Cityscapes data set, with the 'person' and 'rider' classes set as pedestrian and all remaining classes as background. The training parameters are set as follows: stochastic gradient descent (SGD) with mini-batch size 14; the initial learning rate is 0.001; after 40000 batches the learning rate is reduced to 1/10 of its value, i.e. 0.0001, and after 45000 batches it continues to decay to 0.00001, for a total of 60000 batches.
Preferably, the specific steps of step 5 are:
Step 5.1: generate a binary mask feature map from the semantic feature maps of step 4, where foreground pixels are set to 1 to represent the category of interest (e.g. pedestrians) and background pixels are set to 0.
Step 5.2, generating the pedestrian candidate area bounding box (b) generated by the pedestrian candidate area generator in the step 2x,by,bw,bh) Mapping the coordinate position information to a binary mask feature map to obtain a pedestrian candidate region boundary box on the binary mask feature map; scaling the pedestrian candidate region bounding boxes on all the binary mask feature maps to have the same size as the pedestrian kernel;
step 5.3, the soft fusion scale factor is used for weighting and calculating the pixels and the pedestrian kernels in the boundary box of the pedestrian candidate area on the binary mask feature map, and the calculation mode is as follows:
S_Result = S_YOLOv3 × S_ss, where S_ss = (1 / A_BB) Σ_{(i,j)∈BB} Mask(i, j) × Kernel(i, j)
wherein: S_ss is the score that the semantic segmentation feature map output by the semantic segmentation system is a pedestrian; S_YOLOv3 is the score that the pedestrian candidate region output by the pedestrian candidate region generator is a pedestrian; S_Result is the score that the final output is a pedestrian; A_BB is the area of the bounding box; Mask(i, j) is the binary mask pixel value at (i, j) in the image; Kernel(i, j) is the pedestrian kernel value at (i, j) in the image. The kernel tends to have higher values at the center than at the boundary, consistent with the object of interest lying at the center of a bounding box that fits it (e.g. a pedestrian), which has the effect of enhancing detection.
Step 5.4: according to S_Result, remove the bounding boxes of falsely detected pedestrians from the pedestrian candidate regions of step 2, finally obtaining the real pedestrian detection boxes.
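A minimal sketch of the fusion in step 5.3, assuming S_ss is the kernel-weighted average of mask pixels over the (already rescaled) candidate box, consistent with the definitions of A_BB, Mask and Kernel above; all names are illustrative:

```python
def soft_fusion_score(s_yolo, mask_patch, kernel_patch):
    """Compute S_Result = S_YOLOv3 * S_ss, where S_ss is the
    kernel-weighted fraction of foreground pixels inside the
    candidate box (patches given as lists of pixel rows)."""
    area = len(mask_patch) * len(mask_patch[0])  # A_BB
    weighted = sum(m * k
                   for m_row, k_row in zip(mask_patch, kernel_patch)
                   for m, k in zip(m_row, k_row))
    s_ss = weighted / area
    return s_yolo * s_ss
```

A box whose mask patch is entirely background thus receives a fused score of 0 and is discarded in step 5.4, however confident the candidate generator was.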
In summary, due to the adoption of the technical scheme, the invention has the beneficial effects that:
1. the invention effectively improves the pedestrian detection precision by using YOLOv3 as a pedestrian candidate region generator to generate a large number of pedestrian candidate frames;
2. the invention uses the front-end prediction module and semantic segmentation to classify the input images at the pixel level, avoiding the coarse detection of regression-box networks such as YOLOv3, improving target detection capability, and effectively addressing the insufficient detection precision of a single network;
3. the method uses soft fusion to combine the pedestrian candidate boxes with the semantic segmentation binary masks, so that the result is output with fine detail; their combination also improves small-target detection, broadening the range of application;
4. the pedestrian detection framework is formed by running the pedestrian candidate region generator and the semantic segmentation system as two networks in parallel, realizing rapid detection; the system can accurately, efficiently and robustly detect pedestrians and other target classes in various challenging scenes.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.
FIG. 1 is a flow chart of a pedestrian detection system of the present invention;
FIG. 2 is a network structure of the candidate pedestrian generator YOLOv3 of FIG. 1 according to the present invention;
FIG. 3 is a coordinate transformation formula diagram of the Bounding Box in FIG. 2 according to the present invention;
FIG. 4 is a schematic diagram of the architecture of the front-end prediction module base network VGG-16 of FIG. 1 in accordance with the present invention;
FIG. 5 is a convolution structure of the 0 th dilation in the semantic segmentation system of FIG. 1 according to the present invention;
FIG. 6 is a convolution structure of the 1 st dilation in the semantic segmentation system of FIG. 1 according to the present invention;
FIG. 7 is a graph showing the results of the soft fusion of FIG. 1 according to the present invention;
FIG. 8 is a diagram illustrating a context network architecture of the semantic segmentation system of FIG. 1 according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the detailed description and specific examples, while indicating the preferred embodiment of the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
It is noted that relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The features and properties of the present invention are described in further detail below with reference to examples.
Example one
A pedestrian detection method based on deep learning multi-network soft fusion: a flow chart of one implementation is shown in FIG. 1. It comprises two parts run in parallel, pedestrian candidate region extraction and pedestrian semantic segmentation; the final pedestrian detection result of the whole system is refined through semantic segmentation, the operation speed of the system depends on the slower branch, and finally the two results are fused and output by soft fusion. The method specifically comprises the following steps:
step 1: and inputting an image to be processed.
Step 2: inputting the image in step 1 into a YOLOv3 pedestrian candidate area generator of the Darknet-53-based network in FIG. 2, and generating a pedestrian candidate area.
Further, the implementation steps of YOLOv3 in step 2 are as follows:
Step 2.1: first, 3 scales (13 × 13, 26 × 26 and 52 × 52) are fused in the YOLOv3 network, and detection is performed independently on each fused feature map, enhancing the detection of small targets. Secondly, the data set is clustered with the K-means algorithm to generate initial Anchor Box values; 3 Anchor Boxes are assigned at each scale, each cell predicts 3 Bounding Boxes corresponding to the 3 Anchor Boxes, and each cell outputs (1 + 4 + C) × 3 values (4 localization values, 1 confidence score and C conditional class probabilities). Finally, the 4-dimensional position values (t_x, t_y, t_w, t_h) are decoded by the following formulas, as shown in FIG. 3, to obtain the center coordinates (x, y) and the width and height (w, h) of the prediction box:
b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w · e^{t_w}
b_h = p_h · e^{t_h}
wherein: σ(t_x), σ(t_y) are the offsets of the box center relative to the coordinate of the top-left grid point of its cell; σ is the Sigmoid activation function; p_w, p_h are the width and height of the prior box, from which b_w and b_h are computed.
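The anchor initialization described in step 2.1 clusters ground-truth box dimensions; below is a simplified sketch using plain Euclidean k-means on (width, height) pairs. YOLO-style anchor clustering typically uses an IoU-based distance instead, so this is an illustrative stand-in, not the patent's procedure:

```python
import random

def kmeans_anchors(boxes, k, iters=50, seed=0):
    """Cluster (w, h) pairs into k anchor boxes by plain k-means."""
    rng = random.Random(seed)
    centers = rng.sample(boxes, k)
    for _ in range(iters):
        # assign every box to its nearest center
        clusters = [[] for _ in range(k)]
        for w, h in boxes:
            j = min(range(k),
                    key=lambda c: (w - centers[c][0]) ** 2
                                  + (h - centers[c][1]) ** 2)
            clusters[j].append((w, h))
        # recompute each center as the mean of its cluster
        centers = [
            (sum(w for w, _ in cl) / len(cl),
             sum(h for _, h in cl) / len(cl)) if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
    return centers
```

Running this over a data set's box sizes with k = 9 and splitting the resulting anchors across the 3 scales would mirror the 3-anchors-per-scale assignment described above.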
The multitask training target loss function of YOLO v3 is of the following form:

L = λ_coord Σ_i Σ_j 1_{ij}^{obj} [(t_x − t'_x)² + (t_y − t'_y)² + (t_w − t'_w)² + (t_h − t'_h)²] + Σ_i Σ_j 1_{ij}^{obj} (C_i − C'_i)² + λ_noobj Σ_i Σ_j 1_{ij}^{noobj} (C_i − C'_i)² + Σ_i 1_i^{obj} Σ_c (p_i(c) − p'_i(c))²

wherein: λ_coord and λ_noobj are used to balance the ratio of prediction boxes containing an object to prediction boxes without one; t'_x, t'_y, t'_w and t'_h are the label values; 1_{ij}^{obj} returns 1 if the corresponding ground truth lies in the j-th prediction box of the i-th grid cell, and 0 otherwise; 1_{ij}^{noobj} returns 0 if the j-th prediction box of the i-th grid cell has a corresponding ground truth, and 1 otherwise.
Step 2.2: using the fact that each pedestrian candidate region is associated with its localization-box coordinates and confidence score, first lower the confidence threshold of YOLO v3 candidate detection, then generate a large number of candidate regions, and finally detect all real pedestrians.
Step 2.3: load the pre-trained Darknet-53 model trained on ImageNet, delete the original classifier, then fine-tune on the Cityscapes data set. Training uses the Adam optimizer, and the training samples are expanded with data augmentation such as horizontal flipping and adjustments to angle, exposure, hue and saturation, enhancing the generalization of the model and reducing overfitting. The initial learning rate is set to 0.001; after 40000 batches the learning rate is reduced to 1/10 of its value, i.e. 0.0001, and after 45000 batches it continues to decrease to 0.00001.
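Among the augmentations in step 2.3, horizontal flipping is the simplest to sketch; note that a flip must also remap the x-coordinates of the label boxes (a hypothetical helper, not the patent's implementation):

```python
def hflip_with_boxes(image, boxes):
    """Horizontally flip an image (given as rows of pixels) and remap
    each (x, y, w, h) box, where (x, y) is the box's top-left corner."""
    width = len(image[0])
    flipped = [row[::-1] for row in image]
    # the new left edge is the mirror of the old right edge
    new_boxes = [(width - x - w, y, w, h) for x, y, w, h in boxes]
    return flipped, new_boxes
```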
Step 3: input the image of step 1 into the front-end prediction module and output C feature maps.
Further, the specific steps of step 3 are as follows:
Step 3.1: convert the fully connected layers in VGG-16 into convolutional layers, and delete the penultimate and antepenultimate Maxpool and striding layers from the VGG-16 network structure. Specifically, each of these pooling and striding layers is removed, and for each deleted layer the convolutions in all subsequent layers are dilated by a factor of 2; the convolutions in the final layers, following both deleted layers, are therefore dilated by a factor of 4 and initialized using the parameters of the original classification network, resulting in a higher-resolution output. Finally, feature maps are generated at a resolution of 64 × 64.
Step 3.2: adjust the VGG-16 network structure of FIG. 4 to obtain the front-end prediction module for dense prediction.
Step 4: input the C feature maps of step 3 into the semantic segmentation system and output C feature maps containing context information.
Further, the specific steps of step 4 are as follows:
Step 4.1: construct a semantic segmentation system that aggregates multi-scale context information. Its input is the C preliminary 64 × 64 semantic feature maps generated by the front-end prediction module, and its network structure is shown in FIG. 8. The system comprises 8 network layers; the first 7 are basic multi-scale context aggregation modules, each applying 3 × 3 × C dilated convolution kernels with different dilation factors for feature extraction: layer 1 applies plain convolution, layers 2 to 6 apply dilated convolution with different dilation factors, and layer 7 again applies plain convolution. Each convolution is followed by a pointwise truncation max(·, 0) (i.e. a ReLU), and the feature-map size is kept the same before and after each convolution. The last layer, layer 8, performs a 1 × 1 convolution. Finally, the semantic segmentation system is trained so that it outputs C refined 64 × 64 semantic feature maps.
Step 4.2: dilated convolution aggregates the multi-scale context information, as shown in FIGS. 5 and 6, supporting exponential expansion of the receptive field without loss of resolution or coverage. The dilated kernel covers an area of size (2^{i+2} − 1) × (2^{i+2} − 1) for dilation factor 2^i, and the receptive field grows to ((2^{i+2} − 1) − (2^{i+1} − 2)) × ((2^{i+2} − 1) − (2^{i+1} − 2)) = (2^{i+1} + 1) × (2^{i+1} + 1), where i = 0, 1, …, n − 2 indexes the dilation. Dilation stops once the receptive field roughly matches the input size, so the dilation factors of layers 2 to 6 are 1, 2, 4, 8 and 16 respectively, and the corresponding receptive fields are 5 × 5, 9 × 9, 17 × 17, 33 × 33 and 65 × 65.
Step 4.3: jointly train the semantic segmentation system with the front-end prediction module of step 3 on the Cityscapes data set, with the 'person' and 'rider' classes set as pedestrian and the remaining classes as background. The training parameters are set as follows: stochastic gradient descent (SGD) with mini-batch size 14; the initial learning rate is 0.001; after 40000 batches the learning rate is reduced to 1/10 of its value, i.e. 0.0001, and after 45000 batches it continues to decay to 0.00001, for a total of 60000 batches.
Step 5: fuse the result of the semantic segmentation system with the pedestrian candidates generated by the pedestrian candidate region generator.
Further, the specific steps of step 5 are as follows:
Step 5.1: generate a binary mask feature map from the semantic feature maps of step 4, where foreground pixels are set to 1 to represent the category of interest (e.g. pedestrians) and background pixels are set to 0.
Step 5.2: map the coordinate position information of the pedestrian candidate bounding boxes (b_x, b_y, b_w, b_h) generated by the pedestrian candidate region generator in step 2 onto the binary mask feature map, obtaining the pedestrian candidate bounding boxes on the binary mask feature map; rescale the candidate bounding boxes on all binary mask feature maps to the same size as the pedestrian kernel;
step 5.3, the soft fusion scale factor is used for weighting and calculating the pixels and the pedestrian kernels in the boundary box of the pedestrian candidate area on the binary mask feature map, and the calculation mode is as follows:
S_Result = S_YOLOv3 × S_ss, where S_ss = (1 / A_BB) Σ_{(i,j)∈BB} Mask(i, j) × Kernel(i, j)
wherein: S_ss is the score that the semantic segmentation feature map output by the semantic segmentation system is a pedestrian; S_YOLOv3 is the score that the pedestrian candidate region output by the pedestrian candidate region generator is a pedestrian; S_Result is the score that the final output is a pedestrian; A_BB is the area of the bounding box; Mask(i, j) is the binary mask pixel value at (i, j) in the image; Kernel(i, j) is the pedestrian kernel value at (i, j) in the image. The kernel tends to have higher values at the center than at the boundary, consistent with the object of interest lying at the center of a bounding box that fits it (e.g. a pedestrian), which has the effect of enhancing detection.
Step 5.4: according to S_Result, remove the bounding boxes of falsely detected pedestrians from the pedestrian candidate regions of step 2, finally obtaining the real pedestrian detection boxes.
Step 6: and outputting the detection image.
In the invention, the pedestrian detection system uses YOLOv3 as a pedestrian candidate region generator to generate a large number of pedestrian candidate boxes, effectively improving pedestrian detection precision. The front-end prediction module and semantic segmentation classify the input images at the pixel level, avoiding the coarse detection of regression-box networks such as YOLOv3, improving target detection capability, and effectively addressing the insufficient detection precision of a single network. Soft fusion combines the pedestrian candidate boxes with the semantic segmentation binary masks so that the result is output with fine detail; their combination also improves small-target detection and widens the range of application. The pedestrian candidate region generator and the semantic segmentation system are run in parallel as two networks to form the pedestrian detection framework, realizing rapid detection; the system can accurately, efficiently and robustly detect pedestrians and other target classes in various challenging scenes.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.
Claims (4)
1. A pedestrian detection method based on deep learning multi-network soft fusion is characterized in that: the method comprises the following steps:
step 1: inputting an image to be processed;
step 2: inputting the image in the step 1 into a YOLO v3 pedestrian candidate area generator of a network based on Darknet-53 to generate a pedestrian candidate area;
Step 3: inputting the image of step 1 into a front-end prediction module for dense prediction, and outputting C feature maps with higher resolution;
Step 4: inputting the C feature maps of step 3 into a semantic segmentation system, and outputting C binary mask feature maps containing context information;
Step 5: performing soft fusion on the result of the semantic segmentation system and the pedestrian candidate result generated by the pedestrian candidate region generator;
Step 6: outputting a detection image;
the step 3 comprises the following steps:
step 3.1, modifying the VGG-16 network, converting a complete connection layer into a convolutional layer, deleting the last but one and the last but one maximum pooling and cross-row layers in the VGG-16 network structure so as to obtain a front-end prediction module, performing initialization training by using parameters of an original classification network, and outputting a feature map with higher resolution;
Step 3.2, performing dense prediction on the image to be detected with the front-end prediction module, generating C preliminary 64 × 64 semantic feature maps.
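The resolution arithmetic behind steps 3.1–3.2 can be sanity-checked with a small sketch; the 512 × 512 input size is an assumption chosen to reproduce the stated 64 × 64 output:

```python
def frontend_output_size(input_hw, n_pools_kept=3):
    # VGG-16 contains 5 stride-2 max-pooling layers; deleting the last
    # two (step 3.1) leaves 3, i.e. an overall stride of 2**3 = 8, so
    # the output feature map is 1/8 the input resolution.
    stride = 2 ** n_pools_kept
    return input_hw[0] // stride, input_hw[1] // stride
```

Keeping the stride at 8 rather than 32 is what yields the "higher resolution" feature maps the claim refers to.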
2. The pedestrian detection method based on deep learning multi-network soft fusion as claimed in claim 1, wherein: the generation of the pedestrian candidate region by the YOLO v3 pedestrian candidate region generator in the step 2 comprises the following steps:
step 2.1, dividing an input picture into S multiplied by S cells, distributing 3 pedestrian candidate region boundary frames to be predicted for each cell, and training YOLO v3 to obtain coordinate position information and confidence corresponding to each predicted pedestrian candidate region boundary frame in the picture;
Step 2.2, fusing 3 scales in the YOLO v3 network, namely strides of 32, 16 and 8 with respect to the input layer, and detecting pedestrians independently on the fused feature map of each scale to obtain the coordinate position information of pedestrian candidate regions;
Next, clustering the data set with the K-means clustering algorithm to generate initial anchor box values, and assigning 3 anchor boxes at each scale, so that each cell predicts 3 pedestrian candidate bounding boxes corresponding to the 3 anchor boxes; 9 anchor boxes in total are thus assigned over the 3 scales;
Each cell outputs (1 + 4 + C) × 3 values: for each of the 3 anchor boxes, 4 predicted positioning values, 1 confidence score and C conditional class probabilities, where C = 1 since only pedestrians are classified, so 18 values are output in total; the coordinate position information of each pedestrian candidate bounding box is predicted by logistic regression:
bx = σ(tx) + cx, by = σ(ty) + cy, bw = pw·e^tw, bh = ph·e^th
wherein: σ(·) is the Sigmoid activation function; tx, ty, tw, th are the 4 predicted positioning values learned by the YOLO v3 network; pw, ph are the preset width and height of the prior (anchor) box; cx, cy are the coordinate offsets of the cell; bx, by, bw, bh are the finally predicted coordinate position information of the pedestrian candidate bounding box;
Step 2.3, during YOLO v3 training, enlarging the confidence acceptance range of the original YOLO v3 network, namely lowering the confidence threshold for detected pedestrian candidate regions, and generating a large number of pedestrian candidate regions, so that the candidate regions cover all pedestrians in the image to be detected.
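The per-cell output count and the logistic-regression decode of claim 2 can be sketched as follows; this is an illustrative reading of the standard YOLO v3 formulation, not an excerpt of the patented implementation:

```python
import math

def values_per_cell(num_classes=1, num_anchors=3):
    # (1 confidence + 4 box values + C class probabilities) per anchor
    return (1 + 4 + num_classes) * num_anchors

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    # YOLO v3 logistic-regression decode of one candidate bounding box:
    # bx = sigmoid(tx) + cx, by = sigmoid(ty) + cy,
    # bw = pw * exp(tw),     bh = ph * exp(th)
    sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))
    return (sigmoid(tx) + cx, sigmoid(ty) + cy,
            pw * math.exp(tw), ph * math.exp(th))
```

The sigmoid bounds the center offset inside the predicting cell, while the exponential lets the anchor dimensions (pw, ph) scale smoothly without sign constraints.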
3. The pedestrian detection method based on deep learning multi-network soft fusion as claimed in claim 1, wherein: the semantic segmentation system in the step 4 comprises the following steps:
Step 4.1, constructing a semantic segmentation system that aggregates multi-scale context information, its input being the C preliminary 64 × 64 semantic feature maps generated by the front-end prediction module; the semantic segmentation system comprises an 8-layer network whose first 7 layers form the basic multi-scale context aggregation module, extracting features with 3 × 3 × C dilated convolution kernels of different dilation factors: layer 1 applies ordinary convolution, layers 2 to 6 apply dilated convolution with different dilation factors, and layer 7 again applies ordinary convolution; pointwise truncation max(·, 0) follows each convolution, and the feature map size is kept the same before and after each convolution; the last layer, layer 8, applies a 1 × 1 convolution; finally, training the semantic segmentation system so that it outputs C refined 64 × 64 semantic feature maps;
Step 4.2, dilated convolution aggregating multi-scale context information supports expanding the receptive field exponentially without losing resolution or coverage: the i-th dilated layer uses dilation factor 2^(i−1) and yields a receptive field of size (2^(i+1) + 1) × (2^(i+1) + 1), where i denotes the i-th dilation; the dilation stops when the receptive field size substantially matches the input size, so the dilation factors of layers 2 to 6 are 1, 2, 4, 8 and 16, and the expanded receptive field sizes are 5 × 5, 9 × 9, 17 × 17, 33 × 33 and 65 × 65, respectively;
Step 4.3, jointly training the semantic segmentation system and the front-end prediction module of step 3 on the Cityscapes data set, setting the 'person' and 'rider' classes of Cityscapes as pedestrian and the remaining classes as background, and outputting C binary mask feature maps containing context information.
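The receptive-field progression stated in step 4.2 can be verified with a few lines; this is a sketch of the standard dilated-convolution arithmetic, assuming 3 × 3 kernels throughout:

```python
def receptive_fields(n_dilated=5, kernel=3):
    # Layer 1 is an ordinary 3x3 convolution (receptive field 3); the
    # i-th dilated layer uses dilation 2**(i-1), each enlarging the
    # receptive field by (kernel - 1) * dilation.
    rf, sizes = 3, []
    for i in range(1, n_dilated + 1):
        rf += (kernel - 1) * 2 ** (i - 1)
        sizes.append(rf)
    return sizes
```

Doubling the dilation at each layer is what makes the receptive field grow exponentially in depth while the feature map resolution stays fixed at 64 × 64.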
4. The pedestrian detection method based on deep learning multi-network soft fusion as claimed in any one of claims 1 to 3, wherein: the soft fusion of the step 5 comprises the following specific steps:
Step 5.1, generating a binary mask feature map from the semantic feature maps of step 4, with foreground pixels set to 1 to represent the categories of interest and background pixels set to 0;
Step 5.2, mapping the coordinate position information of the pedestrian candidate bounding boxes generated by the pedestrian candidate region generator in step 2 onto the binary mask feature map, obtaining pedestrian candidate bounding boxes on the binary mask feature map; scaling the pedestrian candidate bounding boxes on all binary mask feature maps to the same size as the pedestrian kernel;
Step 5.3, using the soft fusion scale factor to compute a weighted combination of the pixels inside the pedestrian candidate bounding box on the binary mask feature map and the pedestrian kernel, in the following manner:
S_SS = (1 / A_BB) · Σ_(i,j) Mask(i, j) · Kernel(i, j), summed over the bounding box; S_Result = S_YOLOv3 · S_SS
wherein: S_SS is the score that the semantic segmentation feature map output by the semantic segmentation system is a pedestrian; S_YOLOv3 is the score that the pedestrian candidate region result output by the pedestrian candidate region generator is a pedestrian; S_Result is the score that the final output result is a pedestrian; A_BB is the area of the bounding box; Mask(i, j) is the binary mask pixel value at (i, j) in the image; Kernel(i, j) is the pedestrian kernel value at (i, j) in the image; the kernel has higher pixel values at the center than at the boundary, which matches the object of interest lying at the center of the bounding box and enhances detections whose bounding boxes fit the object of interest;
Step 5.4, removing, according to the S_Result score, the bounding boxes of falsely detected pedestrians from the pedestrian candidate regions of step 2, finally obtaining the true pedestrian detection boxes.
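The box-to-mask mapping and rescaling of step 5.2 can be sketched as follows; the nearest-neighbor resize and the (x0, y0, x1, y1) box convention are assumptions for illustration:

```python
import numpy as np

def box_patch_resized(mask, box, kernel_hw):
    # Map a candidate box (x0, y0, x1, y1) onto the binary mask and
    # nearest-neighbor resize the patch to the pedestrian-kernel size,
    # so the patch and the kernel can be multiplied element-wise.
    x0, y0, x1, y1 = box
    patch = mask[y0:y1, x0:x1]
    kh, kw = kernel_hw
    rows = (np.arange(kh) * patch.shape[0] / kh).astype(int)
    cols = (np.arange(kw) * patch.shape[1] / kw).astype(int)
    return patch[rows][:, cols]
```

Resizing every candidate patch to one common kernel size is what lets a single pedestrian kernel be reused for all candidate boxes regardless of their original dimensions.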
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911284456.4A CN111027493B (en) | 2019-12-13 | 2019-12-13 | Pedestrian detection method based on deep learning multi-network soft fusion |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111027493A CN111027493A (en) | 2020-04-17 |
CN111027493B true CN111027493B (en) | 2022-05-20 |
Family
ID=70208997
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911284456.4A Active CN111027493B (en) | 2019-12-13 | 2019-12-13 | Pedestrian detection method based on deep learning multi-network soft fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111027493B (en) |
Families Citing this family (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111626156B (en) * | 2020-05-14 | 2023-05-09 | 电子科技大学 | Pedestrian generation method based on pedestrian mask and multi-scale discrimination |
CN111860160B (en) * | 2020-06-16 | 2023-12-12 | 国能信控互联技术有限公司 | Method for detecting wearing of mask indoors |
CN111783784A (en) * | 2020-06-30 | 2020-10-16 | 创新奇智(合肥)科技有限公司 | Method and device for detecting building cavity, electronic equipment and storage medium |
CN111931729B (en) * | 2020-09-23 | 2021-01-08 | 平安国际智慧城市科技股份有限公司 | Pedestrian detection method, device, equipment and medium based on artificial intelligence |
CN112329660A (en) * | 2020-11-10 | 2021-02-05 | 浙江商汤科技开发有限公司 | Scene recognition method and device, intelligent equipment and storage medium |
CN112633086B (en) * | 2020-12-09 | 2024-01-26 | 西安电子科技大学 | Near-infrared pedestrian monitoring method, system, medium and equipment based on multitasking EfficientDet |
CN112507904B (en) * | 2020-12-15 | 2022-06-03 | 重庆邮电大学 | Real-time classroom human body posture detection method based on multi-scale features |
CN112668560B (en) * | 2021-03-16 | 2021-07-30 | 中国矿业大学(北京) | Pedestrian detection method and system for pedestrian flow dense area |
CN112966697B (en) * | 2021-03-17 | 2022-03-11 | 西安电子科技大学广州研究院 | Target detection method, device and equipment based on scene semantics and storage medium |
CN113011389B (en) * | 2021-04-23 | 2022-07-26 | 电子科技大学 | Road pedestrian small target detection method based on clustering idea |
CN113536985A (en) * | 2021-06-29 | 2021-10-22 | 中国铁道科学研究院集团有限公司电子计算技术研究所 | Depth-of-field attention network-based passenger flow distribution statistical method and device |
CN114005268A (en) * | 2021-10-21 | 2022-02-01 | 广州通达汽车电气股份有限公司 | Bus interval scheduling method, device, equipment and storage medium |
CN116602663B (en) * | 2023-06-02 | 2023-12-15 | 深圳市震有智联科技有限公司 | Intelligent monitoring method and system based on millimeter wave radar |
CN117475389B (en) * | 2023-12-27 | 2024-03-15 | 山东海润数聚科技有限公司 | Pedestrian crossing signal lamp control method, system, equipment and storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106709568A (en) * | 2016-12-16 | 2017-05-24 | 北京工业大学 | RGB-D image object detection and semantic segmentation method based on deep convolution network |
CN108288075A (en) * | 2018-02-02 | 2018-07-17 | 沈阳工业大学 | A kind of lightweight small target detecting method improving SSD |
CN108875595A (en) * | 2018-05-29 | 2018-11-23 | 重庆大学 | A kind of Driving Scene object detection method merged based on deep learning and multilayer feature |
CN109063559A (en) * | 2018-06-28 | 2018-12-21 | 东南大学 | A kind of pedestrian detection method returned based on improvement region |
CN109508710A (en) * | 2018-10-23 | 2019-03-22 | 东华大学 | Based on the unmanned vehicle night-environment cognitive method for improving YOLOv3 network |
CN109543754A (en) * | 2018-11-23 | 2019-03-29 | 中山大学 | The parallel method of target detection and semantic segmentation based on end-to-end deep learning |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108416327B (en) * | 2018-03-28 | 2022-04-29 | 京东方科技集团股份有限公司 | Target detection method and device, computer equipment and readable storage medium |
CN108960340B (en) * | 2018-07-23 | 2021-08-31 | 电子科技大学 | Convolutional neural network compression method and face detection method |
CN109816100B (en) * | 2019-01-30 | 2020-09-01 | 中科人工智能创新技术研究院(青岛)有限公司 | Salient object detection method and device based on bidirectional fusion network |
Non-Patent Citations (1)
Title |
---|
Kou Dalei; Quan Jichuan; Zhang Zhongwei. Research progress of object detection frameworks based on deep learning. Computer Engineering and Applications. 2019, *
Also Published As
Publication number | Publication date |
---|---|
CN111027493A (en) | 2020-04-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111027493B (en) | Pedestrian detection method based on deep learning multi-network soft fusion | |
CN108416266B (en) | Method for rapidly identifying video behaviors by extracting moving object through optical flow | |
CN112396002A (en) | Lightweight remote sensing target detection method based on SE-YOLOv3 | |
CN111179217A (en) | Attention mechanism-based remote sensing image multi-scale target detection method | |
CN112966691B (en) | Multi-scale text detection method and device based on semantic segmentation and electronic equipment | |
JP2006209755A (en) | Method for tracing moving object inside frame sequence acquired from scene | |
CN111160249A (en) | Multi-class target detection method of optical remote sensing image based on cross-scale feature fusion | |
CN111274981B (en) | Target detection network construction method and device and target detection method | |
CN111401293B (en) | Gesture recognition method based on Head lightweight Mask scanning R-CNN | |
WO2020077940A1 (en) | Method and device for automatic identification of labels of image | |
CN109165658B (en) | Strong negative sample underwater target detection method based on fast-RCNN | |
CN114049381A (en) | Twin cross target tracking method fusing multilayer semantic information | |
CN111553414A (en) | In-vehicle lost object detection method based on improved Faster R-CNN | |
CN112381030B (en) | Satellite optical remote sensing image target detection method based on feature fusion | |
CN114998595B (en) | Weak supervision semantic segmentation method, semantic segmentation method and readable storage medium | |
CN114861842B (en) | Few-sample target detection method and device and electronic equipment | |
CN114882423A (en) | Truck warehousing goods identification method based on improved Yolov5m model and Deepsort | |
Ren et al. | Research on infrared small target segmentation algorithm based on improved mask R-CNN | |
CN111931572B (en) | Target detection method for remote sensing image | |
CN111738069A (en) | Face detection method and device, electronic equipment and storage medium | |
CN116245843A (en) | Vehicle paint defect detection and segmentation integrated method based on YOLOv5 frame | |
CN116091946A (en) | Yolov 5-based unmanned aerial vehicle aerial image target detection method | |
CN115953743A (en) | Parking space state identification method based on improved YOLO model | |
CN114693997A (en) | Image description generation method, device, equipment and medium based on transfer learning | |
CN114332754A (en) | Cascade R-CNN pedestrian detection method based on multi-metric detector |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |