CN111027493B - Pedestrian detection method based on deep learning multi-network soft fusion - Google Patents

Pedestrian detection method based on deep learning multi-network soft fusion

Info

Publication number
CN111027493B
CN111027493B (Application No. CN201911284456.4A)
Authority
CN
China
Prior art keywords
pedestrian
pedestrian candidate
image
semantic segmentation
candidate region
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911284456.4A
Other languages
Chinese (zh)
Other versions
CN111027493A (en)
Inventor
袁国慧
叶涛
王卓然
彭真明
潘为年
柳杨
孙煜成
周宇
杨博文
张文超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN201911284456.4A priority Critical patent/CN111027493B/en
Publication of CN111027493A publication Critical patent/CN111027493A/en
Application granted granted Critical
Publication of CN111027493B publication Critical patent/CN111027493B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Abstract

The invention discloses a pedestrian detection method based on deep learning multi-network soft fusion, and relates to the technical fields of image processing, target detection and deep learning. It includes S1: inputting an image to be processed; S2: inputting the image to be processed into a Darknet-53-based YOLO v3 pedestrian candidate region generator to generate pedestrian candidate regions; S3: inputting the image to be processed into a front-end prediction module and outputting C feature maps; S4: inputting the C feature maps into a semantic segmentation system and outputting C feature maps containing context information; S5: fusing the result of the semantic segmentation system with the pedestrian candidate results generated by the pedestrian candidate region generator; S6: outputting the detection image. The invention combines a pedestrian candidate region generator and a semantic segmentation system through soft fusion, efficiently detects pedestrians in various challenging scenes, and at the same time improves the detection capability for small targets.

Description

Pedestrian detection method based on deep learning multi-network soft fusion
Technical Field
The invention relates to the technical field of image processing, target detection and deep learning, in particular to a pedestrian detection method based on deep learning multi-network soft fusion.
Background
Object detection is an important problem in computer vision, which requires detecting the position of an object in a video or digital image. The target detection is widely applied to the fields of image detection, target recognition, video monitoring and the like. Pedestrian detection, a branch of the object detection problem, involves detecting specific human categories, and has wide application in the fields of automatic driving, person recognition, robotics, and the like.
The pedestrian detection algorithm aims to draw bounding boxes in an image or video that accurately describe the positions of pedestrians in real time. This is difficult to achieve because of the trade-off between accuracy and speed: low-resolution input allows fast target detection but poor accuracy, while high-resolution input allows more accurate detection but slower processing. A general pedestrian detection algorithm can achieve good results when processing relatively simple image scenes with clear foreground objects, but accurately describing pedestrian positions in real time is far more challenging in certain circumstances, such as crowded scenes, occlusion by non-human objects, and varying pedestrian appearance (different poses or clothing styles).
Pedestrian detection can be divided into three main parts: region proposal generation, feature extraction and pedestrian confirmation. Conventional methods typically use sliding-window techniques to generate region proposals, Histogram of Oriented Gradients (HOG) or Scale-Invariant Feature Transform (SIFT) features as feature extractors, and Support Vector Machines (SVM) or adaptive boosting (AdaBoost) as pedestrian confirmation methods. With the development of deep learning, its application to pedestrian detection has grown, and mainstream methods fall into two categories: object-proposal-based and regression-based methods. Object-proposal-based methods, also called second-order (two-stage) methods, first use a Region Proposal module to generate a set of candidate bounding boxes that may contain pedestrians, and then classify and regress these bounding boxes with a deep convolutional neural network; most pedestrian detection methods of this type improve detection performance on the basis of the R-CNN, Fast R-CNN and Faster R-CNN series. Regression-based target detection methods, also called first-order (one-stage) methods, are much simpler: they require neither candidate region extraction nor subsequent resampling, and can achieve real-time detection to a certain extent, but their detection performance is lower than that of second-order methods. Regression-based pedestrian detection methods are mainly built on the YOLO and SSD series and improve them to raise detection performance as much as possible while realizing real-time, efficient detection.
Disclosure of Invention
The invention aims to: provide a pedestrian detection method based on deep learning multi-network soft fusion, which addresses the problem that conventional methods, constrained by the trade-off between pedestrian detection accuracy and speed, cannot accurately describe pedestrian positions in real time, and which improves detection capability while still achieving real-time detection.
The technical scheme adopted by the invention is as follows:
a pedestrian detection method based on deep learning multi-network soft fusion comprises the following steps:
Step 1: input an image to be processed;
Step 2: input the image from step 1 into a Darknet-53-based YOLO v3 pedestrian candidate region generator to generate pedestrian candidate regions;
Step 3: input the image from step 1 into a front-end prediction module and output C feature maps;
Step 4: input the C feature maps from step 3 into a semantic segmentation system and output C binary mask feature maps containing context information;
Step 5: soft-fuse the result of the semantic segmentation system with the pedestrian candidate results generated by the pedestrian candidate region generator;
Step 6: output the detection image.
Preferably, the step 2 comprises the steps of:
Step 2.1: divide the input picture into S × S cells, assign each cell 3 pedestrian candidate region bounding boxes to be predicted, and train YOLO v3 to obtain the coordinate position information and confidence corresponding to each predicted pedestrian candidate region bounding box;
Step 2.2: fuse 3 scales in the YOLOv3 network and detect pedestrians independently on the fused feature maps of each scale to obtain the coordinate position information of pedestrian candidate regions;
Secondly, cluster the data set with a K-means clustering algorithm to generate initial anchor box values, assign 3 anchor boxes at each scale, and let each cell predict 3 pedestrian candidate region bounding boxes corresponding to the 3 anchor boxes, so that 9 anchor boxes in total are assigned over the 3 scales;
Each cell outputs (1+4+C) × 3 values, where 4 denotes the 4 predicted positioning values, 1 denotes the confidence score, 3 denotes the 3 anchor boxes, and C denotes the C conditional class probabilities; here C = 1 (only pedestrians are classified), so 18 values are output in total;
The coordinate position information of each pedestrian candidate region bounding box is predicted by logistic regression:
b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w · exp(t_w)
b_h = p_h · exp(t_h)
wherein: σ is the Sigmoid activation function, (t_x, t_y, t_w, t_h) are the 4 predicted positioning values learned by the YOLO v3 network, p_w and p_h are the preset width and height of the prior box, c_x and c_y are the coordinate offsets of the cell, and (b_x, b_y, b_w, b_h) is the finally predicted coordinate position information of the pedestrian candidate region bounding box;
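For illustration only, the following is a minimal Python sketch (not part of the claimed method) that applies the above decoding formulas; the example grid offsets and anchor size are assumed values.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def decode_box(t_x, t_y, t_w, t_h, c_x, c_y, p_w, p_h):
    """Decode one prediction into (b_x, b_y, b_w, b_h) per the formulas above:
    the box centre is the cell offset plus a sigmoid-squashed shift, and the
    width/height rescale the prior (anchor) box exponentially."""
    b_x = sigmoid(t_x) + c_x    # centre x, in grid-cell units
    b_y = sigmoid(t_y) + c_y    # centre y, in grid-cell units
    b_w = p_w * np.exp(t_w)     # width from the prior box width p_w
    b_h = p_h * np.exp(t_h)     # height from the prior box height p_h
    return b_x, b_y, b_w, b_h

# one cell of a 13 x 13 grid with an assumed 116 x 90 anchor prior
print(decode_box(0.2, -0.1, 0.3, 0.1, c_x=6, c_y=4, p_w=116, p_h=90))
```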
The target loss function used to train (t_x, t_y, t_w, t_h) in YOLO v3, rendered as an image in the original, combines a coordinate regression term over (t_x, t_y, t_w, t_h) weighted by λ_coord, a confidence term in which prediction boxes without objects are weighted by λ_noobj, and a class probability term;
wherein: λ_coord and λ_noobj are constants used to balance the ratio of prediction boxes containing objects to prediction boxes containing no objects; t'_x, t'_y, t'_w and t'_h denote the label values; 1_ij^obj returns 1 if the corresponding real object (ground truth) falls in the j-th prediction box of the i-th grid point and 0 otherwise; 1_ij^noobj returns 0 if the j-th prediction box of the i-th grid point has a corresponding ground truth and 1 otherwise; p_i(c) is the probability of the object class, here pedestrian; c'_i is the product of the probability of containing an object and the intersection-over-union (IOU) of the predicted bounding box and the label bounding box, i.e. c'_i = Pr(object) × IOU(pred, truth); and c_i is the IOU value between the predicted bounding box and the label bounding box, i.e. the confidence;
Step 2.3: during YOLO v3 training, enlarge the confidence acceptance range in the original YOLO v3 network, i.e. lower the confidence threshold for detected pedestrian candidate regions, and generate a large number of pedestrian candidate regions, ensuring that the candidate regions cover all pedestrians in the image to be detected. The training parameters are set as follows: the initial learning rate is set to 0.001; after 40000 batches the learning rate is reduced to 1/10, i.e. to 0.0001, and after 45000 batches it is further reduced to 0.00001, for a total of 50000 batches.
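As a minimal sketch of the idea in step 2.3, the snippet below keeps every YOLO v3 prediction whose confidence exceeds a deliberately low threshold; the threshold value 0.05 and the detection format ((b_x, b_y, b_w, b_h), confidence) are assumptions made for illustration, since the patent only states that the threshold is lowered.

```python
def generate_candidates(raw_detections, conf_thresh=0.05):
    """Keep every prediction whose confidence exceeds a low threshold so the
    candidate set covers all pedestrians; false positives are suppressed
    later by the soft fusion of step 5."""
    return [(box, conf) for box, conf in raw_detections if conf >= conf_thresh]

raw = [((120, 80, 40, 90), 0.92), ((300, 60, 35, 85), 0.18), ((10, 10, 20, 30), 0.02)]
print(generate_candidates(raw))  # the 0.18 detection survives as a candidate
```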
Preferably, the step 3 comprises the steps of:
Step 3.1: modify the VGG-16 network by converting the fully connected layers into convolutional layers and deleting the second-to-last and third-to-last max-pooling (Maxpool) layers and their striding layers from the VGG-16 structure, so as to obtain the front-end prediction module; perform initialization training with the parameters of the original classification network, and output feature maps of higher resolution;
Step 3.2: use the front-end prediction module to perform dense prediction on the image to be detected and generate C 64 × 64 preliminary semantic feature maps.
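The following PyTorch sketch illustrates one possible reading of steps 3.1-3.2: a VGG-16-style feature extractor with only the first three max-pooling layers kept and the former fully connected layers re-expressed as convolutions, so that a 512 × 512 input yields 64 × 64 feature maps. The input size, the reduced classifier width (1024 instead of VGG's 4096) and the omission of the dilated convolutions applied after the deleted pooling layers are simplifications of this sketch, not details taken from the patent.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, n_convs):
    """A VGG-style block of n_convs 3x3 convolutions, each followed by ReLU."""
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                   nn.ReLU(inplace=True)]
    return layers

class FrontEnd(nn.Module):
    """Front-end prediction module (sketch): VGG-16-style features with the
    last two pooling stages removed, plus a convolutionalized classifier."""
    def __init__(self, num_maps=1):
        super().__init__()
        self.features = nn.Sequential(
            *conv_block(3, 64, 2),    nn.MaxPool2d(2),  # /2
            *conv_block(64, 128, 2),  nn.MaxPool2d(2),  # /4
            *conv_block(128, 256, 3), nn.MaxPool2d(2),  # /8 (last kept pooling)
            *conv_block(256, 512, 3),                   # pooling removed here
            *conv_block(512, 512, 3),                   # and here
        )
        self.classifier = nn.Sequential(                # former fc layers as convolutions
            nn.Conv2d(512, 1024, kernel_size=7, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(1024, 1024, kernel_size=1),           nn.ReLU(inplace=True),
            nn.Conv2d(1024, num_maps, kernel_size=1),       # C preliminary semantic maps
        )

    def forward(self, x):
        return self.classifier(self.features(x))

with torch.no_grad():
    feats = FrontEnd(num_maps=1)(torch.randn(1, 3, 512, 512))
print(feats.shape)  # torch.Size([1, 1, 64, 64])
```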
Preferably, the step 4 comprises the steps of:
Step 4.1: construct a semantic segmentation system by aggregating multi-scale context information. Its input is the C 64 × 64 preliminary semantic feature maps generated by the front-end prediction module. The semantic segmentation system comprises 8 network layers; the first 7 layers are basic aggregated multi-scale context information modules, and 3 × 3 × C dilated convolution kernels with different expansion factors are applied to the first 7 layers for feature extraction: layer 1 is convolved directly, layers 2 to 6 are convolved with dilation using different expansion factors, and layer 7 is convolved directly. Each convolution is followed by a pointwise truncation max(·, 0), and the feature map size is kept the same before and after each convolution. The last layer, layer 8, performs a 1 × 1 convolution. Finally, the semantic segmentation system is trained so that it outputs C 64 × 64 refined semantic feature maps.
Step 4.2: dilated convolution aggregates multi-scale context information and supports exponential expansion of the receptive field without losing resolution or coverage. The size of the expansion area is (2^(i+2) − 1) × (2^(i+2) − 1) with an expansion factor of 2^i, and the size of the receptive field is ((2^(i+2) − 1) − (2^(i+1) − 2)) × ((2^(i+2) − 1) − (2^(i+1) − 2)), where i = 0, 1, ..., n−2 denotes the i-th dilation. The expansion is stopped when the size of the receptive field substantially matches the input size, so the expansion factors of layers 2 to 6 are 1, 2, 4, 8 and 16, respectively, and the corresponding expanded receptive fields are 5 × 5, 9 × 9, 17 × 17, 33 × 33 and 65 × 65.
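The receptive-field sizes listed above can be checked with a few lines of Python; this is only a sanity check of the stated numbers (a 3 × 3 convolution with dilation d enlarges the receptive field by 2d at stride 1), not code from the patent.

```python
rf = 3                      # receptive field after the plain 3x3 convolution of layer 1
for d in (1, 2, 4, 8, 16):  # expansion (dilation) factors of layers 2 to 6
    rf += 2 * d             # each dilated 3x3 convolution adds 2*d pixels
    print(f"dilation {d:2d} -> receptive field {rf} x {rf}")
# dilation  1 -> receptive field 5 x 5
# dilation  2 -> receptive field 9 x 9
# dilation  4 -> receptive field 17 x 17
# dilation  8 -> receptive field 33 x 33
# dilation 16 -> receptive field 65 x 65
```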
Step 4.3: jointly train the semantic segmentation system and the front-end prediction module of step 3 on the Cityscapes data set, with the 'person' and 'rider' classes in Cityscapes set as pedestrians and the remaining classes set as background. The training parameters are set as follows: stochastic gradient descent (SGD) with a mini-batch size of 14 is used; the initial learning rate is set to 0.001, after 40000 batches the learning rate is reduced to 1/10, i.e. to 0.0001, and after 45000 batches it continues to decay to 0.00001, for a total of 60000 batches.
Preferably, the specific steps of step 5 are:
Step 5.1: generate binary mask feature maps from the semantic feature maps of step 4, where foreground pixels are set to 1 to represent the categories of interest (e.g. pedestrians) and background pixels are set to 0.
Step 5.2: map the coordinate position information of the pedestrian candidate region bounding boxes (b_x, b_y, b_w, b_h) generated by the pedestrian candidate region generator in step 2 onto the binary mask feature maps to obtain the pedestrian candidate region bounding boxes on the binary mask feature maps; scale the pedestrian candidate region bounding boxes on all the binary mask feature maps to the same size as the pedestrian kernel.
Step 5.3: use the soft-fusion scale factor to weight the pixels inside the pedestrian candidate region bounding box on the binary mask feature map with the pedestrian kernel, computed as follows:
S_ss = (1 / A_BB) × Σ_{(i,j)∈BB} Mask(i, j) × Kernel(i, j)
S_Result = S_YOLOv3 × S_ss
wherein: S_ss is the score that the result of the semantic segmentation feature map output by the semantic segmentation system is a pedestrian; S_YOLOv3 is the score that the pedestrian candidate region result output by the pedestrian candidate region generator is a pedestrian; S_Result is the score that the final output result is a pedestrian; A_BB is the area of the bounding box; Mask(i, j) is the binary mask pixel value at (i, j) in the image; Kernel(i, j) is the pedestrian kernel value at (i, j) in the image. The kernel tends to have higher values at the center than at the boundary, which matches the object of interest (e.g. a pedestrian) lying at the center of a bounding box that fits it, and thus enhances detection.
Step 5.4: according to S_Result, remove the bounding boxes of falsely detected pedestrians from the pedestrian candidate regions of step 2 to finally obtain the true pedestrian detection boxes.
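A small NumPy sketch of the soft fusion of steps 5.1-5.4 follows; the Gaussian form of the pedestrian kernel, the kernel size, the box convention (top-left corner plus width and height, in mask pixels) and the final threshold are assumptions made for illustration, since the patent specifies only that the kernel is higher at the centre than at the boundary.

```python
import numpy as np

def pedestrian_kernel(h, w, sigma=0.4):
    """Hypothetical pedestrian kernel: larger values at the centre of the
    scaled bounding box than at its border."""
    ys = np.linspace(-1.0, 1.0, h)[:, None]
    xs = np.linspace(-1.0, 1.0, w)[None, :]
    return np.exp(-(xs ** 2 + ys ** 2) / (2.0 * sigma ** 2))

def soft_fusion_score(mask, box, s_yolo, kernel_size=(32, 16)):
    """S_Result = S_YOLOv3 * S_ss, where S_ss is the kernel-weighted average of
    the binary mask inside the candidate bounding box (steps 5.2-5.3)."""
    x, y, w, h = box                                   # (b_x, b_y, b_w, b_h) as top-left + size
    crop = mask[y:y + h, x:x + w].astype(np.float32)
    kh, kw = kernel_size                               # scale the box crop to the kernel size
    rows = np.arange(kh) * crop.shape[0] // kh         # nearest-neighbour resampling
    cols = np.arange(kw) * crop.shape[1] // kw
    crop = crop[rows][:, cols]
    s_ss = float((crop * pedestrian_kernel(kh, kw)).sum() / (kh * kw))
    return s_yolo * s_ss

# toy example: a 64x64 mask with one pedestrian blob and one YOLO candidate box
mask = np.zeros((64, 64), dtype=np.uint8)
mask[20:52, 28:40] = 1
s_result = soft_fusion_score(mask, box=(26, 18, 16, 36), s_yolo=0.35)
print(s_result)  # candidates with low S_Result would be discarded in step 5.4
```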
In summary, due to the adoption of the technical scheme, the invention has the beneficial effects that:
1. By using YOLOv3 as a pedestrian candidate region generator to generate a large number of pedestrian candidate boxes, the invention effectively improves pedestrian detection precision;
2. By using the front-end prediction module and semantic segmentation to classify the input images at the pixel level, the invention avoids the coarse detection of regression-based box networks such as YOLOv3, improves target detection capability, and effectively overcomes the insufficient detection precision of a single network;
3. The method uses soft fusion to fuse the pedestrian candidate boxes with the semantic segmentation binary mask, so that the result is output in a refined way; combining the two also improves small-target detection capability and widens the range of application;
4. The pedestrian candidate region generator and the semantic segmentation system are connected in parallel to form the pedestrian detection framework, which enables fast detection; the system can detect pedestrians and other target classes accurately, efficiently and robustly in various challenging scenes.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.
FIG. 1 is a flow chart of a pedestrian detection system of the present invention;
FIG. 2 is a network structure of the candidate pedestrian generator YOLOv3 of FIG. 1 according to the present invention;
FIG. 3 is a coordinate transformation formula diagram of the Bounding Box in FIG. 2 according to the present invention;
FIG. 4 is a schematic diagram of the architecture of the front-end prediction module base network VGG-16 of FIG. 1 in accordance with the present invention;
FIG. 5 is a convolution structure of the 0 th dilation in the semantic segmentation system of FIG. 1 according to the present invention;
FIG. 6 is a convolution structure of the 1 st dilation in the semantic segmentation system of FIG. 1 according to the present invention;
FIG. 7 is a graph showing the results of the soft fusion of FIG. 1 according to the present invention;
FIG. 8 is a diagram illustrating a context network architecture of the semantic segmentation system of FIG. 1 according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the detailed description and specific examples, while indicating the preferred embodiment of the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
It is noted that relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The features and properties of the present invention are described in further detail below with reference to examples.
Example one
A pedestrian detection method based on deep learning multi-network soft fusion, whose flow chart is shown in FIG. 1, comprises two parts that run in parallel: pedestrian candidate region extraction and pedestrian semantic segmentation. The final pedestrian detection result of the whole system is refined through semantic segmentation, the running speed of the system depends on the slower of the two branches, and the two results are finally fused and output by soft fusion. The method specifically comprises the following steps:
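For orientation only, the following is a minimal Python sketch of the two-branch pipeline described above; yolo_branch, seg_branch and fuse_score are hypothetical placeholders for the candidate generator (step 2), the front-end prediction module plus semantic segmentation system (steps 3-4) and the soft-fusion score of step 5, and the score threshold is an assumed value rather than one specified by the invention.

```python
from concurrent.futures import ThreadPoolExecutor

def detect_pedestrians(image, yolo_branch, seg_branch, fuse_score, score_thresh=0.5):
    """Run the candidate-generator branch and the segmentation branch in
    parallel, then soft-fuse their outputs into the final detections."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        cand_future = pool.submit(yolo_branch, image)   # step 2: candidate boxes + confidences
        mask_future = pool.submit(seg_branch, image)    # steps 3-4: binary mask feature map
        candidates, mask = cand_future.result(), mask_future.result()

    detections = []
    for box, s_yolo in candidates:
        s_result = fuse_score(mask, box, s_yolo)        # step 5: S_Result = S_YOLOv3 * S_ss
        if s_result >= score_thresh:                    # step 5.4: discard false detections
            detections.append((box, s_result))
    return detections                                   # step 6: detection output
```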
step 1: and inputting an image to be processed.
Step 2: inputting the image in step 1 into a YOLOv3 pedestrian candidate area generator of the Darknet-53-based network in FIG. 2, and generating a pedestrian candidate area.
Further, the implementation steps of YOLOv3 in step 2 are as follows:
Step 2.1: first, 3 scales (13 × 13, 26 × 26 and 52 × 52) are fused in the YOLOv3 network, and detection is performed independently on the fused feature maps of each scale, which enhances the detection of small targets. Second, the data set is clustered with a K-means clustering algorithm to generate initial Anchor Box values; 3 Anchor Boxes are assigned at each scale, each cell predicts 3 Bounding Boxes corresponding to the 3 Anchor Boxes, and each cell outputs (1+4+C) × 3 values (4 positioning values, 1 confidence score and C conditional class probabilities). Finally, the 4-dimensional position values (t_x, t_y, t_w, t_h) are decoded by the following formulas, as shown in FIG. 3, to obtain the centre coordinates (x, y) and width and height (w, h) of the prediction box:
b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w · exp(t_w)
b_h = p_h · exp(t_h)
wherein: σ(t_x) and σ(t_y) are the offsets of the box centre relative to the top-left grid point of its cell, σ is the Sigmoid activation function, p_w and p_h are the width and height of the prior box, and b_w and b_h are the width and height obtained from the above formulas;
The multi-task training target loss function of YOLO v3, rendered as an image in the original, combines a coordinate regression term over (t_x, t_y, t_w, t_h), a confidence term split between prediction boxes with and without objects, and a class probability term, balanced by the weight λ;
wherein: λ is used to balance the ratio of prediction boxes containing objects to prediction boxes containing no objects; t'_x, t'_y, t'_w and t'_h denote the label values; 1_ij^obj returns 1 if the corresponding ground truth falls in the j-th prediction box of the i-th grid point and 0 otherwise; and 1_ij^noobj returns 0 if the j-th prediction box of the i-th grid point has a corresponding ground truth and 1 otherwise.
Step 2.2: since each pedestrian candidate region is associated with its bounding-box coordinates and confidence score, the confidence threshold that YOLO v3 uses for detected candidate regions is first lowered, a large number of candidate regions are then generated, and all real pedestrians are finally detected.
Step 2.3: load the Darknet-53 pre-trained model obtained on ImageNet, delete its original classifier, and fine-tune on the Cityscapes data set. An Adam optimizer is used during training, and the training samples are expanded by data augmentation such as horizontal flipping and adjustments of angle, exposure, hue and saturation, which improves the generalization of the model and reduces overfitting. The initial learning rate is set to 0.001; after 40000 batches the learning rate is reduced to 1/10, i.e. to 0.0001, and after 45000 batches it is further reduced to 0.00001.
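The step learning-rate schedule quoted above can be written as a small helper; the function below only illustrates the stated values (0.001, then 0.0001 after 40000 batches, then 0.00001 after 45000 batches) and is not code taken from the patent.

```python
def learning_rate(batch, base_lr=0.001):
    """Piecewise-constant schedule: 1/10 after 40000 batches, 1/10 again after 45000."""
    if batch >= 45000:
        return base_lr / 100   # 0.00001
    if batch >= 40000:
        return base_lr / 10    # 0.0001
    return base_lr             # 0.001

for b in (0, 39999, 40000, 45000, 49999):
    print(b, learning_rate(b))
```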
Step 3: input the image from step 1 into the front-end prediction module and output C feature maps.
Further, the specific steps of step 3 are as follows:
and 3.1, converting the complete connection layer in the VGG-16 into a convolutional layer, and deleting the penultimate and third Maxpool and the cross-row layer in the VGG-16 network structure. Specifically, each of the Maxpool layer and the cross-row layer is removed, and for each deleted layer, the convolutions in all layers thereafter are enlarged by a factor of 2, and the convolutions in all layers thereafter are enlarged by a factor of 2 for each deleted layer. Thus, the convolution in the final layer after the two deleted layers is expanded by a factor of 4 and initialized using the parameters of the original classification network, resulting in a higher resolution output. Finally, a feature map is generated at a resolution of 64 × 64.
Step 3.2: adjust the VGG-16 network structure of FIG. 4 as described above to obtain the front-end prediction module, which performs dense prediction.
Step 4: input the C feature maps from step 3 into the semantic segmentation system and output C feature maps containing context information.
Further, the specific steps of step 4 are as follows:
Step 4.1: construct a semantic segmentation system by aggregating multi-scale context information. Its input is the C 64 × 64 preliminary semantic feature maps generated by the front-end prediction module. The semantic segmentation system comprises 8 network layers, whose structure is shown in FIG. 8; the first 7 layers are basic aggregated multi-scale context information modules, and 3 × 3 × C dilated convolution kernels with different expansion factors are applied to the first 7 layers for feature extraction: layer 1 is convolved directly, layers 2 to 6 are convolved with dilation using different expansion factors, and layer 7 is convolved directly. Each convolution is followed by a pointwise truncation max(·, 0), and the feature map size is kept the same before and after each convolution. The last layer, layer 8, performs a 1 × 1 convolution. Finally, the semantic segmentation system is trained so that it outputs C 64 × 64 refined semantic feature maps.
Step 4.2: dilated convolution aggregates multi-scale context information, as shown in FIG. 4 and FIG. 5, and supports exponential expansion of the receptive field without loss of resolution or coverage. The size of the expansion area is (2^(i+2) − 1) × (2^(i+2) − 1) with an expansion factor of 2^i, and the size of the receptive field is ((2^(i+2) − 1) − (2^(i+1) − 2)) × ((2^(i+2) − 1) − (2^(i+1) − 2)), where i = 0, 1, ..., n−2 denotes the i-th dilation. The expansion is stopped when the size of the receptive field substantially matches the input size, so the expansion factors of layers 2 to 6 are 1, 2, 4, 8 and 16, respectively, and the corresponding expanded receptive fields are 5 × 5, 9 × 9, 17 × 17, 33 × 33 and 65 × 65.
Step 4.3: jointly train the semantic segmentation system and the front-end prediction module of step 3 on the Cityscapes data set, with the 'person' and 'rider' classes in Cityscapes set as pedestrians and the remaining classes set as background. The training parameters are set as follows: stochastic gradient descent (SGD) with a mini-batch size of 14 is used; the initial learning rate is set to 0.001, after 40000 batches the learning rate is reduced to 1/10, i.e. to 0.0001, and after 45000 batches it continues to decay to 0.00001, for a total of 60000 batches.
Step 5: fuse the result of the semantic segmentation system with the pedestrian candidate results generated by the pedestrian candidate region generator.
Further, the specific steps of step 5 are as follows:
Step 5.1: generate binary mask feature maps from the semantic feature maps of step 4, where foreground pixels are set to 1 to represent the categories of interest (e.g. pedestrians) and background pixels are set to 0.
Step 5.2: map the coordinate position information of the pedestrian candidate region bounding boxes (b_x, b_y, b_w, b_h) generated by the pedestrian candidate region generator in step 2 onto the binary mask feature maps to obtain the pedestrian candidate region bounding boxes on the binary mask feature maps; scale the pedestrian candidate region bounding boxes on all the binary mask feature maps to the same size as the pedestrian kernel.
Step 5.3: use the soft-fusion scale factor to weight the pixels inside the pedestrian candidate region bounding box on the binary mask feature map with the pedestrian kernel, computed as follows:
S_ss = (1 / A_BB) × Σ_{(i,j)∈BB} Mask(i, j) × Kernel(i, j)
S_Result = S_YOLOv3 × S_ss
wherein: S_ss is the score that the result of the semantic segmentation feature map output by the semantic segmentation system is a pedestrian; S_YOLOv3 is the score that the pedestrian candidate region result output by the pedestrian candidate region generator is a pedestrian; S_Result is the score that the final output result is a pedestrian; A_BB is the area of the bounding box; Mask(i, j) is the binary mask pixel value at (i, j) in the image; Kernel(i, j) is the pedestrian kernel value at (i, j) in the image. The kernel tends to have higher values at the center than at the boundary, which matches the object of interest (e.g. a pedestrian) lying at the center of a bounding box that fits it, and thus enhances detection.
Step 5.4: according to S_Result, remove the bounding boxes of falsely detected pedestrians from the pedestrian candidate regions of step 2 to finally obtain the true pedestrian detection boxes.
Step 6: output the detection image.
In the invention, the pedestrian detection system uses YOLOv3 as a pedestrian candidate region generator to generate a large number of pedestrian candidate boxes, which effectively improves pedestrian detection precision; the front-end prediction module and semantic segmentation classify the input images at the pixel level, which avoids the coarse detection of regression-based box networks such as YOLOv3, improves target detection capability, and effectively overcomes the insufficient detection precision of a single network; soft fusion fuses the pedestrian candidate boxes with the semantic segmentation binary mask so that the result is output in a refined way, and combining the two also improves small-target detection capability and widens the range of application; the pedestrian candidate region generator and the semantic segmentation system are connected in parallel to form the pedestrian detection framework, which enables fast detection; the system can detect pedestrians and other target classes accurately, efficiently and robustly in various challenging scenes.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (4)

1. A pedestrian detection method based on deep learning multi-network soft fusion is characterized in that: the method comprises the following steps:
step 1: inputting an image to be processed;
step 2: inputting the image in the step 1 into a YOLO v3 pedestrian candidate area generator of a network based on Darknet-53 to generate a pedestrian candidate area;
step 3: inputting the image in the step 1 into a front-end prediction module for dense prediction, and outputting C feature maps of higher resolution;
step 4: inputting the C feature maps in the step 3 into a semantic segmentation system, and outputting C binary mask feature maps containing context information;
step 5: performing soft fusion on the result of the semantic segmentation system and the pedestrian candidate result generated by the pedestrian candidate region generator;
step 6: outputting a detection image;
the step 3 comprises the following steps:
step 3.1, modifying the VGG-16 network, converting the fully connected layers into convolutional layers, deleting the second-to-last and third-to-last max-pooling and striding layers in the VGG-16 network structure so as to obtain a front-end prediction module, performing initialization training with the parameters of the original classification network, and outputting feature maps of higher resolution;
step 3.2, performing dense prediction on the image to be detected with the front-end prediction module to generate C 64 × 64 preliminary semantic feature maps.
2. The pedestrian detection method based on deep learning multi-network soft fusion as claimed in claim 1, wherein: the generation of the pedestrian candidate region by the YOLO v3 pedestrian candidate region generator in the step 2 comprises the following steps:
step 2.1, dividing the input picture into S × S cells, assigning each cell 3 pedestrian candidate region bounding boxes to be predicted, and training YOLO v3 to obtain the coordinate position information and confidence corresponding to each predicted pedestrian candidate region bounding box in the picture;
step 2.2, fusing 3 scales in the YOLOv3 network, namely the feature maps downsampled by factors of 32, 16 and 8 with respect to the input layer, and detecting pedestrians independently on the fused feature maps of each scale to obtain the coordinate position information of the pedestrian candidate regions;
secondly, clustering the data set with a K-means clustering algorithm to generate initial anchor box values, assigning 3 anchor boxes at each scale, each cell predicting 3 pedestrian candidate region bounding boxes corresponding to the 3 anchor boxes, so that 9 anchor boxes in total are assigned over the 3 scales;
each cell outputs (1+4+ C) × 3 values, 4 for 4 predicted positioning information, 1 for 1 confidence score, 3 for 3 anchor boxes and C for C conditional class probabilities, where C =1, only pedestrians are classified, so 18 values are output in total; and predicting the coordinate position information of the boundary frame of each pedestrian candidate area by adopting logistic regression:
b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w · exp(t_w)
b_h = p_h · exp(t_h)
wherein: σ is the Sigmoid activation function, (t_x, t_y, t_w, t_h) are the 4 predicted positioning values learned by the YOLO v3 network, p_w and p_h are the preset width and height of the prior box, c_x and c_y are the coordinate offsets of the cell, and (b_x, b_y, b_w, b_h) is the finally predicted coordinate position information of the pedestrian candidate region bounding box;
and 2.3, in the process of YOLO v3 training, enlarging the confidence acceptance range in the original YOLO v3 network, namely lowering the confidence threshold for detected pedestrian candidate regions, and generating a large number of pedestrian candidate regions, so that the candidate regions cover all pedestrians in the image to be detected.
3. The pedestrian detection method based on deep learning multi-network soft fusion as claimed in claim 1, wherein: the semantic segmentation system in the step 4 comprises the following steps:
step 4.1, constructing a semantic segmentation system by aggregating multi-scale context information, its input being the C 64 × 64 preliminary semantic feature maps generated by the front-end prediction module, wherein the semantic segmentation system comprises 8 network layers, the first 7 layers are basic aggregated multi-scale context information modules, and 3 × 3 × C dilated convolution kernels with different expansion factors are applied to the first 7 layers for feature extraction, wherein layer 1 is convolved directly, layers 2 to 6 are convolved with dilation using different expansion factors, and layer 7 is convolved directly; a pointwise truncation max(·, 0) is performed after each convolution, and the feature map size is kept the same before and after each convolution; the last layer, layer 8, performs a 1 × 1 convolution; finally, the semantic segmentation system is trained so that it outputs C 64 × 64 refined semantic feature maps;
step 4.2, dilated convolution aggregates multi-scale context information and supports exponential expansion of the receptive field without losing resolution or coverage, wherein the size of the expansion area is (2^(i+2) − 1) × (2^(i+2) − 1), the expansion factor is 2^i, the size of the receptive field is ((2^(i+2) − 1) − (2^(i+1) − 2)) × ((2^(i+2) − 1) − (2^(i+1) − 2)), and i = 0, 1, ..., n−2 denotes the i-th dilation; the expansion is stopped when the size of the receptive field substantially matches the input size, so the expansion factors of layers 2 to 6 are 1, 2, 4, 8 and 16, respectively, and the corresponding expanded receptive fields are 5 × 5, 9 × 9, 17 × 17, 33 × 33 and 65 × 65, respectively;
and 4.3, jointly training the semantic segmentation system and the front-end prediction module of step 3 on the Cityscapes data set, setting the 'person' and 'rider' classes in the Cityscapes data set as pedestrians and the remaining classes as background, and outputting C binary mask feature maps containing context information.
4. The pedestrian detection method based on deep learning multi-network soft fusion as claimed in any one of claims 1 to 3, wherein: the soft fusion of the step 5 comprises the following specific steps:
step 5.1, generating binary mask feature maps from the semantic feature maps in the step 4, wherein foreground pixels are set to 1 to represent the categories of interest and background pixels are set to 0;
step 5.2, mapping the coordinate position information of the pedestrian candidate region bounding boxes (b_x, b_y, b_w, b_h) generated by the pedestrian candidate region generator in the step 2 onto the binary mask feature maps to obtain the pedestrian candidate region bounding boxes on the binary mask feature maps; scaling the pedestrian candidate region bounding boxes on all the binary mask feature maps to the same size as the pedestrian kernel;
step 5.3, using the soft-fusion scale factor to weight the pixels inside the pedestrian candidate region bounding box on the binary mask feature map with the pedestrian kernel, computed as follows:
S_ss = (1 / A_BB) × Σ_{(i,j)∈BB} Mask(i, j) × Kernel(i, j)
S_Result = S_YOLOv3 × S_ss
wherein: S_ss is the score that the result of the semantic segmentation feature map output by the semantic segmentation system is a pedestrian; S_YOLOv3 is the score that the pedestrian candidate region result output by the pedestrian candidate region generator is a pedestrian; S_Result is the score that the final output result is a pedestrian; A_BB is the area of the bounding box; Mask(i, j) is the binary mask pixel value at (i, j) in the image; Kernel(i, j) is the pedestrian kernel value at (i, j) in the image; the kernel tends to have higher values at the center than at the boundary, which matches the object of interest lying at the center of the bounding box, enhances detection, and its bounding box fits the object of interest;
and 5.4, removing the bounding boxes of falsely detected pedestrians from the pedestrian candidate regions of the step 2 according to the S_Result score, and finally obtaining the true pedestrian detection boxes.
CN201911284456.4A 2019-12-13 2019-12-13 Pedestrian detection method based on deep learning multi-network soft fusion Active CN111027493B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911284456.4A CN111027493B (en) 2019-12-13 2019-12-13 Pedestrian detection method based on deep learning multi-network soft fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911284456.4A CN111027493B (en) 2019-12-13 2019-12-13 Pedestrian detection method based on deep learning multi-network soft fusion

Publications (2)

Publication Number Publication Date
CN111027493A CN111027493A (en) 2020-04-17
CN111027493B true CN111027493B (en) 2022-05-20

Family

ID=70208997

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911284456.4A Active CN111027493B (en) 2019-12-13 2019-12-13 Pedestrian detection method based on deep learning multi-network soft fusion

Country Status (1)

Country Link
CN (1) CN111027493B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111626156B (en) * 2020-05-14 2023-05-09 电子科技大学 Pedestrian generation method based on pedestrian mask and multi-scale discrimination
CN111860160B (en) * 2020-06-16 2023-12-12 国能信控互联技术有限公司 Method for detecting wearing of mask indoors
CN111783784A (en) * 2020-06-30 2020-10-16 创新奇智(合肥)科技有限公司 Method and device for detecting building cavity, electronic equipment and storage medium
CN111931729B (en) * 2020-09-23 2021-01-08 平安国际智慧城市科技股份有限公司 Pedestrian detection method, device, equipment and medium based on artificial intelligence
CN112329660A (en) * 2020-11-10 2021-02-05 浙江商汤科技开发有限公司 Scene recognition method and device, intelligent equipment and storage medium
CN112633086B (en) * 2020-12-09 2024-01-26 西安电子科技大学 Near-infrared pedestrian monitoring method, system, medium and equipment based on multitasking EfficientDet
CN112507904B (en) * 2020-12-15 2022-06-03 重庆邮电大学 Real-time classroom human body posture detection method based on multi-scale features
CN112668560B (en) * 2021-03-16 2021-07-30 中国矿业大学(北京) Pedestrian detection method and system for pedestrian flow dense area
CN112966697B (en) * 2021-03-17 2022-03-11 西安电子科技大学广州研究院 Target detection method, device and equipment based on scene semantics and storage medium
CN113011389B (en) * 2021-04-23 2022-07-26 电子科技大学 Road pedestrian small target detection method based on clustering idea
CN113536985A (en) * 2021-06-29 2021-10-22 中国铁道科学研究院集团有限公司电子计算技术研究所 Depth-of-field attention network-based passenger flow distribution statistical method and device
CN114005268A (en) * 2021-10-21 2022-02-01 广州通达汽车电气股份有限公司 Bus interval scheduling method, device, equipment and storage medium
CN116602663B (en) * 2023-06-02 2023-12-15 深圳市震有智联科技有限公司 Intelligent monitoring method and system based on millimeter wave radar
CN117475389B (en) * 2023-12-27 2024-03-15 山东海润数聚科技有限公司 Pedestrian crossing signal lamp control method, system, equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106709568A (en) * 2016-12-16 2017-05-24 北京工业大学 RGB-D image object detection and semantic segmentation method based on deep convolution network
CN108288075A (en) * 2018-02-02 2018-07-17 沈阳工业大学 A kind of lightweight small target detecting method improving SSD
CN108875595A (en) * 2018-05-29 2018-11-23 重庆大学 A kind of Driving Scene object detection method merged based on deep learning and multilayer feature
CN109063559A (en) * 2018-06-28 2018-12-21 东南大学 A kind of pedestrian detection method returned based on improvement region
CN109508710A (en) * 2018-10-23 2019-03-22 东华大学 Based on the unmanned vehicle night-environment cognitive method for improving YOLOv3 network
CN109543754A (en) * 2018-11-23 2019-03-29 中山大学 The parallel method of target detection and semantic segmentation based on end-to-end deep learning

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108416327B (en) * 2018-03-28 2022-04-29 京东方科技集团股份有限公司 Target detection method and device, computer equipment and readable storage medium
CN108960340B (en) * 2018-07-23 2021-08-31 电子科技大学 Convolutional neural network compression method and face detection method
CN109816100B (en) * 2019-01-30 2020-09-01 中科人工智能创新技术研究院(青岛)有限公司 Salient object detection method and device based on bidirectional fusion network

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106709568A (en) * 2016-12-16 2017-05-24 北京工业大学 RGB-D image object detection and semantic segmentation method based on deep convolution network
CN108288075A (en) * 2018-02-02 2018-07-17 沈阳工业大学 A kind of lightweight small target detecting method improving SSD
CN108875595A (en) * 2018-05-29 2018-11-23 重庆大学 A kind of Driving Scene object detection method merged based on deep learning and multilayer feature
CN109063559A (en) * 2018-06-28 2018-12-21 东南大学 A kind of pedestrian detection method returned based on improvement region
CN109508710A (en) * 2018-10-23 2019-03-22 东华大学 Based on the unmanned vehicle night-environment cognitive method for improving YOLOv3 network
CN109543754A (en) * 2018-11-23 2019-03-29 中山大学 The parallel method of target detection and semantic segmentation based on end-to-end deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
寇大磊; 权冀川; 张仲伟. Research progress of deep-learning-based object detection frameworks. Computer Engineering and Applications, 2019. *

Also Published As

Publication number Publication date
CN111027493A (en) 2020-04-17

Similar Documents

Publication Publication Date Title
CN111027493B (en) Pedestrian detection method based on deep learning multi-network soft fusion
CN108416266B (en) Method for rapidly identifying video behaviors by extracting moving object through optical flow
CN112396002A (en) Lightweight remote sensing target detection method based on SE-YOLOv3
CN111179217A (en) Attention mechanism-based remote sensing image multi-scale target detection method
CN112966691B (en) Multi-scale text detection method and device based on semantic segmentation and electronic equipment
JP2006209755A (en) Method for tracing moving object inside frame sequence acquired from scene
CN111160249A (en) Multi-class target detection method of optical remote sensing image based on cross-scale feature fusion
CN111274981B (en) Target detection network construction method and device and target detection method
CN111401293B (en) Gesture recognition method based on Head lightweight Mask scanning R-CNN
WO2020077940A1 (en) Method and device for automatic identification of labels of image
CN109165658B (en) Strong negative sample underwater target detection method based on fast-RCNN
CN114049381A (en) Twin cross target tracking method fusing multilayer semantic information
CN111553414A (en) In-vehicle lost object detection method based on improved Faster R-CNN
CN112381030B (en) Satellite optical remote sensing image target detection method based on feature fusion
CN114998595B (en) Weak supervision semantic segmentation method, semantic segmentation method and readable storage medium
CN114861842B (en) Few-sample target detection method and device and electronic equipment
CN114882423A (en) Truck warehousing goods identification method based on improved Yolov5m model and Deepsort
Ren et al. Research on infrared small target segmentation algorithm based on improved mask R-CNN
CN111931572B (en) Target detection method for remote sensing image
CN111738069A (en) Face detection method and device, electronic equipment and storage medium
CN116245843A (en) Vehicle paint defect detection and segmentation integrated method based on YOLOv5 frame
CN116091946A (en) Yolov 5-based unmanned aerial vehicle aerial image target detection method
CN115953743A (en) Parking space state identification method based on improved YOLO model
CN114693997A (en) Image description generation method, device, equipment and medium based on transfer learning
CN114332754A (en) Cascade R-CNN pedestrian detection method based on multi-metric detector

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant