CN113033371A - CSP model-based multi-level feature fusion pedestrian detection method - Google Patents
- Publication number
- CN113033371A (Application CN202110295911.1A)
- Authority
- CN
- China
- Prior art keywords
- target
- stage
- network
- size
- center point
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/52—Surveillance or monitoring of activities, e.g. for recognising suspicious objects
- G06V20/53—Recognition of crowd images, e.g. recognition of crowd congestion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Computational Linguistics (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Bioinformatics & Computational Biology (AREA)
- General Health & Medical Sciences (AREA)
- Mathematical Physics (AREA)
- Computing Systems (AREA)
- Molecular Biology (AREA)
- Software Systems (AREA)
- Multimedia (AREA)
- Image Analysis (AREA)
Abstract
A CSP model-based multi-level feature fusion pedestrian detection method adopts the CSP architecture: a CNN extracts pedestrian features, and the network then splits into 3 branches that respectively predict the target center point, the target height and the center-point offset. After image preprocessing, PyConvResNet-101 is used as the feature extraction network to extract feature maps from input images; the feature maps obtained at different stages undergo multi-level fusion to obtain a final feature map, which is sent to the prediction network. The prediction network is trained with Focal Loss and Smooth L1; target detection boxes are generated from the target center point, target height and center-point offset in the prediction maps, and redundant boxes are removed with a non-maximum suppression algorithm to obtain the final detection result. The invention can fully fuse the rich semantic information of high-level feature maps with the rich positional information of low-level feature maps, and effectively reduces false detections and missed detections for small targets and severe occlusion.
Description
Technical Field
The invention relates to the fields of computer vision and target detection, and in particular to a video-oriented pedestrian detection method.
Background
Computer vision has long been a research hotspot and challenge in computer science, and pedestrian detection, as a subtask of target detection, has become a very important research problem in the field. Convolutional Neural Networks (CNNs) have shown great power in computer vision and object detection in recent years, and the development of many CNN-based general target detection methods has advanced research on and applications of pedestrian detection. Nevertheless, pedestrian detection still has considerable room for improvement. The main problem is that the feature information of small targets and severely occluded targets is difficult to extract, resulting in missed detections and false detections. CSP (Center and Scale Prediction) is a pedestrian detection algorithm proposed in 2019 that learns pedestrian features through a CNN and predicts the center-point coordinates and size information of pedestrian targets to complete the pedestrian detection task.
Disclosure of Invention
Aiming at the false and missed detections caused by small targets and severe occlusion in pedestrian detection, the invention provides a CSP model-based multi-level feature fusion pedestrian detection method, which can fully fuse the rich semantic information of high-level feature maps with the rich positional information of low-level feature maps, effectively reducing false detections and missed detections for small targets and severe occlusion.
A multi-level feature fusion pedestrian detection method based on a CSP model comprises the following steps:
step 1, adopting a CSP architecture in the training stage: a CNN extracts pedestrian features and the network then splits into 3 branches that respectively predict the target center point, the target height and the center-point offset; training images are preprocessed (resized to set pixels, randomly cropped and brightness-adjusted), PyConvResNet-101 is used as the feature extraction network, the 4 feature maps from its stages two to five are fused at multiple levels into a final feature map with 1024 channels, random erasing is used for data augmentation, and the three branches are trained with Focal Loss and Smooth L1;

step 2, sending the obtained final feature map to a subsequent prediction network, which first adjusts the channel count of the final feature map to 2^n (n a positive integer, 3 ≤ n ≤ 9) with a 3 × 3 convolution, then predicts the target center point, target height and center-point offset with two 1 × 1 convolutions and one 2 × 2 convolution to generate target detection boxes, and removes redundant boxes with a non-maximum suppression algorithm to obtain the final detection result;
and step 3, in the testing stage, resizing the test image to a specific size and inputting it into the network; the obtained feature maps are fused at multiple levels and sent to the prediction network, which outputs the target center point, the target height and the target center-point offset, and the target width is obtained by multiplying the target height by a coefficient.
Further, the process of step 1 is as follows: the last feature maps p2, p3, p4 and p5 of stages two, three, four and five of the PyConvResNet-101 network are fused at multiple levels, where p2, p3, p4 and p5 are downsampled 4, 8, 16 and 32 times with respect to the width and height of the input image. The multi-level fusion comprises the following steps:

1.1) A deconvolution with kernel size 4 × 4, stride 2 and padding 1 upsamples p5 by 2×, and the result is concatenated with p4 in the channel direction to obtain p4_l1; likewise, p4 is upsampled 2× and concatenated with p3 to obtain p3_l1, and p3 is upsampled 2× and concatenated with p2 to obtain p2_l1;

1.2) The same 4 × 4, stride-2, padding-1 deconvolution upsamples the feature map p4_l1 obtained in 1.1) by 2×, and the result is concatenated with p3_l1 in the channel direction to obtain p3_l2; likewise, p3_l1 is upsampled 2× and concatenated with p2_l1 to obtain p2_l2;

1.3) The same deconvolution upsamples the feature map p3_l2 obtained in 1.2) by 2×, and the result is concatenated with p2_l2 in the channel direction to obtain the final feature map p_out, which is sent to the subsequent prediction network.
Preferably, in step 3, the coefficient is 0.41.
The invention has the beneficial effects that: it adopts the CSP model architecture with PyConvResNet-101 as the feature extraction network and performs multi-level fusion on the 4 feature maps output by that network, so it can fully fuse the rich semantic information of high-level feature maps with the rich positional information of low-level feature maps, effectively reducing false detections and missed detections for small targets and severe occlusion.
Drawings
Fig. 1 is a flow chart of a multilevel feature fusion pedestrian detection method based on a CSP model according to the present invention.
FIG. 2 is a CSP model architecture diagram.
FIG. 3 is a schematic diagram of pyramid convolution.
Fig. 4 is a structure diagram of a multilevel feature fusion pedestrian detection method based on a CSP model.
Fig. 5 compares the effect of the CSP model-based multi-level feature fusion pedestrian detection method with other pedestrian detection techniques on the Caltech dataset, where (a) is the Reasonable subset, (b) the Heavy subset, (c) the Medium subset, (d) the Near subset and (e) the All subset.
Detailed Description
The invention is further illustrated by the following figures and examples.
Referring to fig. 1 to 5, a multilevel feature fusion pedestrian detection method based on a CSP model includes the following steps:
Each convolution kernel of PyConvResNet-101 comprises a multi-layered pyramid structure, each layer containing a different type of convolution kernel (see fig. 3). Pyramid convolution can process an input image with convolution kernels of multiple scales without increasing the computational burden or model complexity. Kernels at different pyramid levels have different sizes and channel counts: kernel size grows from the bottom layer to the top layer, while the number of kernel channels shrinks. To match the channel counts of the kernels at different pyramid layers, grouped convolution is applied to the input feature map.
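The cost-balancing idea above can be illustrated with simple parameter arithmetic: larger kernels receive fewer output channels and more input groups, so the total stays close to a plain 3 × 3 convolution. The level configuration below (kernels 3/5/7/9, groups 1/4/8/16, equal output-channel split) is an assumption for illustration, not the exact PyConvResNet-101 configuration.

```python
# Parameter count of a (grouped) convolution: k^2 * (C_in / groups) * C_out.
def conv_params(k, c_in, c_out, groups=1):
    return k * k * (c_in // groups) * c_out

c_in, c_out = 64, 64
standard = conv_params(3, c_in, c_out)          # plain 3x3 convolution
levels = [(3, 1), (5, 4), (7, 8), (9, 16)]      # (kernel size, groups) per level
pyramid = sum(conv_params(k, c_in, c_out // len(levels), g) for k, g in levels)
# Despite kernels up to 9x9, grouping keeps the pyramid no costlier
# than the single 3x3 baseline.
assert pyramid < standard
```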
The specific steps of random erasing are as follows: for an image I in a batch, the probability of applying random erasing is set to 0.5. For an image of width W and height H, the image area is S = W × H. An erase area S_e is randomly initialized, and the aspect ratio of the erase region is set to r_e ∈ [0.3, 3.3]. The erase region then has height H_e = sqrt(S_e × r_e) and width W_e = sqrt(S_e / r_e). A point a = (x_e, y_e) is randomly initialized on the image I; if x_e + W_e ≤ W and y_e + H_e ≤ H, the region I_e = (x_e, y_e, x_e + W_e, y_e + H_e) is taken as the randomly erased region, otherwise the above steps are repeated until a qualifying I_e occurs. Each pixel inside I_e is assigned a random value in [0, 255].
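The random-erasing procedure above can be sketched as follows. The erase-area ratio range [0.02, 0.4] is an assumption (the values from the original Random Erasing paper); the text here only fixes the probability 0.5 and r_e ∈ [0.3, 3.3].

```python
import numpy as np

def random_erase(img, p=0.5, area_range=(0.02, 0.4),
                 aspect_range=(0.3, 3.3), rng=None):
    """Erase a random rectangle of `img` (H x W x C, uint8) with noise."""
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() > p:                      # apply with probability p
        return img
    h, w = img.shape[:2]
    area = h * w                              # S = W * H
    for _ in range(100):                      # retry until the region fits
        s_e = rng.uniform(*area_range) * area # erase area S_e
        r_e = rng.uniform(*aspect_range)      # aspect ratio r_e = H_e / W_e
        h_e = int(round(np.sqrt(s_e * r_e)))  # H_e = sqrt(S_e * r_e)
        w_e = int(round(np.sqrt(s_e / r_e)))  # W_e = sqrt(S_e / r_e)
        x_e = int(rng.integers(0, w))
        y_e = int(rng.integers(0, h))
        if x_e + w_e <= w and y_e + h_e <= h: # region I_e lies inside the image
            img = img.copy()
            img[y_e:y_e + h_e, x_e:x_e + w_e] = rng.integers(
                0, 256, size=(h_e, w_e) + img.shape[2:], dtype=img.dtype)
            return img
    return img                                # no fitting region found
```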
Referring to fig. 4, the last feature maps p2, p3, p4 and p5 of stages two, three, four and five of the PyConvResNet-101 network are fused at multiple levels, where p2, p3, p4 and p5 are downsampled 4, 8, 16 and 32 times with respect to the width and height of the input image. The fusion proceeds as follows:

1.1) A deconvolution with kernel size 4 × 4, stride 2 and padding 1 upsamples p5 by 2×, and the result is concatenated with p4 in the channel direction to obtain p4_l1; likewise, p4 is upsampled 2× and concatenated with p3 to obtain p3_l1, and p3 is upsampled 2× and concatenated with p2 to obtain p2_l1.

1.2) The same 4 × 4, stride-2, padding-1 deconvolution upsamples the feature map p4_l1 obtained in step 1.1 by 2×, and the result is concatenated with p3_l1 in the channel direction to obtain p3_l2; likewise, p3_l1 is upsampled 2× and concatenated with p2_l1 to obtain p2_l2.

1.3) The same deconvolution upsamples the feature map p3_l2 obtained in step 1.2 by 2×, and the result is concatenated with p2_l2 in the channel direction to obtain the final feature map p_out, which is sent to the subsequent prediction network.
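The choice of a 4 × 4 kernel with stride 2 and padding 1 is what makes each deconvolution exactly double a feature map, so each fusion step aligns a map with the next shallower one. The arithmetic can be checked directly (the 640 × 1280 input size below is an illustrative assumption):

```python
# Transposed-convolution output size: out = (in - 1) * stride - 2 * padding + kernel.
def deconv_out(size, kernel=4, stride=2, padding=1):
    return (size - 1) * stride - 2 * padding + kernel

h, w = 640, 1280   # hypothetical input image size
# p2..p5 are the input downsampled 4x, 8x, 16x and 32x:
sizes = {f"p{i}": (h // s, w // s) for i, s in zip(range(2, 6), (4, 8, 16, 32))}
# Upsampled p5 matches p4, upsampled p4 matches p3, upsampled p3 matches p2,
# so channel-wise concatenation is well-defined at every fusion step.
for hi, lo in (("p5", "p4"), ("p4", "p3"), ("p3", "p2")):
    up = tuple(deconv_out(d) for d in sizes[hi])
    assert up == sizes[lo], (hi, lo, up, sizes[lo])
```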
Focal Loss and Smooth L1 are used as loss functions. Predicting the target center point is a binary classification problem: each position of the feature map is judged to contain a target center (positive sample) or not (negative sample). Because the negative samples immediately around a positive sample are very close to the center point and easily disturb training, a two-dimensional Gaussian mask is placed on each positive sample point during training:

M_ij = max_k exp( -( (i - x_k)^2 / (2 σ_{w,k}^2) + (j - y_k)^2 / (2 σ_{h,k}^2) ) )

where K is the number of targets in the image and (x_k, y_k, w_k, h_k) are the center point, width and height of the k-th target. The Gaussian mask variances σ_{w,k} and σ_{h,k} are proportional to the width and height of the individual target, respectively. Where the Gaussian masks of two targets overlap, the larger of the two values is taken.
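The Gaussian-mask construction can be sketched as follows. The proportionality constant between the standard deviations and the target dimensions (dim / 6 here) is an assumption; the text only states that the variances are proportional to width and height.

```python
import numpy as np

def gaussian_mask(shape, targets):
    """shape: (H, W); targets: list of (cx, cy, w, h) in mask coordinates."""
    H, W = shape
    mask = np.zeros((H, W), dtype=np.float32)
    ys, xs = np.mgrid[0:H, 0:W]
    for cx, cy, w, h in targets:
        sw, sh = w / 6.0, h / 6.0    # assumed sigma-to-size proportionality
        g = np.exp(-((xs - cx) ** 2 / (2 * sw ** 2) +
                     (ys - cy) ** 2 / (2 * sh ** 2)))
        mask = np.maximum(mask, g)   # overlapping masks keep the larger value
    return mask
```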
For the predicted target center point, Focal Loss is adopted:

L_center = -(1/K) Σ_{i,j} [ y_ij (1 - p_ij)^γ log(p_ij) + (1 - y_ij)(1 - M_ij)^β (p_ij)^γ log(1 - p_ij) ]

where p_ij ∈ [0, 1] is the network's estimate of the probability that a target center exists at location (i, j), y_ij ∈ {0, 1} is the ground-truth label (y_ij = 1 marks a positive sample), M_ij is the Gaussian mask, and β and γ are hyper-parameters set to 4 and 2, respectively.
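A sketch of this center-point Focal Loss with β = 4 and γ = 2, using the Gaussian mask M to down-weight negatives near a center. Normalizing by the number of positives K follows the CSP formulation and is an assumption here.

```python
import numpy as np

def center_focal_loss(p, y, M, beta=4.0, gamma=2.0, eps=1e-6):
    """p: predicted probabilities; y: 0/1 labels; M: Gaussian mask (all H x W)."""
    p = np.clip(p, eps, 1 - eps)               # numerical stability
    pos = (y == 1)
    pos_term = ((1 - p) ** gamma) * np.log(p)  # well-classified positives vanish
    neg_term = ((1 - M) ** beta) * (p ** gamma) * np.log(1 - p)
    loss = -np.where(pos, pos_term, neg_term).sum()
    return loss / max(pos.sum(), 1)            # normalize by positive count K
```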
Predicting the target height and center-point offset is a regression problem, for which Smooth L1 is used:

L_reg = Σ_k SmoothL1(s_k - t_k), with SmoothL1(x) = 0.5 x^2 if |x| < 1 and |x| - 0.5 otherwise,

where s_k and t_k are the network's predicted value and the ground-truth value for each positive sample.
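The Smooth L1 regression loss used by the height and offset branches is a standard piecewise function and can be written directly:

```python
import numpy as np

def smooth_l1(s, t):
    """Element-wise Smooth L1: 0.5*x^2 for |x| < 1, |x| - 0.5 otherwise."""
    x = np.abs(np.asarray(s, dtype=float) - np.asarray(t, dtype=float))
    return np.where(x < 1.0, 0.5 * x * x, x - 0.5)

def regression_loss(preds, targets):
    # Sum over the positive samples, as in the branch loss above.
    return smooth_l1(preds, targets).sum()
```

The quadratic zone keeps gradients small for near-correct predictions, while the linear zone bounds the influence of outliers.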
The total loss function is a weighted sum of the three branch loss functions:

L = λ_c L_center + λ_s L_scale + λ_o L_offset

where λ_c, λ_s and λ_o, the weight coefficients of the target center-point classification loss, the scale regression loss and the offset regression loss, are set to 0.01, 1 and 0.1, respectively.
Step 2: The obtained final feature map is sent to the subsequent prediction network. The prediction network first uses a convolution layer with a 3 × 3 kernel, stride 1 and padding 1 to adjust the channel count of the input feature map to 2^n, where n is a positive integer with 3 ≤ n ≤ 9 (n = 8 in this embodiment), and then uses kernels of 1 × 1, 1 × 1 and 2 × 2 to predict the target center point, the target height and the target center-point offset, respectively.
Step 3: In the testing stage, the test image is resized to a specific size and input into the network; the obtained feature maps are fused at multiple levels and sent to the prediction network, which outputs the target center point, the target height and the target center-point offset. The target width is obtained by multiplying the target height by a coefficient of 0.41. Pedestrian prediction boxes are parsed from these outputs, and redundant boxes are finally removed with a non-maximum suppression algorithm to obtain the final pedestrian detection boxes.
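The test-time decoding and suppression steps can be sketched as follows: centers above a score threshold become boxes whose width is 0.41 × height, followed by greedy IoU-based non-maximum suppression. The stride, score threshold and IoU threshold values are illustrative assumptions.

```python
import numpy as np

def decode_boxes(center_map, height_map, offsets, stride=4, thresh=0.5):
    """center_map: (H, W) scores; height_map: (H, W) heights in pixels;
    offsets: (H, W, 2) center-point (dx, dy) refinements."""
    boxes, scores = [], []
    ys, xs = np.where(center_map >= thresh)
    for y, x in zip(ys, xs):
        h = height_map[y, x]
        w = 0.41 * h                               # width from height, per the patent
        cx = (x + offsets[y, x, 0]) * stride       # map feature-grid position
        cy = (y + offsets[y, x, 1]) * stride       # back to image pixels
        boxes.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
        scores.append(center_map[y, x])
    return np.array(boxes), np.array(scores)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression; returns kept box indices."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size:
        i = order[0]
        keep.append(i)
        if order.size == 1:
            break
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_thresh]       # drop overlapping lower scores
    return keep
```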
The CSP model-based multi-level feature fusion pedestrian detection method was trained on the CityPersons and Caltech training sets respectively and tested on the CityPersons validation set and the Caltech test set; the evaluation metric is the average logarithmic miss rate. As shown in table 1, table 2 and fig. 5, compared with the CSP algorithm, the method of the invention improves by 0.8%, 3.1%, 1.0%, 0.1%, 1.8% and 1.0% on subsets of the CityPersons validation set, including the Reasonable, Heavy, Partial, Bare and Large subsets, and by 0.4%, 10.5% and 4.8% on the Reasonable, Heavy and All subsets of the Caltech test set. It also compares favorably with existing pedestrian detection techniques. The experimental results show that the method effectively improves the detection performance of the CSP algorithm on small targets and severely occluded targets.
Table 1 shows the average logarithmic miss rate of each subset on the CityPersons validation set.

Table 1

Table 2 shows the average logarithmic miss rate of each subset on the Caltech test set.

Table 2
Claims (3)
1. A CSP model-based multi-level feature fusion pedestrian detection method is characterized by comprising the following steps:
step 1, adopting a CSP (Center and Scale Prediction) architecture, extracting pedestrian features by using a CNN, then dividing the network into 3 branches to respectively predict a target center point, a target height and a center-point offset; in the training stage, preprocessing a training image and then inputting it into the network, wherein the preprocessing comprises resizing the image to set pixels, randomly cropping the image and adjusting the brightness; extracting pedestrian features by using PyConvResNet-101 as the feature extraction network, and performing multi-level fusion on the 4 feature maps obtained in stages two, three, four and five of the PyConvResNet-101 network to obtain a final feature map, the number of channels of the final feature map being 1024; expanding the training data with random-erasing data augmentation; and training the target center point, target height and center-point offset branches with Focal Loss and Smooth L1;
step 2, sending the obtained final feature map into a subsequent prediction network, wherein the prediction network first adjusts the number of channels of the final feature map to 2^n with a 3 × 3 convolution, n being a positive integer with 3 ≤ n ≤ 9, then predicts the target center point, the target height and the center-point offset with two 1 × 1 convolutions and one 2 × 2 convolution to generate target detection boxes, and removes redundant boxes with a non-maximum suppression algorithm to obtain the final detection result;
and step 3, in the testing stage, resizing the test image to a specific size and inputting it into the network, performing multi-level fusion on the obtained feature maps and sending the result into the prediction network, wherein the prediction network outputs the target center point, the target height and the target center-point offset, and the target width is obtained by multiplying the target height by a coefficient.
2. The CSP model-based multi-level feature fusion pedestrian detection method according to claim 1, characterized in that the process of step 1 is as follows: the last feature maps p2, p3, p4 and p5 of stages two, three, four and five of the PyConvResNet-101 network are fused at multiple levels, where p2, p3, p4 and p5 are downsampled 4, 8, 16 and 32 times with respect to the width and height of the input image, and the multi-level fusion comprises the following steps:

1.1) a deconvolution with kernel size 4 × 4, stride 2 and padding 1 upsamples p5 by 2×, and the result is concatenated with p4 in the channel direction to obtain p4_l1; likewise, p4 is upsampled 2× and concatenated with p3 to obtain p3_l1, and p3 is upsampled 2× and concatenated with p2 to obtain p2_l1;

1.2) the same 4 × 4, stride-2, padding-1 deconvolution upsamples the feature map p4_l1 obtained in 1.1) by 2×, and the result is concatenated with p3_l1 in the channel direction to obtain p3_l2; likewise, p3_l1 is upsampled 2× and concatenated with p2_l1 to obtain p2_l2;

1.3) the same deconvolution upsamples the feature map p3_l2 obtained in 1.2) by 2×, and the result is concatenated with p2_l2 in the channel direction to obtain the final feature map p_out, which is sent into the subsequent prediction network.
3. The CSP model-based multi-level feature fusion pedestrian detection method according to claim 1, characterized in that: in step 3, the coefficient is 0.41.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110295911.1A CN113033371A (en) | 2021-03-19 | 2021-03-19 | CSP model-based multi-level feature fusion pedestrian detection method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110295911.1A CN113033371A (en) | 2021-03-19 | 2021-03-19 | CSP model-based multi-level feature fusion pedestrian detection method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113033371A true CN113033371A (en) | 2021-06-25 |
Family
ID=76471689
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110295911.1A Pending CN113033371A (en) | 2021-03-19 | 2021-03-19 | CSP model-based multi-level feature fusion pedestrian detection method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113033371A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113723322A (en) * | 2021-09-02 | 2021-11-30 | 南京理工大学 | Pedestrian detection method and system based on single-stage anchor-free frame |
WO2023001059A1 (en) * | 2021-07-19 | 2023-01-26 | 中国第一汽车股份有限公司 | Detection method and apparatus, electronic device and storage medium |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |