CN111104921A - Multi-mode pedestrian detection model and method based on Faster rcnn - Google Patents

Multi-mode pedestrian detection model and method based on Faster rcnn

Info

Publication number
CN111104921A
Authority
CN
China
Prior art keywords
map
depth
network
color
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911390948.1A
Other languages
Chinese (zh)
Inventor
柯良军
陆鑫
孙凯旋
董鹏辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University filed Critical Xian Jiaotong University
Priority to CN201911390948.1A priority Critical patent/CN111104921A/en
Publication of CN111104921A publication Critical patent/CN111104921A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition
    • G06V 40/23 Recognition of whole body movements, e.g. for sport training
    • G06V 40/25 Recognition of walking or running movements, e.g. gait recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/50 Depth or shape recovery
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/80 Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30196 Human being; Person

Abstract

A multi-modal pedestrian detection model and method based on Faster rcnn comprise input data alignment processing and a parallel feature extraction network; the results obtained by the parallel feature extraction network are processed through a subsequent RPN network and a classification network so as to perform category classification and position regression. The method effectively judges the position of pedestrians in videos or pictures while avoiding false detection when pedestrians occlude one another and missed detection when objects occlude the human body.

Description

Multi-mode pedestrian detection model and method based on Faster rcnn
Technical Field
The invention relates to the technical field of pedestrian detection models, in particular to a multi-modal pedestrian detection model and method based on Faster rcnn.
Background
Human body detection is one of the most widely applied research directions in the field of computer vision, and also one of its key and difficult problems. The human body detection problem is to judge whether a human body exists in a video or picture and, if so, to output its position. Human body detection has important practical value in fields such as unmanned driving, intelligent security and home service robots, and is a premise and basis for numerous applications such as human body behaviour and gait analysis, human identity recognition and pedestrian tracking. Early human body detection was generally performed on color images; with the continuous development of deep learning methods, the information contained in color images is now exploited nearly to saturation. Because color images suffer from inherent defects such as sensitivity to illumination change, using color images alone for human body detection has little remaining potential.
The depth map contains depth information of the external environment and thus represents the geometric shape of objects; at the same time, it has good illumination invariance that the color map does not possess. For these reasons, research on human detection based on RGB-D multi-modal data is increasingly active in computer vision, robotics and other disciplines.
Most existing pedestrian detection algorithms are single-input networks that take only RGB images as input, and are therefore easily affected by the brightness, contrast and blur of the RGB images; meanwhile, the holistic features such models can extract for occluded pedestrians have low discriminative power.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention aims to provide a multi-modal pedestrian detection model and method based on Faster rcnn, which can effectively judge the position of a pedestrian in a video or picture while avoiding false detection when pedestrians occlude one another and missed detection when objects occlude the human body.
In order to achieve the purpose, the invention adopts the technical scheme that:
a multimode pedestrian detection model based on Faster rcnn comprises input data alignment processing and a parallel feature extraction network, wherein results obtained by the parallel feature extraction network are processed through a subsequent RPN network and a classification network, so that category classification and position regression are performed; the input data alignment processing adopts a Zhang Zhengyou camera calibration method to calibrate a depth camera, a depth map is converted into a color map image coordinate system, then overlapped parts in the color map and the depth map are intercepted and stored respectively to obtain a group of aligned color map and depth map, when feature maps of different modes are merged, color map features and depth map features at the same position can be merged together to play a role together, and a parallel feature extraction network respectively extracts the features of color map data and depth map data by using two independent convolutional neural networks to serve as the basis for the feature fusion of the subsequent two modes.
A multi-modal pedestrian detection method based on Faster rcnn comprises the following steps:
firstly, input data alignment processing;
secondly, parallel feature extraction network;
thirdly, the result obtained by the parallel feature extraction network is processed through a subsequent RPN network and a classification network so as to perform class classification and position regression.
The input data alignment processing specifically comprises:
the method comprises the following steps: the method comprises the steps that a Microsoft 2 generation Kinect depth sensor is used for collecting, 5 scenes in real life are included, and various human body postures are included;
step two: calibrating the depth camera by adopting a Zhang Zhengyou camera calibration method, converting the depth map into a color image coordinate system, then intercepting overlapped parts in the color image and the depth map, and respectively storing to obtain a group of aligned color images and depth maps;
step three: and (3) encoding the depth map by a Jet color map to obtain a depth map and a color map intercepted in a color map image coordinate system, and sending the depth map and the color map into a pedestrian detection model.
The parallel feature extraction network specifically comprises:
the method comprises the following steps: extracting deep characteristic information from the input color image and the input depth image by using different characteristic extraction networks to obtain a characteristic image;
step two: carrying out L2 normalization processing on the feature map obtained in the last step;
Assume that the original input pictures input in parallel are $(I_{RGB}, I_{Depth})$. After feature extraction through the convolutional neural networks, a group of parallel feature maps $(f_{RGB}, f_{Depth})$ is obtained. Suppose a feature map $f$ in $(f_{RGB}, f_{Depth})$ has size $r \times c$; the feature map $\hat{u}_f$ after L2 normalization is:

$$\hat{u}_f = \frac{f}{\|f\|_2}$$

wherein:

$$\|f\|_2 = \left( \sum_{i=1}^{r} \sum_{j=1}^{c} f_{i,j}^2 \right)^{1/2}$$
after the two groups of feature maps are respectively subjected to L2 normalization, the numerical values of the two groups of feature maps are scaled to the same scale, and the two groups of feature maps jointly play a role in the final detection result;
step three: for normalized feature maps
Figure BDA0002344928510000041
Characteristic map of each channel in the system
Figure BDA0002344928510000042
Designing a scale parameter gammaiAmplifying the channel characteristic diagram in a certain proportion, and amplifying the scale parameter to obtain a characteristic diagram FiComprises the following steps:
Figure BDA0002344928510000043
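The L2 normalization and scale amplification described above can be sketched as a small NumPy routine. The (C, r, c) channel layout and the fixed gamma values are illustrative assumptions; in the invention the scale parameters are learned.

```python
import numpy as np

def l2_normalize_and_scale(f: np.ndarray, gamma: np.ndarray) -> np.ndarray:
    """L2-normalize each channel of a (C, r, c) feature map, then amplify
    channel i by its scale parameter gamma_i: F_i = gamma_i * u_i."""
    # ||f||_2 per channel: square root of the sum of squared activations.
    norms = np.sqrt((f ** 2).sum(axis=(1, 2), keepdims=True)) + 1e-12
    u = f / norms                      # u_f = f / ||f||_2
    return gamma[:, None, None] * u    # F_i = gamma_i * u_i
```

After this step the L2 norm of channel i equals gamma_i, which is what places the RGB and depth feature maps on a common scale before fusion.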
The result obtained by the parallel feature extraction network is then processed through the subsequent RPN network and classification network so as to perform class classification and position regression.
The subsequent RPN network and classification network are consistent with those of the Faster RCNN network.
$I_{RGB}$ represents the RGB input image and $I_{Depth}$ the depth-map input image. $f_{RGB}$ represents the RGB feature map output by the feature extraction layer, and $f_{Depth}$ the depth-map feature map output by the feature extraction layer. $\hat{u}_{RGB}$ and $\hat{u}_{Depth}$ represent the corresponding normalized feature maps. $\gamma_i$ is the scale parameter corresponding to the $i$-th feature map, and $F_i$ is the $i$-th feature map after amplification by the scale parameter.
The invention has the beneficial effects that:
the method and the device introduce the information of the depth map as auxiliary information of pedestrian detection, can effectively overcome the problem that RGB images are sensitive to illumination and pedestrian shielding, and improve the performance of a pedestrian detection network; and a characteristic block algorithm is introduced, so that the local discrimination of the pedestrian under the shielding condition is effectively improved.
Drawings
Fig. 1 is an overall technical flow diagram.
FIG. 2 is a schematic operational flow diagram.
Fig. 3 is a schematic diagram of a laboratory shot.
Fig. 4 is a schematic diagram of a conference room shot.
Fig. 5 is a schematic view of office photography.
Fig. 6 is a diagram of a corridor shot.
Fig. 7 is a schematic view of hall photography.
Fig. 8 is a diagram of erroneous detection in verification.
FIG. 9 is a schematic diagram of the improvement on false detection during verification.
FIG. 10 is a diagram illustrating the detection results without the parallel network.
FIG. 11 is a diagram illustrating the detection results with the parallel network.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings.
As shown in fig. 1 and 2:
the input data comprises RGB data and correspondingly aligned depth map data, the feature maps of the corresponding data are respectively extracted through a feature extraction network, the value range of the feature values of the whole RGB feature maps is larger than that of the depth maps, so that two groups of feature maps need to be normalized respectively, the feature values of the two groups of feature maps are distributed in the same value range, the pedestrian detection is performed quite effectively, meanwhile, the feature data of the depth maps comprise depth information blocking pedestrians, and the blocked pedestrians can be better detected.
The color image and depth image data are used as input data of the pedestrian detection model, the target detection model Faster RCNN is used as the basic detection framework, a parallel feature extraction network is designed to integrate the multi-modal input data, and depth information is introduced to improve the network's capability to detect occluded pedestrians.
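A minimal PyTorch sketch of this parallel two-stream design follows. The tiny stand-in backbones and the element-wise addition used as the fusion operator are assumptions for illustration; the text does not fix the backbone architecture or the exact fusion operator.

```python
import torch
import torch.nn as nn

class ParallelFeatureExtractor(nn.Module):
    """Two independent convolutional backbones, one per modality; their
    L2-normalized, scale-amplified feature maps are fused (here by
    element-wise addition) before being handed to the RPN."""
    def __init__(self, channels: int = 16):
        super().__init__()
        def backbone():
            return nn.Sequential(
                nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(),
                nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.ReLU(),
            )
        self.rgb_net = backbone()      # colour stream
        self.depth_net = backbone()    # Jet-encoded depth stream
        # One learnable scale parameter per channel and per stream.
        self.gamma_rgb = nn.Parameter(torch.ones(channels))
        self.gamma_depth = nn.Parameter(torch.ones(channels))

    @staticmethod
    def l2n(f: torch.Tensor, gamma: torch.Tensor) -> torch.Tensor:
        # Per-channel L2 norm over the spatial dimensions, then gamma scaling.
        n = f.flatten(2).norm(dim=2, keepdim=True).unsqueeze(-1) + 1e-12
        return gamma.view(1, -1, 1, 1) * f / n

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        f_rgb = self.l2n(self.rgb_net(rgb), self.gamma_rgb)
        f_dep = self.l2n(self.depth_net(depth), self.gamma_depth)
        return f_rgb + f_dep   # fused map passed on to the RPN / classifier
```

Because both streams are normalized to a common scale before fusion, neither modality dominates the fused map regardless of the raw activation ranges of the two backbones.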
As shown in fig. 2:
firstly, input data alignment processing:
the method comprises the following steps: calibrating the depth camera by adopting a Zhang Zhengyou camera calibration method, converting the depth map into a color image coordinate system, then intercepting overlapped parts in the color image and the depth map, and respectively storing to obtain a group of aligned color images and depth maps;
step two: encoding the depth map by a Jet color map to obtain an equal depth map and an original color picture, and sending the equal depth map and the original color picture into a pedestrian detection model;
secondly, parallel feature extraction network:
the method comprises the following steps: extracting deep characteristic information from the input color image and depth image by using different characteristic extraction networks;
step two: carrying out L2 normalization processing on the feature map obtained in the last step;
Assume that the original input pictures input in parallel are $(I_{RGB}, I_{Depth})$. After feature extraction by the convolutional neural networks, a set of feature maps $(f_{RGB}, f_{Depth})$ is obtained. The feature maps are often multi-channel, and all channels are operated on taking a single-channel feature map as the unit. Suppose a feature map $f$ in $(f_{RGB}, f_{Depth})$ has size $r \times c$; the feature map $\hat{u}_f$ after L2 normalization is:

$$\hat{u}_f = \frac{f}{\|f\|_2}$$

wherein:

$$\|f\|_2 = \left( \sum_{i=1}^{r} \sum_{j=1}^{c} f_{i,j}^2 \right)^{1/2}$$
after the two groups of feature maps are respectively subjected to L2 normalization, the numerical values of the two groups of feature maps are scaled to the same scale, and the two groups of feature maps jointly play a role in the final detection result;
step three: for normalized feature maps
Figure BDA0002344928510000063
Characteristic map of each channel in the system
Figure BDA0002344928510000064
Designing a scale parameter gammaiAmplifying the channel characteristic diagram in a certain proportion, and amplifying the scale parameter to obtain a characteristic diagram FiComprises the following steps:
Figure BDA0002344928510000071
In the method, the scale parameter corresponding to each channel of the feature map is obtained by learning through the back-propagation (BP) algorithm; automatically learned scale parameters better improve the robustness of network training;
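That the scale parameters can be learned by back-propagation is easy to illustrate with a few lines of PyTorch autograd; the feature-map size and the plain sum loss below are arbitrary stand-ins, not the patent's training objective.

```python
import torch

# A channel scale parameter learned by back-propagation: mark gamma as
# requiring gradients and let the optimizer update it along with the
# convolutional weights.
feat = torch.randn(1, 4, 8, 8)             # stand-in normalized feature map u
gamma = torch.ones(4, requires_grad=True)  # one scale parameter per channel
F = gamma.view(1, -1, 1, 1) * feat         # F_i = gamma_i * u_i
loss = F.sum()                             # stand-in loss
loss.backward()
# d(loss)/d(gamma_i) is the sum of channel i's activations, so each gamma_i
# receives a gradient and can be updated by any standard optimizer.
```

An optimizer step such as torch.optim.SGD([gamma] + list(model.parameters()), lr=...) would then adjust the scales jointly with the network.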
thirdly, the result obtained by the parallel feature extraction network is processed through a subsequent RPN network and a classification network so as to perform class classification and position regression.
The subsequent RPN network and classification network are consistent with those of the Faster RCNN network.
There are only two classification results: pedestrian or not.
As shown in fig. 3 to 7: there are 2647 aligned color and depth maps, 5372 human examples. The human body examples comprise various human body postures such as standing posture, sitting posture and the like. The details of this data set are shown in the table below.
(Table of database examples per scene — original table image not recoverable.)
the 2647 pairs of pictures were randomly assigned training and testing sets in a 9:1 ratio.
As shown in fig. 8 to 9: the improvement of the parallel fast RCNN on false detection can effectively overcome the problem that RGB images are sensitive to illumination and pedestrian shielding, and the performance of a pedestrian detection network is improved; and a characteristic block algorithm is introduced, so that the local discrimination of the pedestrian under the shielding condition is effectively improved.

Claims (5)

1. A multi-modal pedestrian detection model based on Faster rcnn, characterized by comprising input data alignment processing and a parallel feature extraction network, wherein the result obtained by the parallel feature extraction network is processed through a subsequent RPN network and a classification network so as to carry out category classification and position regression; the input data alignment processing calibrates a depth camera with Zhang Zhengyou's camera calibration method, converts the depth map into the color-image coordinate system, then intercepts the overlapping parts of the color map and the depth map and stores them respectively to obtain a group of aligned color and depth maps, so that when feature maps of different modalities are merged, color-map features and depth-map features at the same position are merged together and act jointly; and the parallel feature extraction network extracts the features of the color-map data and the depth-map data respectively with two independent convolutional neural networks, as the basis for the subsequent fusion of the features of the two modalities.
2. A multi-modal pedestrian detection method based on Faster rcnn, characterized by comprising the following steps:
firstly, input data alignment processing;
secondly, parallel feature extraction network;
thirdly, the result obtained by the parallel feature extraction network is processed through a subsequent RPN network and a classification network so as to perform class classification and position regression.
3. The multi-modal pedestrian detection method according to claim 2, wherein the input data alignment process specifically comprises:
the method comprises the following steps: the method comprises the steps that a Microsoft 2 generation Kinect depth sensor is used for collecting, 5 scenes in real life are included, and various human body postures are included;
step two: calibrating the depth camera by adopting a Zhang Zhengyou camera calibration method, converting the depth map into a color image coordinate system, then intercepting overlapped parts in the color image and the depth map, and respectively storing to obtain a group of aligned color images and depth maps;
step three: and (3) encoding the depth map by a Jet color map to obtain a depth map and a color map intercepted in a color map image coordinate system, and sending the depth map and the color map into a pedestrian detection model.
4. The multi-modal pedestrian detection method based on Faster rcnn according to claim 2, characterized in that the parallel feature extraction network specifically comprises:
step one: deep feature information is extracted from the input color image and the input depth image with different feature extraction networks to obtain feature maps;
step two: carrying out L2 normalization processing on the feature map obtained in the last step;
Assume that the original input pictures input in parallel are $(I_{RGB}, I_{Depth})$. After feature extraction through the convolutional neural networks, a group of parallel feature maps $(f_{RGB}, f_{Depth})$ is obtained. Suppose a feature map $f$ in $(f_{RGB}, f_{Depth})$ has size $r \times c$; the feature map $\hat{u}_f$ after L2 normalization is:

$$\hat{u}_f = \frac{f}{\|f\|_2}$$

wherein:

$$\|f\|_2 = \left( \sum_{i=1}^{r} \sum_{j=1}^{c} f_{i,j}^2 \right)^{1/2}$$
after the two groups of feature maps are respectively subjected to L2 normalization, the numerical values of the two groups of feature maps are scaled to the same scale, and the two groups of feature maps jointly play a role in the final detection result;
step three: for normalized feature maps
Figure FDA0002344928500000023
Characteristic map of each channel in the system
Figure FDA0002344928500000024
Designing a scale parameter gammaiAmplifying the channel characteristic diagram in a certain proportion, and amplifying the scale parameter to obtain a characteristic diagram FiComprises the following steps:
Figure FDA0002344928500000031
and processing the result obtained by the parallel feature extraction network through a subsequent RPN network and a classification network so as to perform class classification and position regression.
5. The method according to claim 4, wherein the subsequent RPN network and classification network are identical to those of the Faster RCNN network.
CN201911390948.1A 2019-12-30 2019-12-30 Multi-mode pedestrian detection model and method based on Faster rcnn Pending CN111104921A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911390948.1A CN111104921A (en) 2019-12-30 2019-12-30 Multi-mode pedestrian detection model and method based on Faster rcnn

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911390948.1A CN111104921A (en) 2019-12-30 2019-12-30 Multi-mode pedestrian detection model and method based on Faster rcnn

Publications (1)

Publication Number Publication Date
CN111104921A true CN111104921A (en) 2020-05-05

Family

ID=70425119

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911390948.1A Pending CN111104921A (en) 2019-12-30 2019-12-30 Multi-mode pedestrian detection model and method based on Faster rcnn

Country Status (1)

Country Link
CN (1) CN111104921A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022011560A1 (en) * 2020-07-14 2022-01-20 Oppo广东移动通信有限公司 Image cropping method and apparatus, electronic device, and storage medium
WO2022104618A1 (en) * 2020-11-19 2022-05-27 Intel Corporation Bidirectional compact deep fusion networks for multimodality visual analysis applications

Citations (3)

Publication number Priority date Publication date Assignee Title
CN106203506A (en) * 2016-07-11 2016-12-07 上海凌科智能科技有限公司 A kind of pedestrian detection method based on degree of depth learning art
CN109766856A (en) * 2019-01-16 2019-05-17 华南农业大学 A kind of method of double fluid RGB-D Faster R-CNN identification milking sow posture
CN110276265A (en) * 2019-05-27 2019-09-24 魏运 Pedestrian monitoring method and device based on intelligent three-dimensional solid monitoring device


Non-Patent Citations (1)

Title
Liu Zhuang et al., "Application of dual-channel Faster rcnn in RGB-D hand detection", Computer Science *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200505