CN111738088B - Pedestrian distance prediction method based on monocular camera - Google Patents


Info

Publication number
CN111738088B
CN111738088B (application CN202010450217.8A)
Authority
CN
China
Prior art keywords
pedestrian
distance
model
camera
detection
Prior art date
Legal status
Active
Application number
CN202010450217.8A
Other languages
Chinese (zh)
Other versions
CN111738088A (en)
Inventor
钱学明
杨瑾
邹屹洋
侯兴松
Current Assignee
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date
Filing date
Publication date
Application filed by Xian Jiaotong University filed Critical Xian Jiaotong University
Priority to CN202010450217.8A priority Critical patent/CN111738088B/en
Publication of CN111738088A publication Critical patent/CN111738088A/en
Application granted granted Critical
Publication of CN111738088B publication Critical patent/CN111738088B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • GPHYSICS
    • G01MEASURING; TESTING
    • G01CMEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C3/00Measuring distances in line of sight; Optical rangefinders
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection


Abstract

The invention discloses a pedestrian distance prediction method based on a monocular camera, comprising the following steps: determining a pedestrian head height-pedestrian camera distance model using a monocular camera, and acquiring a video; labeling a pedestrian detection and pedestrian distance sample set; building a convolutional neural network model; training the obtained convolutional neural network model on the training samples to obtain a pedestrian detection and distance prediction model; and inputting the picture to be detected into the trained pedestrian detection and distance prediction model to obtain the coordinates, confidence and distance of each pedestrian. The invention exploits the strengths of deep-learning detection methods, maintaining high accuracy and good robustness: it detects pedestrians and predicts their distance from the camera accurately using only a low-cost monocular camera, and it remains reliable, still predicting pedestrian distance normally, even when a pedestrian is close to the camera or partially occluded.

Description

Pedestrian distance prediction method based on monocular camera
Technical Field
The invention belongs to the technical field of computer digital image processing and pattern recognition, and particularly relates to a pedestrian distance prediction method based on a monocular camera.
Background
Ensuring pedestrian safety is one of the key goals of road traffic safety systems, which makes pedestrian detection a core component of advanced driver assistance systems (ADAS).
Currently, most pedestrian detection in ADAS relies on visual detection methods. From the early methods based on background modeling and statistical learning to the recent pedestrian detection models based on deep neural networks, detection performance in this field has steadily improved. Pedestrian detection models based on deep neural networks in particular have become a research hotspot owing to their higher detection accuracy and better robustness.
Pedestrian distance information can also be acquired with ranging equipment such as lidar, but lidar has the drawback of high cost.
Disclosure of Invention
The invention aims to provide a pedestrian distance prediction method based on a monocular camera, which is used for completing the detection of pedestrians and the distance estimation of the pedestrians from the monocular camera so as to reduce the cost of the pedestrian distance prediction.
In order to achieve the purpose, the invention adopts the following technical scheme:
a pedestrian distance prediction method based on a monocular camera comprises the following steps:
step 1: determining a pedestrian head height-pedestrian camera distance model by using a monocular camera, and acquiring a video;
Step 2: labeling pedestrian detection and pedestrian distance to obtain a training sample set;
Step 3: adding a structural branch for pedestrian distance regression prediction to the deep learning target detection model to construct a convolutional neural network model;
Step 4: training the convolutional neural network model obtained in step 3 with the training sample set obtained in step 2 to obtain a trained convolutional neural network model serving as the pedestrian detection and distance prediction model;
Step 5: inputting the pictures collected by the monocular camera into the pedestrian detection and distance prediction model obtained in step 4 to obtain the coordinates, confidence and distance of each pedestrian, completing pedestrian distance prediction based on the monocular camera.
Further, step 1 uses the monocular camera with its mounting height fixed and, combining the camera's intrinsic parameters with the pinhole imaging principle and triangle similarity, determines the pedestrian head height-pedestrian camera distance model: assuming the pedestrian's head height is h and the actual distance between the pedestrian and the camera is d, a conversion coefficient a is obtained such that d = h × a.
Further, step 2 specifically includes:
2.1 Carrying out frame extraction processing on the video acquired in the step 1, wherein the frame extraction interval is 25 frames, and obtaining an initial frame picture of the video;
2.2 Judging the initial frame picture by using a motion blur algorithm, and removing a blur picture in the initial frame picture;
2.3 ) Labeling the initial frame pictures after the blurred frames are removed, marking both the pedestrian and the pedestrian's head: the pedestrian class is labeled person and the head is labeled head; after labeling, each picture generates an xml file in the corresponding Pascal VOC format;
2.4 ) Processing the labeled initial xml files: using the pedestrian head height-pedestrian camera distance model obtained in step 1, converting the pixel height of each head into the distance between the pedestrian and the camera, adding the distance to the xml file as a dist attribute, and deleting the pedestrian head boxes from the initial xml file to obtain the final xml file.
Further, the deep learning target detection model in step 3 is YOLO, Faster-RCNN, SSD or RetinaNet.
Further, step 3 specifically includes:
3.1 ) Modifying the data-reading part of the deep learning target detection model: reading of the distance labels in the xml files is added, so that the pedestrian distance is read in addition to the pedestrian position coordinates;
3.2 A pedestrian distance prediction branch is added to the modified deep learning target detection model of step 3.1).
In step 3.2), the distance prediction branch consists of 5 convolutional layers; each convolution kernel is 3 × 3 with stride 1 and padding 1; the first four convolutional layers output 256 channels, and the number of output channels of the last convolution equals the number of anchor boxes. ResNet50 is selected as the base network for feature extraction of the target detection network, a 5-level feature pyramid performs feature fusion on the extracted features, and a coordinate regression branch, a target category branch and a distance prediction branch are attached after each feature level; the 5 feature maps of the feature pyramid FPN each have 256 channels, with sizes 100 × 136, 50 × 68, 25 × 34, 13 × 17 and 7 × 9, respectively.
Further, step 4 specifically includes:
4.1 ) MSELoss is added to the network's total loss function Loss to constrain the pedestrian distance branch; MSELoss is calculated as l(x, y) = (x − y)², where x is the pedestrian distance predicted by the model and y is the labeled, i.e. actual, pedestrian distance. The network's total loss function consists of 3 parts: FocalLoss for pedestrian confidence, SmoothL1Loss for coordinate regression, and the mean-square-error loss MSELoss for distance regression; the pedestrian distance loss function here may also use other regression loss functions, including but not limited to MSELoss;
4.2 ) Model training: the framework used for training is PyTorch; the model's base network ResNet50 uses a model pre-trained on the ImageNet classification task, and training proceeds from it with a fine-tuning strategy; the optimization algorithm used in training is mini-batch stochastic gradient descent, training 24 epochs in total with 4 samples per batch; the trained convolutional neural network model performs both pedestrian detection and pedestrian distance prediction.
Further, step 5 specifically includes:
5.1 ) Inputting the pictures collected from the monocular camera into the pedestrian detection and distance prediction model to obtain the network output: after detection, the network outputs n 6-dimensional vectors, the nth detection result vector being R_n = {x_n, y_n, w_n, h_n, s_n, d_n}, where {x_n, y_n, w_n, h_n} are respectively the top-left corner coordinates of the coordinate box corresponding to the nth detection result and the box's width and height, s_n is the confidence that the nth detection result is a pedestrian, and d_n is the predicted pedestrian distance of the nth detection result;
5.2 ) A confidence threshold of 0.5 is set, and all detection results whose confidence s_n is below the confidence threshold are deleted;
5.3 ) Sorting all results remaining after step 5.2) by confidence s from large to small; for the second and subsequent detection results, computing the IOU between each position coordinate box R_k ({x_k, y_k, w_k, h_k}), k > 1, and the first-ranked result R_1 according to the formula:

IOU_{1,k} = area(R_1 ∩ R_k) / area(R_1 ∪ R_k)

where IOU_{1,k} is the ratio of the overlap area to the union area of the candidate box ranked 1st and the candidate box ranked kth, area(R_1 ∩ R_k) is the area of the intersection region of the candidate box ranked 1st and the candidate box ranked kth, and area(R_1 ∪ R_k) is the area of their union region;
5.4 ) An IOU threshold of 0.5 is set, and all results whose IOU exceeds the IOU threshold are deleted;
5.5 ) After step 5.4) is finished, the first result in the confidence ranking is taken out and output as a correct result, and steps 5.3) and 5.4) are repeated on the remaining results until the number of remaining results is less than or equal to 1; all results so obtained are the final output.
Compared with the prior art, the invention has the following beneficial effects: the invention can predict the distance between the pedestrian and the camera while detecting the position of the pedestrian, the AP for detecting the pedestrian is more than 95%, and the relative error of the prediction of the distance between the pedestrian and the camera is less than 10%.
Drawings
FIG. 1 is a flowchart of a pedestrian distance prediction method based on a monocular camera according to an embodiment of the present invention;
FIG. 2 is an example of a training sample in an embodiment of the present invention; wherein, fig. 2 (a) is a picture of a training sample, and fig. 2 (b) is a label file of the training sample;
FIG. 3 is a network architecture diagram of a pedestrian distance prediction branch in an embodiment of the present invention;
fig. 4 shows the pedestrian detection and distance prediction results in the embodiment of the present invention.
Detailed Description
The following detailed description of embodiments of the present invention is provided in connection with the accompanying drawings and examples. The following examples are given to illustrate the present invention, but are not intended to limit the scope of the present invention.
Referring to fig. 1, the present invention provides a pedestrian distance prediction method based on a monocular camera, including the following steps:
step 1: determining a pedestrian head height-pedestrian camera distance model by using a monocular camera, and acquiring a video;
The camera is placed at a fixed height; combining the camera's intrinsic parameters with the pinhole imaging principle and triangle similarity, the pedestrian head height-pedestrian camera distance model is determined: assuming the pedestrian's head height is h and the actual distance between the pedestrian and the camera is d, a conversion coefficient a is obtained such that d = h × a.
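The calibration above can be sketched in a few lines of Python. This is a minimal sketch under an assumption the patent does not spell out: the coefficient a is fitted from a handful of (head height, known distance) calibration pairs rather than derived analytically from the intrinsics; both helper names are hypothetical.

```python
def calibrate_conversion_coefficient(samples):
    """Estimate the conversion coefficient a such that d = h * a.

    `samples` is a list of (head_height_h, known_distance_d) calibration
    pairs; a is averaged over them. (Hypothetical helper: the patent only
    states that a follows from the camera intrinsics and mounting height
    via the pinhole model and triangle similarity.)
    """
    assert samples, "need at least one calibration sample"
    return sum(d / h for h, d in samples) / len(samples)


def predict_distance(head_height, a):
    """Apply the patent's head-height-to-distance model: d = h * a."""
    return head_height * a
```

In practice a would be fixed once per camera installation, since the model assumes both the mounting height and the intrinsics stay constant.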
Step 2: marking pedestrian detection and pedestrian distance to obtain a training sample set;
the method is used for generating a data set for training a deep convolutional neural network model, and comprises the following specific steps:
2.1 Carrying out frame extraction processing on the video acquired in the step 1, wherein the frame extraction interval is 25 frames, and obtaining an initial frame picture of the video.
2.2 ) Using a motion-blur algorithm to judge each initial frame picture and remove the blurred pictures; the Laplacian method may be chosen for the motion-blur algorithm: first apply the Laplacian transform to the picture, then compute the variance of the transformed picture; whether the initial frame picture is blurred is judged from the size of the variance, and if the variance is smaller than a set threshold the picture is regarded as blurred;
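The Laplacian blur test can be sketched as below. This is a plain-NumPy illustration (in practice one would typically call an image library's Laplacian operator directly); the threshold value 100 is an assumed placeholder, since the patent does not publish a concrete value.

```python
import numpy as np


def laplacian_variance(gray):
    """Variance of the 3x3 Laplacian response of a grayscale image.

    `gray` is a 2-D array; the response is computed on the interior
    pixels only (border handling is skipped for brevity).
    """
    g = gray.astype(np.float64)
    resp = (-4.0 * g[1:-1, 1:-1]
            + g[:-2, 1:-1] + g[2:, 1:-1]    # vertical neighbours
            + g[1:-1, :-2] + g[1:-1, 2:])   # horizontal neighbours
    return float(resp.var())


def is_blurred(gray, threshold=100.0):
    """A frame is treated as blurred if its Laplacian variance falls
    below the threshold; 100 is a common starting point and must be
    tuned per camera."""
    return laplacian_variance(gray) < threshold
```

A flat (featureless) frame has zero Laplacian variance, while a sharp, textured frame has a large one, which is exactly the behaviour the frame-filtering step relies on.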
2.3 ) Labeling the initial frame pictures after the blurred frames are removed, marking both the pedestrian and the pedestrian's head: the pedestrian class is labeled person and the head is labeled head. Labeling can be done directly with the labelImg tool, and after labeling each image generates an xml file in the corresponding Pascal VOC format; a labeled image is shown in fig. 2 (a);
2.4 ) Processing the labeled initial xml files: using the pedestrian head height-pedestrian camera distance model obtained in step 1, the pixel height of each head is converted into the distance between the pedestrian and the camera; the distance is added to the xml file as a dist attribute, the pedestrian head boxes are deleted from the initial xml at the same time, and the final xml file is obtained; the content of the xml file is shown in fig. 2 (b).
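The annotation rewrite of step 2.4) can be sketched with the standard-library XML tools. Two details are assumptions for illustration, not from the patent: the person/head matching rule (a head box is assigned to the person box that contains it) and storing the distance as a `<dist>` child element; `head_to_dist` stands in for the step-1 conversion model.

```python
import xml.etree.ElementTree as ET


def convert_annotation(xml_text, head_to_dist):
    """Rewrite a Pascal VOC annotation as in step 2.4): derive each
    pedestrian's distance from the pixel height of the matching head
    box, store it on the person object, and drop the head objects.
    `head_to_dist` maps head pixel height to distance (the step-1 model).
    """
    root = ET.fromstring(xml_text)
    objects = root.findall("object")
    persons = [o for o in objects if o.findtext("name") == "person"]
    heads = [o for o in objects if o.findtext("name") == "head"]

    def box(o):
        b = o.find("bndbox")
        return tuple(int(b.findtext(t)) for t in ("xmin", "ymin", "xmax", "ymax"))

    for head in heads:
        hx1, hy1, hx2, hy2 = box(head)
        for person in persons:
            px1, py1, px2, py2 = box(person)
            # assumed matching rule: head box contained in person box
            if px1 <= hx1 and py1 <= hy1 and hx2 <= px2 and hy2 <= py2:
                ET.SubElement(person, "dist").text = f"{head_to_dist(hy2 - hy1):.2f}"
                break
        root.remove(head)  # head boxes are deleted from the final xml
    return ET.tostring(root, encoding="unicode")
```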
And step 3: and adding a structural branch for pedestrian distance regression prediction on the deep learning target detection model RetinaNet, and constructing to obtain a convolutional neural network model.
The step 3 specifically comprises the following steps:
3.1 ) Modifying the data-reading part of RetinaNet (mainly adding reading of the distance labels in the xml files: besides the pedestrian position coordinates, the pedestrian distance must also be read) and adding support for the pedestrian distance information, since the distance of each pedestrian is newly added to the xml files of the data set and the original RetinaNet contains no handling of pedestrian distance information;
3.2 ) Adding a pedestrian distance prediction branch. The distance prediction branch is similar to the coordinate regression branch and consists of 5 convolutional layers; each convolution kernel is 3 × 3 with stride 1 and padding 1, and the first four convolutional layers are the same as those of the coordinate regression branch; the difference lies in the last layer: the number of output channels of the last convolution of the coordinate regression branch is the number of anchor boxes (Anchor) multiplied by 4, while the number of output channels of the distance prediction branch is the number of anchor boxes (Anchor). ResNet50 is selected as the base network for feature extraction of the target detection network, a 5-level feature pyramid performs feature fusion on the extracted features, and a coordinate regression branch, a target category branch and a distance prediction branch are attached after each feature level. The 5 feature maps of the feature pyramid FPN each have 256 channels, with sizes 100 × 136, 50 × 68, 25 × 34, 13 × 17 and 7 × 9, respectively; fig. 3 shows the structure of the pedestrian distance prediction branch, drawn with level 1 of the FPN as input.
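A plausible PyTorch sketch of the distance prediction branch follows. The ReLU activations between layers and the default of 9 anchors follow the usual RetinaNet head design and are assumptions; the patent itself fixes only the kernel size, stride, padding, layer count and output channel count.

```python
import torch
import torch.nn as nn


class DistancePredictionBranch(nn.Module):
    """Distance-regression head per step 3.2): five 3x3 convolutions,
    stride 1, padding 1; the first four keep the 256 FPN channels and
    the last outputs one channel per anchor (vs. 4 per anchor for the
    coordinate regression branch)."""

    def __init__(self, in_channels=256, num_anchors=9):
        super().__init__()
        layers = []
        for _ in range(4):
            layers += [nn.Conv2d(in_channels, in_channels, 3, stride=1, padding=1),
                       nn.ReLU(inplace=True)]  # assumed activation, as in RetinaNet
        layers.append(nn.Conv2d(in_channels, num_anchors, 3, stride=1, padding=1))
        self.head = nn.Sequential(*layers)

    def forward(self, x):
        # 3x3 / stride 1 / padding 1 preserves the spatial size of each
        # FPN level (e.g. 100 x 136 in, 100 x 136 out)
        return self.head(x)
```

One such branch would be shared across (or attached after) each of the 5 FPN levels, alongside the classification and box-regression heads.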
And 4, step 4: training the convolutional neural network model obtained in the step 3 through the training sample set obtained in the step 2 to obtain a trained convolutional neural network model which is used as a pedestrian detection and distance prediction model;
the step 4 specifically comprises the following steps:
4.1 ) MSELoss is added to the network's total loss function (Loss) to constrain the pedestrian distance branch; it is calculated as l(x, y) = (x − y)², where x is the pedestrian distance predicted by the model and y is the labeled, i.e. actual, pedestrian distance. The training loss consists of 3 parts: FocalLoss for pedestrian confidence, SmoothL1Loss for coordinate regression, and the mean-square-error loss of distance regression (MSELoss). Because the value of the distance-regression mean-square-error loss is large, it must be multiplied by a proportionality coefficient (adjusted according to the magnitude of the actual distance-prediction loss; 0.004 during actual training in this embodiment) to reduce its share of the total loss, so that the 3 losses stay on the same order of magnitude;
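The loss combination of step 4.1) reduces to simple scalar arithmetic; a minimal sketch, where the function names are hypothetical and the two detection losses are taken as already-computed scalars:

```python
def mse_loss(pred, target):
    """The patent's distance loss l(x, y) = (x - y)**2, averaged over a batch
    of predicted/labeled distances."""
    assert len(pred) == len(target)
    return sum((x - y) ** 2 for x, y in zip(pred, target)) / len(pred)


def total_loss(focal_loss, smooth_l1_loss, mse_dist_loss, dist_scale=0.004):
    """Combine the three loss terms of step 4.1); the distance MSE is
    scaled down (0.004 in the embodiment) so all terms stay on the same
    order of magnitude."""
    return focal_loss + smooth_l1_loss + dist_scale * mse_dist_loss
```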
4.2 ) Model training: the framework used for training is PyTorch; the model's base network ResNet50 uses a model pre-trained on the ImageNet classification task, and training proceeds from it with a fine-tuning strategy; the optimization algorithm used in training is mini-batch stochastic gradient descent, training 24 epochs in total with 4 samples per batch. The trained convolutional neural network model performs both pedestrian detection and pedestrian distance prediction;
and 5: and (4) inputting the pictures acquired from the monocular camera into the pedestrian detection and distance prediction model obtained in the step (4) to obtain an output result of the model, and further performing non-maximum suppression on the output result of the model to obtain the coordinates, confidence and predicted distance of the finally detected pedestrian.
The step 5 specifically comprises the following steps:
5.1 ) Inputting the pictures collected from the monocular camera into the pedestrian detection and distance prediction model to obtain the network output: after detection, the network outputs n 6-dimensional vectors, the nth detection result vector being R_n = {x_n, y_n, w_n, h_n, s_n, d_n}, where {x_n, y_n, w_n, h_n} are respectively the top-left corner coordinates (x_n, y_n) of the coordinate box corresponding to the nth detection result and the box width w_n and height h_n, s_n is the confidence that the nth detection result is a pedestrian, and d_n is the predicted pedestrian distance of the nth detection result;
5.2 ) A confidence threshold of 0.5 is set, and all detection results whose confidence s_n is below the threshold are deleted;
5.3 ) Sorting all results by confidence s from large to small; for the second and subsequent candidate boxes R_k, k > 1, computing the IOU with the first-ranked result R_1 according to the formula:

IOU_{1,k} = area(R_1 ∩ R_k) / area(R_1 ∪ R_k)

where IOU_{1,k} is the ratio of the overlap area to the union area of the candidate box ranked 1st and the candidate box ranked kth, area(R_1 ∩ R_k) is the area of the intersection region of the candidate box ranked 1st and the candidate box ranked kth, and area(R_1 ∪ R_k) is the area of their union region;
5.4 ) An IOU threshold of 0.5 is set, and all results whose IOU exceeds the threshold are deleted;
5.5 ) After step 5.4) is finished, the first result in the confidence ranking is taken out and output as a correct result, and steps 5.3) and 5.4) are repeated on the remaining results until the number of remaining results is less than or equal to 1. All results so obtained are the final output. Fig. 4 is an example of the model detection results, in which the dashed boxes are the ground-truth pedestrian boxes, the italic bold text gives the ground-truth pedestrian distance values, the solid boxes are the model predictions, and the non-italic bold text gives the predicted pedestrian confidence and distance values.
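Steps 5.2)-5.5) amount to standard greedy non-maximum suppression over the 6-dimensional result vectors; a self-contained sketch, with the tuple layout following step 5.1):

```python
def iou(box_a, box_b):
    """IOU of two boxes given as (x, y, w, h), (x, y) the top-left corner."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def nms(detections, conf_thresh=0.5, iou_thresh=0.5):
    """Greedy non-maximum suppression over (x, y, w, h, confidence,
    distance) tuples, as in steps 5.2)-5.5): drop low-confidence results,
    sort by confidence, keep the top result, suppress overlaps, repeat."""
    kept = []
    remaining = sorted((d for d in detections if d[4] >= conf_thresh),
                       key=lambda d: d[4], reverse=True)
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [d for d in remaining if iou(best[:4], d[:4]) <= iou_thresh]
    return kept
```

Note that the predicted distance d_n rides along unchanged through suppression: NMS only decides which boxes survive.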
The experimental result shows that the technical scheme can predict the distance between the pedestrian and the camera while detecting the position of the pedestrian, the pedestrian detection AP is more than 95%, and the relative error of the pedestrian distance prediction is less than 10%.
The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and variations can be made without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as the protection scope of the present invention.

Claims (5)

1. A pedestrian distance prediction method based on a monocular camera is characterized by comprising the following steps:
step 1: determining a pedestrian head height-pedestrian camera distance model by using a monocular camera, and acquiring a video;
step 2: marking pedestrian detection and pedestrian distance to obtain a training sample set;
Step 3: adding a structural branch for pedestrian distance regression prediction to the deep learning target detection model to construct a convolutional neural network model;
Step 4: training the convolutional neural network model obtained in step 3 with the training sample set obtained in step 2 to obtain a trained convolutional neural network model serving as the pedestrian detection and distance prediction model;
Step 5: inputting the pictures collected by the monocular camera into the pedestrian detection and distance prediction model obtained in step 4 to obtain the coordinates, confidence and distance of each pedestrian, completing pedestrian distance prediction based on the monocular camera;
the deep learning target detection model in step 3 is YOLO, Faster-RCNN, SSD or RetinaNet;
the step 3 specifically comprises the following steps:
3.1 Modifying a data reading portion in the deep learning target detection model;
3.2 Adding a pedestrian distance prediction branch into the modified deep learning target detection model in the step 3.1);
in step 3.2), the distance prediction branch consists of 5 convolutional layers; each convolution kernel is 3 × 3 with stride 1 and padding 1; the first four convolutional layers output 256 channels, and the number of output channels of the last convolution equals the number of anchor boxes; ResNet50 is selected as the base network for feature extraction of the target detection network, a 5-level feature pyramid performs feature fusion on the extracted features, and a coordinate regression branch, a target category branch and a distance prediction branch are attached after each feature level; the 5 feature maps of the feature pyramid each have 256 channels, with sizes 100 × 136, 50 × 68, 25 × 34, 13 × 17 and 7 × 9, respectively.
2. The pedestrian distance prediction method based on the monocular camera as claimed in claim 1, wherein step 1 uses the monocular camera, fixes the height of the camera, and determines the pedestrian head height-pedestrian camera distance model by combining the internal parameters of the camera and using the pinhole imaging principle and the triangle similarity principle: assuming that the height of the head of the pedestrian is h and the actual distance between the pedestrian and the camera is d, a conversion coefficient a is obtained, so that d = h × a.
3. The pedestrian distance prediction method based on the monocular camera according to claim 1, wherein the step 2 specifically includes:
2.1 Carrying out frame extraction processing on the video acquired in the step 1, wherein the frame extraction interval is 25 frames, and obtaining an initial frame picture of the video;
2.2 Using a motion blur algorithm to judge the initial frame picture and removing a blur picture in the initial frame picture;
2.3 ) Labeling the initial frame pictures after the blurred frames are removed, marking both the pedestrian and the pedestrian's head: the pedestrian class is labeled person and the head is labeled head; after labeling, each picture generates an xml file in the corresponding Pascal VOC format;
2.4 ) Processing the labeled initial xml files: using the pedestrian head height-pedestrian camera distance model obtained in step 1, converting the pixel height of each head into the distance between the pedestrian and the camera, adding the distance to the xml file as a dist attribute, and deleting the pedestrian head boxes from the initial xml file to obtain the final xml file.
4. The pedestrian distance prediction method based on the monocular camera according to claim 1, wherein the step 4 specifically includes:
4.1 ) MSELoss is added to the network's total loss function Loss to constrain the pedestrian distance branch; MSELoss is calculated as l(x, y) = (x − y)²; the network's total loss function consists of 3 parts: FocalLoss for pedestrian confidence, SmoothL1Loss for coordinate regression, and the mean-square-error loss MSELoss for distance regression;
4.2 ) PyTorch is used during training; the model's base network ResNet50 uses a model pre-trained on the ImageNet classification task, and training proceeds from it with a fine-tuning strategy; mini-batch stochastic gradient descent is used during training, training 24 epochs in total with 4 samples per batch; the trained convolutional neural network model performs both pedestrian detection and pedestrian distance prediction.
5. The pedestrian distance prediction method based on the monocular camera according to claim 1, wherein the step 5 specifically includes:
5.1) Input the pictures collected from the monocular camera into the pedestrian detection and distance prediction model to obtain the output of the network; after detection finishes, the network outputs n 6-dimensional vectors, the n-th detection result vector being R_n = {x_n, y_n, w_n, h_n, s_n, d_n}, where {x_n, y_n, w_n, h_n} are respectively the top-left coordinates and the width and height of the coordinate frame of the n-th detection result, s_n is the confidence that the n-th detection result is a pedestrian, and d_n is the predicted pedestrian distance of the n-th detection result;
5.2) Set a confidence threshold of 0.5 and delete all detection results whose confidence s_n is below the threshold;
5.3) Sort all results processed in step 5.2) in descending order of confidence s, and for each position coordinate frame R_k of the second and subsequent detection results (k > 1) compute the IOU with the first-ranked result R_1 using the formula:

IOU_{1,k} = area(R_1 ∩ R_k) / area(R_1 ∪ R_k)

where IOU_{1,k} is the ratio of the overlap area to the union area of the candidate frame in position 1 and the candidate frame in position k, area(R_1 ∩ R_k) is the area of the intersection region of the two frames, and area(R_1 ∪ R_k) is the area of their union region;
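The IOU formula can be computed directly for boxes in the network's (x, y, w, h) format (top-left corner plus width and height, as in the detection vectors of step 5.1); this is a straightforward sketch, not code from the patent.

```python
def iou(box1, box2):
    """IOU of two boxes given as (x, y, w, h) with (x, y) the top-left corner."""
    x1, y1, w1, h1 = box1
    x2, y2, w2, h2 = box2
    # width and height of the intersection rectangle (0 if disjoint)
    ix = max(0.0, min(x1 + w1, x2 + w2) - max(x1, x2))
    iy = max(0.0, min(y1 + h1, y2 + h2) - max(y1, y2))
    inter = ix * iy
    union = w1 * h1 + w2 * h2 - inter
    return inter / union if union > 0 else 0.0
```

Identical boxes give IOU 1, disjoint boxes give 0, and partial overlaps fall in between.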
5.4) Set an IOU threshold of 0.5 and delete all results whose IOU exceeds the threshold;
5.5) After step 5.4) finishes, take the first result in the confidence ranking out as a correct result and output it, then repeat steps 5.3) and 5.4) on the remaining results until at most one result remains in the ranking; all the results obtained in this way constitute the final output.
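Steps 5.2)–5.5) together describe confidence filtering followed by greedy non-maximum suppression over the 6-dimensional detection vectors (x, y, w, h, s, d). A sketch of that post-processing loop, with the 0.5 thresholds given above (the function names are illustrative):

```python
def iou(b1, b2):
    """IOU of two (x, y, w, h) boxes."""
    ix = max(0.0, min(b1[0] + b1[2], b2[0] + b2[2]) - max(b1[0], b2[0]))
    iy = max(0.0, min(b1[1] + b1[3], b2[1] + b2[3]) - max(b1[1], b2[1]))
    inter = ix * iy
    union = b1[2] * b1[3] + b2[2] * b2[3] - inter
    return inter / union if union > 0 else 0.0

def postprocess(detections, conf_thresh=0.5, iou_thresh=0.5):
    """detections: list of (x, y, w, h, s, d) tuples from step 5.1."""
    # step 5.2: drop low-confidence detections
    kept = [r for r in detections if r[4] >= conf_thresh]
    # step 5.3: sort by confidence, descending
    kept.sort(key=lambda r: r[4], reverse=True)
    out = []
    while kept:
        best = kept.pop(0)       # step 5.5: highest-confidence survivor
        out.append(best)
        # steps 5.3-5.4: delete everything overlapping it above the threshold
        kept = [r for r in kept if iou(best[:4], r[:4]) <= iou_thresh]
    return out
```

For example, two heavily overlapping detections of the same pedestrian collapse to the higher-confidence one, while a distant detection survives with its own predicted distance.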
CN202010450217.8A 2020-05-25 2020-05-25 Pedestrian distance prediction method based on monocular camera Active CN111738088B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010450217.8A CN111738088B (en) 2020-05-25 2020-05-25 Pedestrian distance prediction method based on monocular camera


Publications (2)

Publication Number Publication Date
CN111738088A CN111738088A (en) 2020-10-02
CN111738088B true CN111738088B (en) 2022-10-25

Family

ID=72647669


Country Status (1)

Country Link
CN (1) CN111738088B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109325418A (en) * 2018-08-23 2019-02-12 华南理工大学 Based on pedestrian recognition method under the road traffic environment for improving YOLOv3
WO2019177562A1 (en) * 2018-03-15 2019-09-19 Harman International Industries, Incorporated Vehicle system and method for detecting objects and object distance
CN110837775A (en) * 2019-09-30 2020-02-25 合肥合工安驰智能科技有限公司 Underground locomotive pedestrian and distance detection method based on binarization network

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109029363A (en) * 2018-06-04 2018-12-18 泉州装备制造研究所 A kind of target ranging method based on deep learning
CN109920001A (en) * 2019-03-14 2019-06-21 大连民族大学 Method for estimating distance based on pedestrian head height
CN110674687A (en) * 2019-08-19 2020-01-10 天津大学 Robust and efficient unmanned pedestrian detection method
CN111027372A (en) * 2019-10-10 2020-04-17 山东工业职业学院 Pedestrian target detection and identification method based on monocular vision and deep learning
CN111145211B (en) * 2019-12-05 2023-06-30 大连民族大学 Method for acquiring pixel height of head of upright pedestrian of monocular camera


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Near infrared nighttime road pedestrians recognition based on convolutional neural network; Xiaobiao Dai et al.; Infrared Physics & Technology; 2019-03-31; vol. 97; full text *
A survey of road detection for driverless vehicles; Shi Chenyang et al.; China Illuminating Engineering Journal; 2018-10-31; vol. 29, no. 5; full text *
A survey of pedestrian tracking algorithms and applications; Cao Ziqiang et al.; Acta Physica Sinica; 2020-03-31; vol. 69, no. 8; full text *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant