CN110751056A - Pedestrian motion prediction method based on improved top-down method multi-person posture detection
- Publication number
- CN110751056A CN110751056A CN201910921085.XA CN201910921085A CN110751056A CN 110751056 A CN110751056 A CN 110751056A CN 201910921085 A CN201910921085 A CN 201910921085A CN 110751056 A CN110751056 A CN 110751056A
- Authority
- CN
- China
- Prior art keywords
- pedestrian
- frame
- bone point
- bone
- posture detection
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
- G06V40/23—Recognition of whole body movements, e.g. for sport training
- G06V40/25—Recognition of walking or running movements, e.g. gait recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention discloses a pedestrian motion prediction method based on improved top-down multi-person posture detection. Images are processed by a spatial transformation network, improved top-down single-person posture detection, and an inverse spatial transformation network to extract human body region frames, bone points, and postures; the bone points and postures are then passed through optical flow processing and a long short-term memory (LSTM) neural network to predict and output the pedestrian's next action.
Description
Technical Field
The invention belongs to the technical field of image processing and relates to a multi-person posture detection method for automatic driver-assistance systems. In particular, it relates to a method that uses improved multi-person posture detection to detect and predict pedestrian motion, improving the accuracy and real-time performance of posture detection when pedestrians overlap.
Background
Pedestrian detection technology is widely applied in the field of advanced driver assistance. In this field, pedestrian detection usually performs well under good environmental conditions and when pedestrians do not overlap; pedestrian detection under complex conditions remains an important research topic. This invention therefore proposes a multi-person posture detection algorithm based on an improved top-down method and applies it to real-time human motion prediction.
Currently there are two mainstream approaches to multi-person pose estimation. The top-down method (two-step framework) first detects a bounding box for each person in the scene and then estimates the pose within each box independently; it depends heavily on the accuracy of the person detector, and redundant detection boxes cause the bounding box of the same person to be estimated repeatedly. The bottom-up method (part-based framework) first detects all body keypoints in the scene and then assembles them into the skeletons of the individual people; because it relies on joining keypoints, wrong connections easily occur when two people are very close to each other.
Conventional single-person pose estimation (SPPE) is highly susceptible to erroneous bounding boxes, and redundant bounding boxes produce redundant poses. Although state-of-the-art pedestrian detectors perform well, small errors in localization and recognition are inevitable, and these errors propagate into the pose estimation, especially in methods that rely solely on the human detection results; such methods cannot meet the requirements of current assisted driving systems.
Disclosure of Invention
The invention aims to solve the problem that existing gesture recognition is inaccurate when multiple people overlap, and provides a pedestrian motion prediction method based on improved top-down multi-person posture detection, so that pose estimation can be performed even when the human bounding box is inaccurate.
The technical scheme adopted by the invention is as follows: a pedestrian motion prediction method based on improved top-down method multi-person posture detection is characterized by comprising the following steps:
step 1: inputting an original multi-person pedestrian video;
step 2: performing pedestrian boundary box SSD processing on the input video to obtain a pedestrian boundary box b;
Step 3: performing spatial-network transformation on the pedestrian boundary frame b obtained in step 2, and extracting a high-quality human body region frame;
Step 4: performing single-person posture detection on each high-quality human body region frame to obtain a bone-point confidence E that still contains redundancy;
Step 5: eliminating the redundancy from the bone points E;
Step 6: mapping the human body region frame obtained in step 3 to original-image coordinates to obtain a high-quality region frame in original-image coordinates;
Step 7: processing the current frame and the previous N frames through steps 1 to 4 to obtain the bone points E(d_i), i = 1, 2, …, N+1, of the N+1 pictures, and drawing the motion track of each bone point;
Step 8: performing optical flow processing on the N+1 pictures to obtain the displacement vector ε(d) of each bone point;
Step 9: connecting the bone point d_j of each frame with the predicted new bone point v of the next frame to obtain a human skeleton diagram;
Step 10: feeding the N+1 consecutive frames of bone points E(d_i) obtained in step 7 and the bone-point displacement offsets ε(d) from the optical flow processing of step 8 into a long short-term memory neural network LSTM to train a model;
Step 11: every M training iterations of step 10, generating a skeleton frame diagram E_f; this E_f is the real-time pedestrian motion prediction.
The method has a high pedestrian posture detection rate and strong adaptability, and can predict a pedestrian's action a preset time ahead (extensive experiments show that the longest undistorted prediction horizon is 0.5 seconds). Specifically:
1) The invention uses improved multi-person posture detection to improve the accuracy of pedestrian posture detection under complicated overlapping conditions, overcoming the poor accuracy of conventional multi-person posture detection.
2) Both the pedestrian frame diagram and the pedestrian bone-point diagram undergo one round of redundancy screening, and the computation time remains within an acceptable range.
3) Building on the high-precision multi-person posture detection, the invention predicts the pedestrian's action a preset time (on the order of seconds) ahead with high accuracy.
Drawings
FIG. 1 is an algorithmic flow diagram of an embodiment of the present invention;
FIG. 2 is a schematic diagram of a spatial transform network processing procedure according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating a result of single person gesture detection processing in an embodiment of the present invention;
FIG. 4 is a schematic diagram of an inverse spatial transform network processing procedure according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating the result of optical flow processing in an embodiment of the invention;
FIG. 6 is a diagram illustrating a predicted posture of a pedestrian 0.5 seconds later in an embodiment of the present invention.
Detailed Description
In order to facilitate the understanding and implementation of the present invention for those of ordinary skill in the art, the present invention is further described in detail with reference to the accompanying drawings and examples, it is to be understood that the embodiments described herein are merely illustrative and explanatory of the present invention and are not restrictive thereof.
Referring to fig. 1, the method for predicting pedestrian movement based on the improved top-down method multi-person posture detection provided by the invention comprises the following steps:
step 1: inputting an original multi-person pedestrian video.
Step 2: performing SSD pedestrian bounding-box processing on the input video to obtain a rough pedestrian boundary box b, computed as follows:

b = (b_cx, b_cy, b_w, b_h) = (d_w·l_cx + d_cx, d_h·l_cy + d_cy, d_w·exp(l_w), d_h·exp(l_h))

where d denotes the position of the prior (default) box, l denotes the predicted position of the bounding box, i_cx and i_cy denote the abscissa and ordinate of the center of box i, and i_w and i_h denote the width and height of box i, where i can be b, d, or l.
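The box-decoding formula above can be sketched in a few lines (a NumPy sketch; the function name and the (center, size) tuple layout are my own illustration, not from the patent):

```python
import numpy as np

def decode_ssd_box(prior, loc):
    """Decode an SSD location prediction l against a prior (default) box d,
    per b = (d_w*l_cx + d_cx, d_h*l_cy + d_cy, d_w*exp(l_w), d_h*exp(l_h)).

    prior = (d_cx, d_cy, d_w, d_h); loc = (l_cx, l_cy, l_w, l_h).
    Returns b = (b_cx, b_cy, b_w, b_h) in the same center/size encoding.
    """
    d_cx, d_cy, d_w, d_h = prior
    l_cx, l_cy, l_w, l_h = loc
    return (d_w * l_cx + d_cx,
            d_h * l_cy + d_cy,
            d_w * np.exp(l_w),
            d_h * np.exp(l_h))
```

A zero offset reproduces the prior box itself, since exp(0) = 1.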
Step 3: as shown in FIG. 2, a high-quality human body region frame is extracted from the rough pedestrian boundary frame b using the Spatial Transformation Network (STN); the coordinates are transformed as

[x_s; y_s] = [θ_1 θ_2 θ_3] · [x_t; y_t; 1]^T

where θ_1, θ_2, θ_3 are vector coefficients reflecting the coordinate relationship of the human body region frame before and after the transformation, and (x_t, y_t) are the coordinates of the region frame after the STN transformation.
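As a sketch, the STN coordinate mapping is a 2×3 affine warp applied to homogeneous coordinates (the function below is illustrative; the patent gives no code, and the θ values would come from the trained network):

```python
import numpy as np

def stn_transform(theta, pts):
    """Apply a 2x3 affine warp [theta1 theta2 theta3] to 2D points.

    theta: (2, 3) array-like; pts: (N, 2) array-like of region-frame coordinates.
    Returns the warped (N, 2) coordinates, as in the patent's STN step.
    """
    theta = np.asarray(theta, dtype=float)
    pts = np.asarray(pts, dtype=float)
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])  # homogeneous coordinates
    return pts_h @ theta.T
```

With the identity θ = [[1, 0, 0], [0, 1, 0]] the coordinates are left unchanged.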
Step 4: as shown in fig. 3, CNN-based single-person pose estimation (SPPE) is applied to each high-quality human body region frame to obtain a bone-point confidence E that still contains redundancy; the higher the confidence, the more likely the point is a correct human bone point. Here d_j1 and d_j2 are the positions of two bone points, L_c is the line segment formed by the two bone points, u ∈ [0, 1] is the intermediate coefficient of the integral, and p(u) is the interpolation between the two bone points d_j1 and d_j2, computed as:

p(u) = (1 − u)·d_j1 + u·d_j2
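The interpolation p(u) can be written directly (a trivial sketch; representing a bone point as an (x, y) tuple is an assumption):

```python
def interpolate_bone_point(d_j1, d_j2, u):
    """p(u) = (1 - u) * d_j1 + u * d_j2: linear interpolation along the
    line segment L_c between two candidate bone points, with u in [0, 1].
    The confidence integral of step 4 samples p(u) along this segment."""
    assert 0.0 <= u <= 1.0
    return tuple((1 - u) * a + u * b for a, b in zip(d_j1, d_j2))
```

For example, u = 0.5 gives the midpoint of the segment.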
and 5: eliminating redundant impurities from bone points E with redundant impurities, and selecting the bone point E with the maximum confidence coefficientmaxFor reference, a threshold is defined η -90% as a criterion for which bone points that are relatively close and similar are eliminated (d)i,dj) And then:
if E (d)i,dj) An output of 1 indicates a bone point diIs redundantMiscellaneous, should be eliminated; if E (d)i,dj) An output of 0 indicates a bone point djAre redundant and should be eliminated.
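A greedy version of this elimination might look as follows. Note the patent only specifies the confidence threshold η = 90%; the pixel-distance criterion `radius` is an assumed stand-in for the "relatively close" test:

```python
import numpy as np

def eliminate_redundant(points, scores, eta=0.9, radius=10.0):
    """Greedy redundancy elimination over candidate bone points.

    Candidates are visited in decreasing confidence (so E_max is kept first);
    a candidate is dropped if it lies within `radius` pixels of an already
    kept point AND its confidence ratio to that point exceeds eta = 90%.
    """
    order = np.argsort(scores)[::-1]          # highest confidence first
    keep = []
    for i in order:
        redundant = any(
            np.linalg.norm(np.asarray(points[i]) - np.asarray(points[j])) < radius
            and scores[i] / scores[j] > eta
            for j in keep)
        if not redundant:
            keep.append(i)
    return [points[i] for i in keep]
```

Two near-coincident candidates with similar confidence collapse to one point, while well-separated points survive.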
Step 6: as shown in FIG. 4, the human body region frame is mapped back to original-image coordinates, i.e., processed with the inverse spatial transformation network (STDN), to obtain a high-quality region frame in original-image coordinates, where

[γ_1 γ_2] = [θ_1 θ_2]^(−1)

γ_3 = −1 × [γ_1 γ_2]·θ_3

Here [θ_1 θ_2] and θ_3 are obtained from the network parameters, where W denotes a matrix formed by the dimensions of the input and output layers of the inverse spatial transformation network, and J(W, b) denotes the position of the pedestrian boundary box b in the inverse spatial transformation network.
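The γ relations above are directly computable once θ is known (a NumPy sketch; `inverse_stn_params` is an illustrative name):

```python
import numpy as np

def inverse_stn_params(theta):
    """Compute the inverse-transform parameters gamma from theta.

    theta is the 2x3 STN matrix [theta1 theta2 | theta3]; per the patent,
        [gamma1 gamma2] = [theta1 theta2]^(-1)
        gamma3 = -[gamma1 gamma2] @ theta3
    so the STDN maps the cropped region frame back to original-image
    coordinates. Returns the 2x3 matrix [gamma1 gamma2 | gamma3].
    """
    theta = np.asarray(theta, dtype=float)
    A, t = theta[:, :2], theta[:, 2]
    A_inv = np.linalg.inv(A)
    return np.hstack([A_inv, (-A_inv @ t)[:, None]])
```

Composing the forward warp with the inverse warp returns every point to its original coordinates.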
Step 7: process 5 consecutive frames (the current frame and the previous 4 frames) through steps 1 to 4 to obtain the bone points E(d_i), i = 1, 2, 3, 4, 5, of the 5 pictures, and draw the motion track of each bone point.
Step 8: as shown in fig. 5, perform optical flow processing on the 5 pictures to obtain the displacement vector ε(d) of each bone point, as follows:

v = u + d = [u_x + d_x, u_y + d_y]^T

where v is the new position of the bone point in the next frame, u is the current position of the bone point with coordinates (u_x, u_y), d_x and d_y are the displacements from the bone point to the next-frame bone point, I(x, y) denotes the pedestrian boundary box containing the current bone point, J(x + d_x, y + d_y) denotes the pedestrian boundary box containing the next-frame bone point, and w_x, w_y are two constants defining the integration window of the optical flow method; their size determines the time complexity and the quality of the result: in general, a larger window gives a more robust estimate but a higher time complexity.
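A one-iteration Lucas-Kanade style solve over the integration window might be sketched as below (an assumption: the patent does not name the exact optical-flow variant, and this sketch omits pyramids and iterative refinement):

```python
import numpy as np

def lucas_kanade_displacement(I, J, u, wx=3, wy=3):
    """Estimate the displacement d of point u between grayscale frames I
    and J with a single least-squares step over a (2*wx+1) x (2*wy+1)
    window, so that v = u + d as in the patent.
    """
    I = np.asarray(I, dtype=float)
    J = np.asarray(J, dtype=float)
    x, y = u
    ys = slice(y - wy, y + wy + 1)
    xs = slice(x - wx, x + wx + 1)
    # Spatial gradients of I; temporal difference I - J over the window.
    Iy, Ix = np.gradient(I)
    A = np.stack([Ix[ys, xs].ravel(), Iy[ys, xs].ravel()], axis=1)
    b = (I[ys, xs] - J[ys, xs]).ravel()
    # Least-squares solve of A @ d = b for d = (dx, dy).
    d, *_ = np.linalg.lstsq(A, b, rcond=None)
    return d
```

Shifting a horizontal intensity ramp one pixel to the right yields d ≈ (1, 0).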
Step 9: connect the bone point d_j of each frame with the predicted new bone point v of the next frame to obtain a human skeleton diagram.
Step 10: feed the five consecutive frames of bone points E(d_i) from step 7 and the bone-point displacement offsets ε(d) from the optical flow processing of step 8 into a long short-term memory neural network (LSTM) to train a model.
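A minimal NumPy LSTM cell unrolled over per-frame feature vectors illustrates the shape of this training input (a sketch with random untrained weights; `run_sequence` and the gate layout are illustrative, not the patent's trained model):

```python
import numpy as np

def lstm_cell_step(x, h, c, W, U, b):
    """One step of a standard LSTM cell in plain NumPy, the kind of
    recurrent unit the patent trains on sequences of bone points E(d_i)
    and optical-flow offsets eps(d). Gate order in W/U/b: input, forget,
    output, candidate; H is the hidden size."""
    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))
    H = h.shape[0]
    z = W @ x + U @ h + b                 # pre-activations, shape (4H,)
    i = sigmoid(z[0:H])                   # input gate
    f = sigmoid(z[H:2 * H])               # forget gate
    o = sigmoid(z[2 * H:3 * H])           # output gate
    g = np.tanh(z[3 * H:])                # candidate cell state
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new

def run_sequence(seq, H=8, seed=0):
    """Unroll the cell over a (frames, features) sequence, e.g. flattened
    bone-point coordinates concatenated with displacement offsets."""
    rng = np.random.default_rng(seed)
    D = seq.shape[1]
    W = rng.normal(0.0, 0.1, (4 * H, D))
    U = rng.normal(0.0, 0.1, (4 * H, H))
    b = np.zeros(4 * H)
    h, c = np.zeros(H), np.zeros(H)
    for x in seq:
        h, c = lstm_cell_step(x, h, c, W, U, b)
    return h  # final hidden state summarizing the N+1 frames
```

In a real pipeline a prediction head on the final hidden state would output the next-frame skeleton; training the weights is beyond this sketch.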
Step 11: as shown in FIG. 6, every 6 training iterations of step 10, i.e., every 30 frames and 0.5 seconds ahead, a skeleton frame diagram E_f is generated; this E_f is the real-time pedestrian motion prediction.
It should be understood that parts of the specification not set forth in detail are prior art; the above description of the preferred embodiments is intended to be illustrative, and not to be construed as limiting the scope of the invention, which is defined by the appended claims, and all changes and modifications that fall within the metes and bounds of the claims, or equivalences of such metes and bounds are therefore intended to be embraced by the appended claims.
Claims (7)
1. A pedestrian motion prediction method based on improved top-down method multi-person posture detection is characterized by comprising the following steps:
step 1: inputting an original multi-person pedestrian video;
step 2: performing pedestrian boundary box SSD processing on the input video to obtain a pedestrian boundary box b;
Step 3: performing spatial-network transformation on the pedestrian boundary frame b obtained in step 2, and extracting a high-quality human body region frame;
Step 4: performing single-person posture detection on each high-quality human body region frame to obtain a bone-point confidence E that still contains redundancy;
Step 5: eliminating the redundancy from the bone points E;
Step 6: mapping the human body region frame obtained in step 3 to original-image coordinates to obtain a high-quality region frame in original-image coordinates;
Step 7: processing the current frame and the previous N frames through steps 1 to 4 to obtain the bone points E(d_i), i = 1, 2, …, N+1, of the N+1 pictures, and drawing the motion track of each bone point;
Step 8: performing optical flow processing on the N+1 pictures to obtain the displacement vector ε(d) of each bone point;
Step 9: connecting the bone point d_j of each frame with the predicted new bone point v of the next frame to obtain a human skeleton diagram;
Step 10: feeding the N+1 consecutive frames of bone points E(d_i) obtained in step 7 and the bone-point displacement offsets ε(d) from the optical flow processing of step 8 into a long short-term memory neural network LSTM to train a model;
Step 11: every M training iterations of step 10, generating a skeleton frame diagram E_f, said E_f being the real-time pedestrian motion prediction.
2. The method for predicting the pedestrian motion based on the multi-person posture detection of the improved top-down method according to claim 1, wherein the concrete implementation process of the step 2 is as follows:
assuming that a pedestrian boundary box is obtained as b, then:
b = (b_cx, b_cy, b_w, b_h) = (d_w·l_cx + d_cx, d_h·l_cy + d_cy, d_w·exp(l_w), d_h·exp(l_h))

where d denotes the position of the prior (default) box, l denotes the predicted position of the bounding box, i_cx and i_cy denote the abscissa and ordinate of the center of box i, and i_w and i_h denote the width and height of box i, where i can be b, d, or l.
3. The method for predicting the pedestrian motion based on the multi-person posture detection of the improved top-down method according to claim 1, wherein the concrete implementation process of step 3 is as follows: a high-quality human body region frame is extracted from the pedestrian boundary frame b by the spatial transformation network STN.
4. The method for predicting the pedestrian motion based on the multi-person posture detection of the improved top-down method according to claim 1, wherein the specific implementation process of step 4 is as follows: CNN single-person pose detection SPPE is applied to each high-quality human body region frame to obtain a bone-point confidence E containing redundancy;
where d_j1 and d_j2 are the positions of two bone points, L_c is the line segment formed by the two bone points, u ∈ [0, 1] is the intermediate coefficient of the integral, and p(u) is the interpolation between the two bone points d_j1 and d_j2, computed as:

p(u) = (1 − u)·d_j1 + u·d_j2.
5. The method for predicting the pedestrian motion based on the multi-person posture detection of the improved top-down method according to claim 1, wherein the concrete implementation process of step 5 is as follows: the bone point with maximum confidence E_max is selected as reference, and η is defined as the standard threshold; then:
if E(d_i, d_j) outputs 1, the bone point d_i is redundant and is eliminated; if E(d_i, d_j) outputs 0, the bone point d_j is redundant and is eliminated.
6. The method for predicting the pedestrian motion based on the multi-person posture detection of the improved top-down method according to claim 1, wherein the specific implementation process of step 6 is as follows: the human body region frame is mapped to original-image coordinates, i.e., processed by the inverse spatial transformation network STDN, to obtain a high-quality region frame in original-image coordinates, where

[γ_1 γ_2] = [θ_1 θ_2]^(−1)

γ_3 = −1 × [γ_1 γ_2]·θ_3

and θ_1, θ_2, θ_3 are vector coefficients reflecting the coordinate relationship of the human body region frame before and after transformation, [γ_1 γ_2 γ_3] being the corresponding inverse-transformation coefficients of [θ_1 θ_2 θ_3]; [θ_1 θ_2] and θ_3 are obtained from the network parameters, where W denotes a matrix formed by the dimensions of the input and output layers of the inverse spatial transformation network, and J(W, b) denotes the position of the pedestrian boundary box b in the inverse spatial transformation network.
7. The method for predicting the pedestrian motion based on the multi-person gesture detection of the improved top-down method according to claim 1, wherein in the step 8:
v = u + d = [u_x + d_x, u_y + d_y]^T

where v is the new position of the bone point in the next frame, u is the current position of the bone point with coordinates (u_x, u_y), d_x and d_y are the displacements from the bone point to the next-frame bone point, I(x, y) denotes the pedestrian boundary box containing the current bone point, J(x + d_x, y + d_y) denotes the pedestrian boundary box containing the next-frame bone point, and w_x, w_y are two constants defining the integration window of the optical flow method, whose size determines the time complexity and the quality of the result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910921085.XA CN110751056B (en) | 2019-09-27 | 2019-09-27 | Pedestrian motion prediction method based on improved top-down method multi-person gesture detection |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110751056A true CN110751056A (en) | 2020-02-04 |
CN110751056B CN110751056B (en) | 2023-05-23 |
Family
ID=69277188
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910921085.XA Active CN110751056B (en) | 2019-09-27 | 2019-09-27 | Pedestrian motion prediction method based on improved top-down method multi-person gesture detection |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110751056B (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112464740A (en) * | 2020-11-05 | 2021-03-09 | 北京科技大学 | Image processing method and system for top-down gesture recognition process |
CN112560700A (en) * | 2020-12-17 | 2021-03-26 | 北京赢识科技有限公司 | Information association method and device based on motion analysis and electronic equipment |
CN117115861A (en) * | 2023-10-19 | 2023-11-24 | 四川弘和数智集团有限公司 | Glove detection method and device, electronic equipment and storage medium |
CN117456612A (en) * | 2023-12-26 | 2024-01-26 | 西安龙南铭科技有限公司 | Cloud computing-based body posture automatic assessment method and system |
- 2019-09-27: application CN201910921085.XA filed; granted as patent CN110751056B (status: active)
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110013840A1 (en) * | 2008-03-14 | 2011-01-20 | Masahiro Iwasaki | Image processing method and image processing apparatus |
CN101960490A (en) * | 2008-03-14 | 2011-01-26 | 松下电器产业株式会社 | Image processing method and image processing apparatus |
US20170316578A1 (en) * | 2016-04-29 | 2017-11-02 | Ecole Polytechnique Federale De Lausanne (Epfl) | Method, System and Device for Direct Prediction of 3D Body Poses from Motion Compensated Sequence |
US20190206066A1 (en) * | 2017-12-29 | 2019-07-04 | RetailNext, Inc. | Human Analytics Using Fusion Of Image & Depth Modalities |
CN109345504A (en) * | 2018-08-07 | 2019-02-15 | 浙江大学 | A kind of bottom-up more people's Attitude estimation methods constrained using bounding box |
CN109815901A (en) * | 2019-01-24 | 2019-05-28 | 杭州电子科技大学 | A kind of more people's Attitude estimation methods based on YOLOv3 algorithm |
CN110135375A (en) * | 2019-05-20 | 2019-08-16 | 中国科学院宁波材料技术与工程研究所 | More people's Attitude estimation methods based on global information integration |
Non-Patent Citations (4)
Title |
---|
关韬 (Guan Tao): "Research on pedestrian pose recognition and behavior prediction algorithms based on deep learning" *
许忠雄; 张睿哲; 石晓军; 岳贵杰; 刘弋锋 (Xu Zhongxiong et al.): "Real-time multi-person pose estimation and tracking with deep learning" *
邓益侬; 罗健欣; 金凤林 (Deng Yinong et al.): "A survey of deep-learning-based human pose estimation methods" *
闫芬婷; 王鹏; 吕志刚; 丁哲; 乔梦雨 (Yan Fenting et al.): "A video-based real-time multi-person pose estimation method" *
Also Published As
Publication number | Publication date |
---|---|
CN110751056B (en) | 2023-05-23 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||