CN110516613A - A Pedestrian Trajectory Prediction Method from the First View - Google Patents

A Pedestrian Trajectory Prediction Method from the First View

Info

Publication number
CN110516613A
CN110516613A
Authority
CN
China
Prior art keywords
layer
motion
camera
future
pedestrian
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910807214.2A
Other languages
Chinese (zh)
Other versions
CN110516613B (en)
Inventor
刘洪波
李伯林
江同棒
张博
汪大峰
戴光耀
李科
林正奎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian Maritime University
Original Assignee
Dalian Maritime University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian Maritime University filed Critical Dalian Maritime University
Priority to CN201910807214.2A priority Critical patent/CN110516613B/en
Publication of CN110516613A publication Critical patent/CN110516613A/en
Application granted granted Critical
Publication of CN110516613B publication Critical patent/CN110516613B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)
  • Traffic Control Systems (AREA)

Abstract

本发明公开了一种第一视角下的行人轨迹预测方法,采用编解码结构结合循环卷积网络来预测第一视角下行人轨迹策略。原始图像经过编码得到的行人轨迹信息的特征向量,然后进行解码特征向量,预测出未来的行人的轨迹信息。在公共数据集和自己采集到的数据集里,本发明都会准确的预测出多个行人的未来10帧的轨迹信息,最终预测轨迹和最终实际轨迹之间的L2距离误差提高到40,比现有方法提高了30个像素精度。本发明提出了预测行人轨迹的时空卷积循环网络方法,利用一维卷积进行编解码处理,通过时空卷积网络预测,在目前的相关方法中,实现较简单、数据获取和处理清晰、简洁,实用性强。

The invention discloses a pedestrian trajectory prediction method from the first-person perspective, which adopts an encoder-decoder structure combined with a recurrent convolutional network to predict pedestrian trajectories in first-person view. The original images are encoded into a feature vector of pedestrian trajectory information, and this feature vector is then decoded to predict the future trajectories of the pedestrians. On both public datasets and a dataset collected by the inventors, the invention accurately predicts the trajectories of multiple pedestrians over the next 10 frames; the L2 distance error between the final predicted trajectory and the final actual trajectory improves to 40, about 30 pixels more accurate than existing methods. The invention proposes a spatio-temporal convolutional recurrent network method for predicting pedestrian trajectories, which uses one-dimensional convolutions for encoding and decoding and makes its predictions through the spatio-temporal convolutional network. Among current related methods it is comparatively simple to implement, its data acquisition and processing are clear and concise, and it is highly practical.

Description

一种第一视角下的行人轨迹预测方法A Pedestrian Trajectory Prediction Method from the First View

技术领域Technical Field

本发明涉及一种行人轨迹预测方法,特别是一种第一视角下的行人轨迹预测方法。The invention relates to a method for predicting pedestrian trajectories, in particular to a method for predicting pedestrian trajectories from a first perspective.

背景技术Background Art

在自动驾驶和机器人技术蓬勃发展的今天，通过车载摄像头获取车辆周围环境信息，预测视频中行人轨迹信息，控制车辆驾驶行为，做出更加合理的路径规划以进行障碍物、行人规避，是一个十分重要的任务。Today, with autonomous driving and robotics developing rapidly, it is a very important task to obtain information about the vehicle's surroundings through the on-board camera, predict the trajectories of pedestrians in the video, control the vehicle's driving behaviour, and plan more reasonable paths to avoid obstacles and pedestrians.

非第一视角，如监控摄像头下行人的轨迹预测不必考虑摄像头自身的运动对于行人轨迹预测的影响，比如监控摄像头里行人检测框在视频中越来越大表明行人走向摄像头。但第一视角区别监控视频等固定拍摄的视频，机器人或拍摄者自身的运动，直接影响视频中行人信息的获取和预测。第一视角属于运动视角，拍摄者自身也在运动，这种运动会影响对于行人的未来行为的判断，比如在第一视角下，行人越来越大，那么就不能确定行人是向摄像头运动还是摄像头在靠近行人，行人轨迹预测也是不准确的。From a non-first-person perspective, such as a fixed surveillance camera, pedestrian trajectory prediction does not need to account for the influence of the camera's own motion: for example, a pedestrian detection frame growing larger in surveillance footage simply indicates that the pedestrian is walking toward the camera. The first-person perspective, however, differs from fixed-shot video such as surveillance footage, because the motion of the robot or of the person filming directly affects how pedestrian information in the video is acquired and predicted. The first-person perspective is a moving viewpoint: the camera wearer is also in motion, and this motion interferes with judging a pedestrian's future behaviour. For instance, if a pedestrian appears larger and larger in a first-person view, one cannot tell whether the pedestrian is moving toward the camera or the camera is approaching the pedestrian, so the pedestrian trajectory prediction becomes inaccurate.

发明内容Summary of the Invention

为解决现有技术存在的上述问题,本发明要提出一种能提升第一视角下行人的轨迹预测精度的第一视角下的行人轨迹预测方法。In order to solve the above-mentioned problems in the prior art, the present invention proposes a pedestrian trajectory prediction method in the first perspective that can improve the trajectory prediction accuracy of pedestrians in the first perspective.

本发明的思路是：基于编解码结构的模型，通过引入行人位置信息、自我运动历史信息和循环卷积网络预测行人未来的轨迹，并通过加入自我运动的未来信息来有效的提升视频中行人未来轨迹预测的精度。The idea of the present invention is: based on an encoder-decoder model, the future trajectory of pedestrians is predicted by introducing pedestrian position information, ego-motion history information and a recurrent convolutional network, and the accuracy of predicting future pedestrian trajectories in the video is effectively improved by adding the future information of the ego-motion.

为了实现上述目的,本发明的技术方案如下:一种第一视角下的行人轨迹预测方法,包括以下步骤:In order to achieve the above object, the technical solution of the present invention is as follows: a pedestrian trajectory prediction method under the first viewing angle, comprising the following steps:

A、网络编码器编码得到轨迹特征A. Network encoder encoding to obtain trajectory features

A1、行人头部佩戴或手持运动摄像头,实时获取第一视角下的录取的视频;A1. Pedestrians wear or hold motion cameras on their heads to obtain recorded videos from the first perspective in real time;

A2、把视频按照每秒k帧的帧率分成若干幅图像,k的范围是5~20;A2. Divide the video into several images at a frame rate of k frames per second, where k ranges from 5 to 20;
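For illustration, a minimal Python sketch of this frame-rate down-sampling step, assuming OpenCV is available and the recording is a local video file (the function name sample_frames is hypothetical):

```python
import cv2

def sample_frames(video_path, k=10):
    """Read a first-person video and keep roughly k frames per second."""
    cap = cv2.VideoCapture(video_path)
    src_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0   # fall back to 30 fps if unknown
    step = max(int(round(src_fps / k)), 1)        # keep every `step`-th frame
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames
```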

A3、通过处理器处理步骤A2中分好的图像，经过以下步骤获取行人位置特征向量：A3. The processor processes the images obtained in step A2 and derives the pedestrian position feature vector through the following steps:

A31、通过标记工具对图像中行人进行标记,标记出行人检测框;A31. Use the marking tool to mark the pedestrians in the image, and mark the pedestrian detection frame;

A32、通过时间窗口采样算法对步骤A31中标记的行人检测框进行矫正。由于图像空间中坐标原点在图像的左上角，横轴坐标x值从左向右递增，纵轴坐标y值从上到下递增，所以取行人检测框左上角位置信息(x_i^min, y_i^min)^T及右下角位置信息(x_i^max, y_i^max)^T作为行人轨迹数据。将连续n帧所包括的所有行人的轨迹序列作为一组训练样本，n的范围是10~20，每个行人的训练样本记作L_in：A32. Correct the pedestrian detection frames marked in step A31 with a time-window sampling algorithm. Since the coordinate origin of the image space is at the upper-left corner of the image, with the horizontal x value increasing from left to right and the vertical y value increasing from top to bottom, the upper-left corner position (x_i^min, y_i^min)^T and the lower-right corner position (x_i^max, y_i^max)^T of the pedestrian detection frame are taken as the pedestrian trajectory data. The trajectory sequences of all pedestrians contained in n consecutive frames form one group of training samples, with n in the range 10-20; the training sample of each pedestrian is denoted L_in:

其中，l_i = (x_i^min, y_i^min, x_i^max, y_i^max) ∈ R^4，i的取值范围是t_current−T_his~t_current。t_current为当前时刻，T_his表示历史帧范围，T_his取值为5~20。Wherein, l_i = (x_i^min, y_i^min, x_i^max, y_i^max) ∈ R^4, and i ranges from t_current−T_his to t_current. t_current is the current moment, T_his denotes the historical frame range, and T_his takes a value of 5-20.
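A short NumPy sketch of assembling the sliding-window training samples L_in described in step A32, assuming the per-frame boxes of one pedestrian are already available as an array (the helper name make_samples is illustrative):

```python
import numpy as np

def make_samples(boxes, T_his=10):
    """boxes: array of shape (num_frames, 4) holding
    (x_min, y_min, x_max, y_max) for one pedestrian per frame.
    Returns sliding windows of length T_his, one training sample L_in per window."""
    samples = []
    for t in range(T_his, len(boxes) + 1):
        samples.append(boxes[t - T_his:t])        # frames t-T_his .. t-1
    return np.stack(samples) if samples else np.empty((0, T_his, 4))

# e.g. samples = make_samples(boxes, T_his=10)  -> shape (num_samples, T_his, 4)
```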

A33、构建行人位置特征提取卷积网络，对行人位置和检测框大小处理得到行人位置特征向量L_in^F：A33. Construct a pedestrian position feature extraction convolutional network and process the pedestrian positions and detection-frame sizes to obtain the pedestrian position feature vector L_in^F:

L_in^F = (lf_1, ..., lf_m),

其中，lf_i表示行人位置特征的第i个特征值。where lf_i denotes the i-th feature value of the pedestrian position feature.

所述行人位置特征提取卷积网络结构采用4层结构，先把输入数据L_in输入第一层Conv1d一维卷积层，第一层Conv1d一维卷积层输出结果输入第二层Conv1d一维卷积层，第二层Conv1d一维卷积层输出结果输入第三层Conv1d一维卷积层，第三层Conv1d一维卷积层输出结果输入第四层Conv1d一维卷积层，第四层Conv1d一维卷积层输出结果得到特征向量L_in^F；每层的输出结果都进行BN批量归一化处理并经过Relu激活函数激活。The pedestrian position feature extraction convolutional network has a 4-layer structure: the input data L_in is fed into the first Conv1d one-dimensional convolutional layer, the output of the first Conv1d layer is fed into the second Conv1d layer, the output of the second Conv1d layer is fed into the third Conv1d layer, and the output of the third Conv1d layer is fed into the fourth Conv1d layer, whose output gives the feature vector L_in^F. The output of every layer is batch-normalized (BN) and activated by the ReLU activation function.
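A minimal PyTorch sketch of such a four-layer Conv1d encoder with batch normalization and ReLU after every layer; the channel widths and kernel sizes are not specified in the patent and are assumed here:

```python
import torch
import torch.nn as nn

class TrajEncoder(nn.Module):
    """Four stacked Conv1d layers, each followed by BatchNorm and ReLU,
    mapping the (x_min, y_min, x_max, y_max) sequence to a feature map."""
    def __init__(self, in_ch=4, hid=32, out_ch=64):
        super().__init__()
        layers = []
        chans = [in_ch, hid, hid, hid, out_ch]          # illustrative channel sizes
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            layers += [nn.Conv1d(c_in, c_out, kernel_size=3, padding=1),
                       nn.BatchNorm1d(c_out),
                       nn.ReLU(inplace=True)]
        self.net = nn.Sequential(*layers)

    def forward(self, x):          # x: (batch, 4, T_his)
        return self.net(x)         # (batch, out_ch, T_his)
```

The same architecture can be reused for the ego-motion history and future ego-motion feature networks of steps A42 and A62, only with a 6-channel input.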

A4、获取摄像头自我运动历史特征向量;A4. Obtain the camera self-motion history feature vector;

A41、通过Structure From Motion算法获得当前帧图像相对于前一帧图像的摄像头自我运动信息。所述摄像头自我运动信息包括摄像头自身旋转信息的欧拉角r_t ∈ R^3和速度信息v_t ∈ R^3。所述欧拉角包括偏航角ψ、滚转角Φ和俯仰角θ，所述速度信息包括摄像头即时速度在三维坐标轴上的投影v_x, v_y, v_z。摄像头自我运动历史特征向量记作E_H；A41. Obtain the camera self-motion information of the current frame relative to the previous frame through the Structure From Motion algorithm. The camera self-motion information includes the Euler angles r_t ∈ R^3 of the camera's own rotation and the velocity information v_t ∈ R^3. The Euler angles comprise the yaw angle ψ, roll angle Φ and pitch angle θ, and the velocity information comprises the projections v_x, v_y, v_z of the camera's instantaneous velocity on the three coordinate axes. The camera self-motion history feature vector is denoted E_H;

其中，e_t = (r_t^T, v_t^T)^T ∈ R^6，t的取值范围是t_current−T_his~t_current。Wherein, e_t = (r_t^T, v_t^T)^T ∈ R^6, and t ranges from t_current−T_his to t_current.
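A hedged sketch of assembling e_t from the relative camera pose returned by an SfM / visual-odometry backend (the backend itself, the Euler-angle axis order, and the frame interval dt are assumptions, not taken from the patent):

```python
import numpy as np
from scipy.spatial.transform import Rotation

def ego_motion_vector(R_rel, t_rel, dt):
    """Build e_t = (psi, phi, theta, vx, vy, vz) from the relative pose
    (R_rel, t_rel) between two consecutive frames, as produced by an
    SfM / visual-odometry backend (the backend is assumed to exist)."""
    # axis convention "zxy" is an assumption: yaw about z, roll about x, pitch about y
    yaw, roll, pitch = Rotation.from_matrix(R_rel).as_euler("zxy")
    v = np.asarray(t_rel, dtype=float) / dt            # instantaneous velocity vx, vy, vz
    return np.concatenate([[yaw, roll, pitch], v])      # shape (6,)
```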

A42、构建4层摄像头自我运动历史特征提取卷积网络，提取摄像头自我运动历史信息E_H的特征，得到摄像头自我运动历史特征向量E_H^F；A42. Construct a 4-layer convolutional network for camera self-motion history feature extraction, extract the features of the camera self-motion history information E_H, and obtain the camera self-motion history feature vector E_H^F;

E_H^F = (ef_1, ..., ef_n)

其中，ef_i表示摄像头自我运动历史特征的第i个特征值。where ef_i denotes the i-th feature value of the camera self-motion history feature.

所述摄像头自我运动历史特征提取卷积网络结构采用行人位置特征提取卷积网络相同的结构。The convolutional network structure for extracting self-motion history features of the camera adopts the same structure as the pedestrian position feature extraction convolutional network.

A5、将L_in^F和E_H^F两个向量首尾相连连接在一起，得到特征向量LE^F：A5. Concatenate the two vectors L_in^F and E_H^F end to end to obtain the feature vector LE^F:

LE^F = (lf_1, ..., lf_m, ef_1, ..., ef_n)

A6、获取摄像头自我运动未来特征向量;A6. Obtain the future feature vector of the camera's self-motion;

A61、采用与步骤A41相同的方法，通过Structure From Motion算法得到运动摄像头自我运动未来的T_future帧运动信息，范围是5~20，表示运动摄像头未来打算去往何方，记作E_Fur；A61. Using the same method as step A41, obtain through the Structure From Motion algorithm the motion information of the moving camera's own future T_future frames (T_future ranges from 5 to 20), which indicates where the moving camera intends to go in the future, and denote it E_Fur;

A62、构建摄像头自我运动未来特征提取卷积网络CNN，提取摄像头未来自我运动信息E_Fur的特征，得到摄像头自我运动未来特征向量E_Fur^F：A62. Construct a convolutional network CNN for camera self-motion future feature extraction, extract the features of the camera's future self-motion information E_Fur, and obtain the camera self-motion future feature vector E_Fur^F:

E_Fur^F = (eff_1, ..., eff_n)

其中，eff_i表示摄像头自我运动未来特征的第i个特征值。where eff_i denotes the i-th feature value of the camera self-motion future feature.

所述摄像头未来自我运动特征提取卷积网络结构与步骤A42的摄像头自我运动历史特征提取卷积网络结构相同。The structure of the convolutional network for extracting the camera's future ego-motion features is the same as the convolutional network structure for extracting the camera's ego-motion history features in step A42.

B、网络解码器解码预测行人未来轨迹B. The network decoder decodes and predicts the future trajectory of pedestrians

在网络解码器结构中对网络编码器输出的行人位置信息特征和自我运动信息特征进行解码,通过反卷积网络得到行人未来轨迹。为了提高预测精度,加入能代表未来走向的运动摄像头自身的未来运动信息,具体步骤如下:In the network decoder structure, the pedestrian position information features and self-motion information features output by the network encoder are decoded, and the future trajectory of pedestrians is obtained through the deconvolution network. In order to improve the prediction accuracy, the future motion information of the motion camera itself that can represent the future trend is added. The specific steps are as follows:

B1、构建标准循环神经网络RNN,单元数为n;B1. Construct a standard recurrent neural network RNN with n units;

B2、将特征向量LE^F和E_Fur^F作为循环神经网络RNN的两个输入，得到网络的输出为预测序列L_out。B2. Take the feature vectors LE^F and E_Fur^F as the two inputs of the recurrent neural network RNN; the output of the network is the prediction sequence L_out.
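A minimal PyTorch sketch of step B2. The patent does not spell out how the two inputs are wired into the RNN; one plausible reading, shown here, is to repeat LE^F over the future horizon and concatenate it with the future ego-motion feature at every time step:

```python
import torch
import torch.nn as nn

class TrajRNN(nn.Module):
    """Standard RNN that fuses the encoded history feature LE^F with the
    future ego-motion feature E_Fur^F.  The exact wiring of the two inputs
    is an assumption; here they are concatenated per future time step."""
    def __init__(self, le_dim, ef_dim, hidden=128):
        super().__init__()
        self.rnn = nn.RNN(le_dim + ef_dim, hidden, batch_first=True)

    def forward(self, le_f, e_fur_f):
        # le_f:    (batch, T_future, le_dim)  history feature repeated per step
        # e_fur_f: (batch, T_future, ef_dim)  future ego-motion feature
        x = torch.cat([le_f, e_fur_f], dim=-1)
        out, _ = self.rnn(x)                  # L_out: (batch, T_future, hidden)
        return out
```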

B3、构建解码轨迹序列反卷积网络;B3. Construct a deconvolution network for decoding trajectory sequences;

解码轨迹序列反卷积网络结构采用4层结构，先将预测序列L_out输入第一层Conv1d一维卷积层，第一层Conv1d一维卷积层输出结果输入第二层Conv1d一维卷积层，第二层Conv1d一维卷积层输出结果输入第三层Conv1d一维卷积层，前三层的输出结果都是先进行BN批量归一化处理并经过Relu激活函数激活。最后把第三层Conv1d一维卷积层输出结果输入第四层Conv1d一维卷积层。The deconvolution network for decoding the trajectory sequence has a 4-layer structure: the prediction sequence L_out is fed into the first Conv1d one-dimensional convolutional layer, the output of the first Conv1d layer is fed into the second Conv1d layer, and the output of the second Conv1d layer is fed into the third Conv1d layer; the outputs of the first three layers are each batch-normalized (BN) and activated by the ReLU activation function. Finally, the output of the third Conv1d layer is fed into the fourth Conv1d layer.
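A PyTorch sketch of this four-layer decoder, with BN + ReLU after the first three layers and a plain final layer; channel sizes are assumed:

```python
import torch.nn as nn

class TrajDecoder(nn.Module):
    """Decoder with four Conv1d layers; BN + ReLU after the first three,
    a plain final layer that outputs the 4 box coordinates per future frame."""
    def __init__(self, in_ch=128, hid=64, out_ch=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_ch, hid, 3, padding=1), nn.BatchNorm1d(hid), nn.ReLU(inplace=True),
            nn.Conv1d(hid, hid, 3, padding=1),   nn.BatchNorm1d(hid), nn.ReLU(inplace=True),
            nn.Conv1d(hid, hid, 3, padding=1),   nn.BatchNorm1d(hid), nn.ReLU(inplace=True),
            nn.Conv1d(hid, out_ch, 3, padding=1),
        )

    def forward(self, l_out):      # l_out: (batch, in_ch, T_future)
        return self.net(l_out)     # L_pre: (batch, 4, T_future)
```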

B4、对预测序列L_out输入步骤B3构建好的反卷积网络，得到行人未来的检测框大小信息和轨迹信息L_pre：B4. Feed the prediction sequence L_out into the deconvolution network constructed in step B3 to obtain the pedestrian's future detection-frame size and trajectory information L_pre:

其中，l_i = (x_i^min, y_i^min, x_i^max, y_i^max) ∈ R^4，i的取值范围是t_current+1~t_current+T_future，t_current为当前时刻，T_future为预测的未来帧数，范围是5~20。Wherein, l_i = (x_i^min, y_i^min, x_i^max, y_i^max) ∈ R^4, i ranges from t_current+1 to t_current+T_future, t_current is the current moment, and T_future is the number of predicted future frames, in the range 5-20.

与现有技术相比,本发明具有以下有益效果:Compared with the prior art, the present invention has the following beneficial effects:

1、本发明采用编解码结构结合循环卷积网络来预测第一视角下行人轨迹策略。原始图像经过步骤A编码得到的行人轨迹信息的特征向量，在步骤B中进行解码特征向量，预测出未来的行人的轨迹信息。在公共数据集和自己采集到的数据集里，本发明都会准确的预测出多个行人的未来10帧的轨迹信息，最终预测轨迹和最终实际轨迹之间的L2距离误差提高到40，比现有方法提高了30个像素精度。1. The present invention uses an encoder-decoder structure combined with a recurrent convolutional network to predict pedestrian trajectories in the first-person view. The feature vector of pedestrian trajectory information obtained by encoding the original images in step A is decoded in step B to predict the future trajectory information of the pedestrians. On both public datasets and a self-collected dataset, the invention accurately predicts the trajectories of multiple pedestrians over the next 10 frames; the L2 distance error between the final predicted trajectory and the final actual trajectory improves to 40, about 30 pixels more accurate than existing methods.
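For clarity, the reported error can be read as the pixel L2 distance between the predicted and ground-truth boxes at the last predicted frame; a small sketch of that metric (whether the patent measures box centres or corners is not stated, so centres are an assumption):

```python
import numpy as np

def final_l2_error(pred_boxes, true_boxes):
    """L2 distance (pixels) between the centres of the predicted and
    ground-truth box at the last predicted frame."""
    def centre(b):                      # b = (x_min, y_min, x_max, y_max)
        return np.array([(b[0] + b[2]) / 2.0, (b[1] + b[3]) / 2.0])
    return float(np.linalg.norm(centre(pred_boxes[-1]) - centre(true_boxes[-1])))
```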

2、本发明提出了预测行人轨迹的时空卷积循环网络方法，利用一维卷积进行编解码处理，通过时空卷积网络预测，在目前的相关方法中，实现较简单、数据获取和处理清晰、简洁，实用性强。2. The present invention proposes a spatio-temporal convolutional recurrent network method for predicting pedestrian trajectories, which uses one-dimensional convolutions for encoding and decoding and makes its predictions through the spatio-temporal convolutional network. Among current related methods, it is comparatively simple to implement, its data acquisition and processing are clear and concise, and it is highly practical.

附图说明Description of the Drawings

本发明共有附图6张,其中:The present invention has 6 accompanying drawings, wherein:

图1是标记工具标记后的图像。Figure 1 is the image after marking with the marking tool.

图2是标记了t0时刻行人的历史和未来位置和检测框信息。Figure 2 marks the pedestrian's historical and future positions and detection-frame information at time t_0.

图3是t0+10时刻行人的历史和未来位置和检测框信息。Figure 3 shows the pedestrian's historical and future positions and detection-frame information at time t_0+10.

图4是本发明的流程图。Fig. 4 is a flowchart of the present invention.

图5是本发明中使用的卷积网络结构图。FIG. 5 is a structural diagram of a convolutional network used in the present invention.

图6是本发明中使用的反卷积网络结构图。Fig. 6 is a structure diagram of a deconvolution network used in the present invention.

具体实施方式Detailed Description of the Embodiments

下面结合附图对本发明进行进一步地描述。按照图4所示的流程对第一视角图像进行计算，首先用运动摄像头在运动中获取摄像头图像，将n帧图片作为第一视角行人预测的原始图像。按照本发明的步骤A31对原始图像进行处理得到标记后的图像，如图1所示。此处，根据标记工具的精度需要对标记结果进行矫正。The present invention is further described below with reference to the accompanying drawings. The first-person-view images are processed according to the flow shown in Figure 4: first, a moving camera captures images while in motion, and n frames are taken as the original images for first-person-view pedestrian prediction. The original images are processed according to step A31 of the present invention to obtain the marked images, as shown in Figure 1. Here, the marking results need to be corrected according to the precision of the marking tool.

按照本发明的步骤A、B得到轨迹预测结果。为了直观显示预测效果，将预测轨迹、真实轨迹和历史轨迹均标记到图像上。假定图2是t0时刻的图像，在图2上用三角形标识标记行人在t0时刻之后10秒轨迹预测结果、用四角星形标识标记该行人的t0时刻之后10秒真实轨迹、用菱形标识标记t0时刻之前10秒真实历史轨迹，如图2所示。图3是t0+10时刻的图像。经过对比图2和图3可以看到，图2中三角形标识代表的t0时刻行人轨迹预测结果和菱形标识代表的未来真实行人轨迹行进趋势一致，并且两条轨迹坐标点偏差值很小。在图3中方框中心是t0+10时刻的行人所在真实位置中心点，该点已经在图2中t0时刻上的预测轨迹中预测出，即图2中三角形标识轨迹最左边一个三角形点。分析预测结果可以看到根据本发明的方法能够准确地预测出行人的未来轨迹。According to steps A and B of the present invention, the trajectory prediction result is obtained. To visualise the prediction effect, the predicted trajectory, the real trajectory and the historical trajectory are all drawn on the image. Assume that Figure 2 is the image at time t_0; in Figure 2 the triangle markers denote the predicted trajectory of the pedestrian for the 10 seconds after t_0, the four-pointed-star markers denote the pedestrian's real trajectory for the 10 seconds after t_0, and the diamond markers denote the real historical trajectory for the 10 seconds before t_0, as shown in Figure 2. Figure 3 is the image at time t_0+10. Comparing Figure 2 with Figure 3 shows that the pedestrian trajectory predicted at time t_0, represented by the triangle markers in Figure 2, is consistent with the direction of the future real pedestrian trajectory represented by the diamond markers, and the deviation between the coordinate points of the two trajectories is very small. In Figure 3 the centre of the box is the centre of the pedestrian's real position at time t_0+10, which was already predicted in the trajectory predicted at time t_0 in Figure 2, namely the leftmost triangle point of the triangle-marked trajectory in Figure 2. Analysing the prediction results shows that the method of the present invention can accurately predict the future trajectory of pedestrians.
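A small matplotlib sketch of overlaying the three trajectories on a frame with the marker convention of Figures 2-3 (the helper name and the use of box centres as trajectory points are illustrative):

```python
import matplotlib.pyplot as plt

def draw_trajectories(image, history, prediction, ground_truth):
    """Overlay the three trajectories on a frame: diamonds = history,
    triangles = prediction, stars = ground truth.  Each trajectory is a
    list of (x, y) centre points in image coordinates."""
    plt.imshow(image)
    plt.plot(*zip(*history),      "D-", label="history")
    plt.plot(*zip(*prediction),   "^-", label="prediction")
    plt.plot(*zip(*ground_truth), "*-", label="ground truth")
    plt.legend()
    plt.axis("off")
    plt.show()
```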

本发明不局限于本实施例,任何在本发明披露的技术范围内的等同构思或者改变,均列为本发明的保护范围。The present invention is not limited to this embodiment, and any equivalent ideas or changes within the technical scope disclosed in the present invention are listed in the protection scope of the present invention.

Claims (1)

1.一种第一视角下的行人轨迹预测方法，其特征在于：包括以下步骤：1. A pedestrian trajectory prediction method under a first viewing angle, characterized in that it comprises the following steps:
A、网络编码器编码得到轨迹特征A. Network encoder encoding to obtain trajectory features
A1、行人头部佩戴或手持运动摄像头，实时获取第一视角下的录取的视频；A1. Pedestrians wear or hold motion cameras on their heads to obtain recorded videos from the first perspective in real time;
A2、把视频按照每秒k帧的帧率分成若干幅图像，k的范围是5~20；A2. Divide the video into several images at a frame rate of k frames per second, where k ranges from 5 to 20;
A3、通过处理器处理步骤A2中分好的图像，经过以下步骤获取行人位置特征向量：A3. The processor processes the images obtained in step A2 and derives the pedestrian position feature vector through the following steps:
A31、通过标记工具对图像中行人进行标记，标记出行人检测框；A31. Use the marking tool to mark the pedestrians in the image, and mark the pedestrian detection frame;
A32、通过时间窗口采样算法对步骤A31中标记的行人检测框进行矫正；由于图像空间中坐标原点在图像的左上角，横轴坐标x值从左向右递增，纵轴坐标y值从上到下递增，所以取行人检测框左上角位置信息(x_i^min, y_i^min)^T及右下角位置信息(x_i^max, y_i^max)^T作为行人轨迹数据；将连续n帧所包括的所有行人的轨迹序列作为一组训练样本，n的范围是10~20，每个行人的训练样本记作L_in：A32. Correct the pedestrian detection frames marked in step A31 with a time-window sampling algorithm; since the coordinate origin of the image space is at the upper-left corner of the image, with the horizontal x value increasing from left to right and the vertical y value increasing from top to bottom, the upper-left corner position (x_i^min, y_i^min)^T and the lower-right corner position (x_i^max, y_i^max)^T of the pedestrian detection frame are taken as the pedestrian trajectory data; the trajectory sequences of all pedestrians contained in n consecutive frames form one group of training samples, with n in the range 10-20, and the training sample of each pedestrian is denoted L_in:
其中，l_i = (x_i^min, y_i^min, x_i^max, y_i^max) ∈ R^4，i的取值范围是t_current−T_his~t_current；t_current为当前时刻，T_his表示历史帧范围，T_his取值为5~20；wherein l_i = (x_i^min, y_i^min, x_i^max, y_i^max) ∈ R^4, i ranges from t_current−T_his to t_current; t_current is the current moment, T_his denotes the historical frame range, and T_his takes a value of 5-20;
A33、构建行人位置特征提取卷积网络，对行人位置和检测框大小处理得到行人位置特征向量L_in^F：A33. Construct a pedestrian position feature extraction convolutional network and process the pedestrian positions and detection-frame sizes to obtain the pedestrian position feature vector L_in^F:
L_in^F = (lf_1, ..., lf_m),
其中，lf_i表示行人位置特征的第i个特征值；where lf_i denotes the i-th feature value of the pedestrian position feature;
所述行人位置特征提取卷积网络结构采用4层结构，先把输入数据L_in输入第一层Conv1d一维卷积层，第一层Conv1d一维卷积层输出结果输入第二层Conv1d一维卷积层，第二层Conv1d一维卷积层输出结果输入第三层Conv1d一维卷积层，第三层Conv1d一维卷积层输出结果输入第四层Conv1d一维卷积层，第四层Conv1d一维卷积层输出结果得到特征向量L_in^F；每层的输出结果都进行BN批量归一化处理并经过Relu激活函数激活；The pedestrian position feature extraction convolutional network has a 4-layer structure: the input data L_in is fed into the first Conv1d one-dimensional convolutional layer, the output of the first Conv1d layer is fed into the second Conv1d layer, the output of the second Conv1d layer is fed into the third Conv1d layer, and the output of the third Conv1d layer is fed into the fourth Conv1d layer, whose output gives the feature vector L_in^F; the output of every layer is batch-normalized (BN) and activated by the ReLU activation function;
A4、获取摄像头自我运动历史特征向量；A4. Obtain the camera self-motion history feature vector;
A41、通过Structure From Motion算法获得当前帧图像相对于前一帧图像的摄像头自我运动信息；所述摄像头自我运动信息包括摄像头自身旋转信息的欧拉角r_t ∈ R^3和速度信息v_t ∈ R^3；所述欧拉角包括偏航角ψ、滚转角φ和俯仰角θ，所述速度信息包括摄像头即时速度在三维坐标轴上的投影v_x, v_y, v_z；摄像头自我运动历史特征向量记作E_H；A41. Obtain the camera self-motion information of the current frame relative to the previous frame through the Structure From Motion algorithm; the camera self-motion information includes the Euler angles r_t ∈ R^3 of the camera's own rotation and the velocity information v_t ∈ R^3; the Euler angles comprise the yaw angle ψ, roll angle φ and pitch angle θ, and the velocity information comprises the projections v_x, v_y, v_z of the camera's instantaneous velocity on the three coordinate axes; the camera self-motion history feature vector is denoted E_H;
其中，e_t = (r_t^T, v_t^T)^T ∈ R^6，t的取值范围是t_current−T_his~t_current；wherein e_t = (r_t^T, v_t^T)^T ∈ R^6, and t ranges from t_current−T_his to t_current;
A42、构建4层摄像头自我运动历史特征提取卷积网络，提取摄像头自我运动历史信息E_H的特征，得到摄像头自我运动历史特征向量E_H^F；A42. Construct a 4-layer convolutional network for camera self-motion history feature extraction, extract the features of the camera self-motion history information E_H, and obtain the camera self-motion history feature vector E_H^F;
E_H^F = (ef_1, ..., ef_n)
其中，ef_i表示摄像头自我运动历史特征的第i个特征值；where ef_i denotes the i-th feature value of the camera self-motion history feature;
所述摄像头自我运动历史特征提取卷积网络结构采用行人位置特征提取卷积网络相同的结构；The convolutional network for camera self-motion history feature extraction adopts the same structure as the pedestrian position feature extraction convolutional network;
A5、将L_in^F和E_H^F两个向量首尾相连连接在一起，得到特征向量LE^F：A5. Concatenate the two vectors L_in^F and E_H^F end to end to obtain the feature vector LE^F:
LE^F = (lf_1, ..., lf_m, ef_1, ..., ef_n)
A6、获取摄像头自我运动未来特征向量；A6. Obtain the future feature vector of the camera's self-motion;
A61、采用与步骤A41相同的方法，通过Structure From Motion算法得到运动摄像头自我运动未来的T_future帧运动信息，表示运动摄像头未来打算去往何方，记作E_Fur；A61. Using the same method as step A41, obtain through the Structure From Motion algorithm the motion information of the moving camera's own future T_future frames, which indicates where the moving camera intends to go in the future, and denote it E_Fur;
A62、构建摄像头自我运动未来特征提取卷积网络CNN，提取摄像头未来自我运动信息E_Fur的特征，得到摄像头自我运动未来特征向量E_Fur^F：A62. Construct a convolutional network CNN for camera self-motion future feature extraction, extract the features of the camera's future self-motion information E_Fur, and obtain the camera self-motion future feature vector E_Fur^F:
E_Fur^F = (eff_1, ..., eff_n)
其中，eff_i表示摄像头自我运动未来特征的第i个特征值；where eff_i denotes the i-th feature value of the camera self-motion future feature;
所述摄像头未来自我运动特征提取卷积网络结构与步骤A42的摄像头自我运动历史特征提取卷积网络结构相同；The convolutional network for the camera's future self-motion feature extraction has the same structure as the convolutional network for camera self-motion history feature extraction in step A42;
B、网络解码器解码预测行人未来轨迹B. The network decoder decodes and predicts the future trajectory of pedestrians
在网络解码器结构中对网络编码器输出的行人位置信息特征和自我运动信息特征进行解码，通过反卷积网络得到行人未来轨迹；为了提高预测精度，加入能代表未来走向的运动摄像头自身的未来运动信息，具体步骤如下：In the network decoder, the pedestrian position features and self-motion features output by the network encoder are decoded, and the future pedestrian trajectory is obtained through the deconvolution network; to improve the prediction accuracy, the future motion information of the moving camera itself, which represents the future direction of travel, is added; the specific steps are as follows:
B1、构建标准循环神经网络RNN，单元数为n；B1. Construct a standard recurrent neural network RNN with n units;
B2、将特征向量LE^F和E_Fur^F作为循环神经网络RNN的两个输入，得到网络的输出为预测序列L_out；B2. Take the feature vectors LE^F and E_Fur^F as the two inputs of the recurrent neural network RNN; the output of the network is the prediction sequence L_out;
B3、构建解码轨迹序列反卷积网络；B3. Construct a deconvolution network for decoding the trajectory sequence;
解码轨迹序列反卷积网络结构采用4层结构，先将预测序列L_out输入第一层Conv1d一维卷积层，第一层Conv1d一维卷积层输出结果输入第二层Conv1d一维卷积层，第二层Conv1d一维卷积层输出结果输入第三层Conv1d一维卷积层，前三层的输出结果都是先进行BN批量归一化处理并经过Relu激活函数激活；最后把第三层Conv1d一维卷积层输出结果输入第四层Conv1d一维卷积层；The deconvolution network for decoding the trajectory sequence has a 4-layer structure: the prediction sequence L_out is fed into the first Conv1d one-dimensional convolutional layer, the output of the first Conv1d layer is fed into the second Conv1d layer, and the output of the second Conv1d layer is fed into the third Conv1d layer; the outputs of the first three layers are each batch-normalized (BN) and activated by the ReLU activation function; finally, the output of the third Conv1d layer is fed into the fourth Conv1d layer;
B4、对预测序列L_out输入步骤B3构建好的反卷积网络，得到行人未来的检测框大小信息和轨迹信息L_pre：B4. Feed the prediction sequence L_out into the deconvolution network constructed in step B3 to obtain the pedestrian's future detection-frame size and trajectory information L_pre:
其中，l_i = (x_i^min, y_i^min, x_i^max, y_i^max) ∈ R^4，i的取值范围是t_current+1~t_current+T_future，t_current为当前时刻，T_future为预测的未来帧数，范围是5~20。Wherein, l_i = (x_i^min, y_i^min, x_i^max, y_i^max) ∈ R^4, i ranges from t_current+1 to t_current+T_future, t_current is the current moment, and T_future is the number of predicted future frames, in the range 5-20.
CN201910807214.2A 2019-08-29 2019-08-29 Method for predicting pedestrian track at first view angle Active CN110516613B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910807214.2A CN110516613B (en) 2019-08-29 2019-08-29 Method for predicting pedestrian track at first view angle

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910807214.2A CN110516613B (en) 2019-08-29 2019-08-29 Method for predicting pedestrian track at first view angle

Publications (2)

Publication Number Publication Date
CN110516613A true CN110516613A (en) 2019-11-29
CN110516613B CN110516613B (en) 2023-04-18

Family

ID=68629021

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910807214.2A Active CN110516613B (en) 2019-08-29 2019-08-29 Method for predicting pedestrian track at first view angle

Country Status (1)

Country Link
CN (1) CN110516613B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114116944A (en) * 2021-11-30 2022-03-01 重庆七腾科技有限公司 Trajectory prediction method and device based on time attention convolution network
CN114581487A (en) * 2021-08-02 2022-06-03 北京易航远智科技有限公司 Pedestrian trajectory prediction method and device, electronic equipment and computer program product

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160379074A1 (en) * 2015-06-25 2016-12-29 Appropolis Inc. System and a method for tracking mobile objects using cameras and tag devices
CN108875588A (en) * 2018-05-25 2018-11-23 武汉大学 Across camera pedestrian detection tracking based on deep learning
CN109063581A (en) * 2017-10-20 2018-12-21 奥瞳系统科技有限公司 Enhanced Face datection and face tracking method and system for limited resources embedded vision system

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160379074A1 (en) * 2015-06-25 2016-12-29 Appropolis Inc. System and a method for tracking mobile objects using cameras and tag devices
CN109063581A (en) * 2017-10-20 2018-12-21 奥瞳系统科技有限公司 Enhanced Face datection and face tracking method and system for limited resources embedded vision system
CN108875588A (en) * 2018-05-25 2018-11-23 武汉大学 Across camera pedestrian detection tracking based on deep learning

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
JO SUNYOUNG 等: ""Doppler Channel Series Prediction Using Recurrent Neural Networks"" *
张德正 等: ""基于深度卷积长短时神经网络的视频帧预测"" *
韩昭蓉 等: ""基于Bi-LSTM模型的轨迹异常点检测算法"" *
高玄 等: ""基于图像处理的人群行为识别方法综述"" *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114581487A (en) * 2021-08-02 2022-06-03 北京易航远智科技有限公司 Pedestrian trajectory prediction method and device, electronic equipment and computer program product
CN114581487B (en) * 2021-08-02 2022-11-25 北京易航远智科技有限公司 Pedestrian trajectory prediction method, device, electronic equipment and computer program product
CN114116944A (en) * 2021-11-30 2022-03-01 重庆七腾科技有限公司 Trajectory prediction method and device based on time attention convolution network
CN114116944B (en) * 2021-11-30 2024-06-11 重庆七腾科技有限公司 Track prediction method and device based on time attention convolution network

Also Published As

Publication number Publication date
CN110516613B (en) 2023-04-18

Similar Documents

Publication Publication Date Title
US11763466B2 (en) Determining structure and motion in images using neural networks
US20220417590A1 (en) Electronic device, contents searching system and searching method thereof
WO2021139484A1 (en) Target tracking method and apparatus, electronic device, and storage medium
CN107274433B (en) Target tracking method, device and storage medium based on deep learning
WO2020173226A1 (en) Spatial-temporal behavior detection method
CN102982598B (en) Video people counting method and system based on single camera scene configuration
Manglik et al. Forecasting time-to-collision from monocular video: Feasibility, dataset, and challenges
JP7427614B2 (en) sensor calibration
KR102207195B1 (en) Apparatus and method for correcting orientation information from one or more inertial sensors
CN104899590B (en) A method and system for following an unmanned aerial vehicle visual target
CN103440667B (en) The automaton that under a kind of occlusion state, moving target is stably followed the trail of
CN108010067A (en) A kind of visual target tracking method based on combination determination strategy
CN104700088B (en) A kind of gesture track recognition method under the follow shot based on monocular vision
CN104091349A (en) Robust target tracking method based on support vector machine
Saric et al. Warp to the future: Joint forecasting of features and feature motion
CN110516613B (en) Method for predicting pedestrian track at first view angle
CN104036243A (en) Behavior recognition method based on light stream information
CN110781962A (en) Target detection method based on lightweight convolutional neural network
CN113312973A (en) Method and system for extracting features of gesture recognition key points
CN106296743A (en) A kind of adaptive motion method for tracking target and unmanned plane follow the tracks of system
CN113686314A (en) Monocular water surface target segmentation and monocular distance measurement method of shipborne camera
CN104574443B (en) The cooperative tracking method of moving target between a kind of panoramic camera
Łysakowski et al. Real-time onboard object detection for augmented reality: Enhancing head-mounted display with yolov8
CN106327528A (en) Moving object tracking method and operation method of unmanned aerial vehicle
CN104036238B (en) The method of the human eye positioning based on active light

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant