CN114997067A - Trajectory prediction method based on spatiotemporal graph and spatial aggregation Transformer network - Google Patents
Trajectory prediction method based on spatiotemporal graph and spatial aggregation Transformer network
- Publication number
- CN114997067A CN114997067A CN202210767796.8A CN202210767796A CN114997067A CN 114997067 A CN114997067 A CN 114997067A CN 202210767796 A CN202210767796 A CN 202210767796A CN 114997067 A CN114997067 A CN 114997067A
- Authority
- CN
- China
- Prior art keywords
- pedestrian
- graph
- network
- trajectory
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F30/00—Computer-aided design [CAD]
- G06F30/20—Design optimisation, verification or simulation
- G06F30/27—Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9537—Spatial or temporal dependent retrieval, e.g. spatiotemporal queries
- G06F2119/00—Details relating to the type or aim of the analysis or the optimisation
- G06F2119/02—Reliability analysis or reliability optimisation; Failure analysis, e.g. worst case scenario performance, failure mode and effects analysis [FMEA]
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
Description
Technical Field
The present invention relates to a trajectory prediction method based on a spatiotemporal graph and a spatial aggregation Transformer network, and belongs to the fields of artificial intelligence and autonomous driving.
Background Art
Pedestrian trajectory prediction has a deep theoretical background and substantial practical value; in fields such as autonomous driving and intelligent surveillance, pedestrian trajectory recognition and prediction have long played an important role. In recent years, advances in artificial intelligence and deep learning have drawn growing attention to the deployment and application of intelligent algorithms for pedestrian trajectory prediction.
The goal of pedestrian trajectory prediction has always been to enable an agent to better understand and judge the behavior of traffic participants in a scene, to build a prediction model that captures spatial interaction features and produces the corresponding forecasts, and thereby to support accurate, fast, and reasonable decisions. However, the high complexity and uncertainty of the problem create the following difficulty: because of rich scene features, a pedestrian's future trajectory is influenced not only by that pedestrian's own history and intended route, but also by obstacles and other traffic participants across both space and time. Whether a reasonable, accurate model can be built that delivers fast prediction and decision-making is therefore the key to applying pedestrian trajectory prediction in real-world scenarios.
Owing to the development of machine learning within artificial intelligence, trajectory prediction methods based on LSTM and CNN architectures were the mainstream for a long time. These methods use simple models and achieve fairly good prediction results with few parameters and basic architectures; they also supplied ideas and basic building blocks for the subsequent, deeper algorithmic research, and were pioneering in that sense.
Because graphs and graph network architectures have natural advantages for representing the data of pedestrian trajectory prediction, graph-based approaches have become a popular research direction in recent years. Mohamed A et al. (Social-STGCNN: A social spatio-temporal graph convolutional neural network for human trajectory prediction [C], 2020) used a spatiotemporal graph neural network that applies two different convolution operations, one over the temporal domain and one over the spatial domain, to extract trajectory features while producing the prediction output. Their model also accounts for the randomness and uncertainty of pedestrian trajectories in space: at prediction time, each pedestrian's intended route and destination are unknown, so a reasonable approach is to assume that the predicted horizontal and vertical coordinates follow a bivariate Gaussian distribution and to output trajectories by sampling during evaluation. The model makes predictions under this assumption and achieves fairly good results. However, such a model still performs no further processing of pedestrian interaction features, so its spatial interaction capacity is insufficient: the generated trajectories carry large inertia, and the model cannot produce the tightly coupled motion predictions implied by the movement patterns of pedestrian groups.
In recent years, many researchers have studied pedestrian trajectory prediction on graph representations combined with a variety of other algorithmic tools and research methods, making progress on several fronts. Dan X et al. (Spatial-Temporal Block and LSTM Network for Pedestrian Trajectories Prediction [J]) proposed a model architecture based on spatiotemporal blocks and LSTM: on a graph representation, a graph embedding yields a relation feature vector between each pedestrian node and its neighbors, which is fed into an LSTM to encode spatiotemporal pedestrian interaction features for prediction, achieving good results. Rainbow B A et al. (Semantics-STGCNN: A semantics-guided spatial-temporal graph convolutional network for multi-class trajectory prediction [C]) proposed a semantics-based spatiotemporal graph model, Semantics-STGCNN, which starts from semantic scene understanding, embeds the class labels of pedestrian objects into a label adjacency matrix, combines it with a velocity adjacency matrix to output a semantic adjacency matrix, and thus models the semantic information before producing the final prediction. Yu C et al. (Spatio-temporal graph transformer networks for pedestrian trajectory prediction [C]) used a Transformer-based network model that, exploiting the Transformer's strong performance in other fields, directly stacks several basic Transformer blocks to extract pedestrians' spatiotemporal features in the scene and complete the prediction.
Addressing the problems and shortcomings of existing trajectory prediction methods in extracting and predicting pedestrian spatial interaction features, the present invention proposes a new network architecture for pedestrian trajectory prediction that combines a spatiotemporal graph with a spatial aggregation Transformer. The raw input data are given an appropriate graph representation and preprocessing; a spatiotemporal graph convolutional network and a temporal feature transformation network extract the pedestrian trajectory features; and a spatial Transformer network architecture is introduced to fully extract and aggregate deep spatial features, ensuring that the model captures spatial pedestrian interactions effectively and accurately. The invention emphasizes the spatial plausibility of the predicted interactions, preserves pedestrians' natural walking behavior while accounting for mutual influence, and achieves a particular breakthrough in predicting trajectory endpoints. It contributes positively to modeling pedestrian interaction and predicting trajectories in complex scenes, and offers help and inspiration for research in autonomous driving, artificial intelligence, and related fields.
Summary of the Invention
The present invention discloses a trajectory prediction method based on a spatiotemporal graph and a spatial aggregation Transformer network. Existing pedestrian trajectory prediction methods extract spatial trajectory information insufficiently, so the relative positions of walking pedestrians remain unclear and the predictions cannot produce the wide turns needed to avoid collisions. To address this, the method builds a new graph-based trajectory prediction model architecture: pedestrian features in the scene are extracted through a spatiotemporal graph convolutional network and a temporal feature transformation network, a new spatial aggregation Transformer architecture transforms and exploits the pedestrians' temporal features, and the predicted trajectories are finally output in the form of probability distributions, achieving reasonable prediction.
In the spatiotemporal graph convolutional network, the pedestrian trajectory features of the scene are represented and preprocessed as a graph, and a graph convolutional network performs the preliminary extraction of spatial trajectory features, which serve as input to the subsequent networks.
In the temporal feature transformation network, a convolutional neural network extracts the temporal features and transforms the feature dimensions; the network is designed so as to keep the model parameters simple and improve model performance.
In the spatial aggregation Transformer network, the features previously obtained from the spatiotemporal graph convolutional network and the temporal feature transformation network are processed further. To mine and model the interaction of pedestrian features in the spatial scene more deeply, the model of the present invention uses each pedestrian's temporal feature vector as an input vector to the spatial aggregation Transformer network, which fully extracts and aggregates the spatial trajectory features while also completing the trajectory prediction output.
The present invention mainly comprises the following steps:
Step (1): Using graph properties, represent and preprocess the pedestrian trajectory features of the scene from the raw input data, and select a suitable kernel function to construct the adjacency matrix, providing accurate and efficient pedestrian information for the subsequent network architecture;
Step (2): Build the spatiotemporal graph convolutional network module and, by choosing the number of graph convolutions applied to the trajectory features, perform the preliminary extraction of spatial trajectory features, ensuring that the extracted features are accurate and effective;
Step (3): Build the temporal feature transformation network module, using a designed convolutional neural network to extract temporal features and transform the feature dimensions;
Step (4): Build the spatial aggregation Transformer network, feed the temporal feature vector of every pedestrian in the scene into the Transformer to aggregate the spatial features further, and output the predicted pedestrian trajectory sequence.
Further, in step (1), a spatiotemporal graph is introduced to represent the raw input pedestrian trajectory data, and a suitable kernel function is chosen from several candidates to construct the adjacency matrix in the graph sense, completing an efficient construction and selection of pedestrian features that supplies accurate, efficient information for the subsequent modeling.
Further, representing the raw input pedestrian trajectory data with a spatiotemporal graph proceeds as follows. For each time step $t$, a spatial graph $G_t$ is introduced to represent the interaction relations among pedestrians at that time. $G_t$ is defined as $G_t = (V_t, E_t)$, where $V_t = \{v_t^i \mid i = 1, \dots, N\}$ represents the coordinate information of the pedestrians in the scene at time $t$; the feature of each node $v_t^i$ is described by the observed relative coordinate change:

$$v_t^i = \left(x_t^i - x_{t-1}^i,\; y_t^i - y_{t-1}^i\right)$$

where $i = 1, \dots, N$ and $t = 2, \dots, T_{obs}$; at the initial time step the relative position offset is defined to be zero, i.e. $v_1^i = (0, 0)$.

$E_t$ denotes the edge information of the spatial graph $G_t$ and is an $n \times n$ matrix, defined as $E_t = \{e_t^{ij} \mid i, j = 1, \dots, N\}$, with entries given as follows: if node $v_t^i$ and node $v_t^j$ are connected, then $e_t^{ij} = 1$; conversely, if they are not connected, then $e_t^{ij} = 0$.
Further, selecting a suitable kernel function to construct the adjacency matrix in the graph sense proceeds as follows:

A weighted adjacency matrix $A_t$ is introduced to weight the node information of the pedestrian spatial graph; the strength of the mutual influence between pedestrians is obtained through a kernel-function transformation and stored in $A_t$.

The reciprocal of the Euclidean distance between two nodes is chosen as the kernel function and, to avoid the divergence that occurs when two nodes come too close, a small constant $\varepsilon$ is added, which also accelerates model convergence:

$$a_t^{ij} = \frac{1}{\left\lVert v_t^i - v_t^j \right\rVert_2 + \varepsilon}$$

Stacking the spatial graphs $G_t$ over the time dimension yields the spatiotemporal graph sequence $G = \{G_1, \dots, G_T\}$ representing the pedestrian trajectories.
Further, step (2) is specifically as follows:

For the input sequence of feature graphs, the output of the constructed spatiotemporal graph convolutional network is:

$$e_t = \mathrm{GNN}(G_t) \qquad (1.6)$$

where GNN denotes the constructed spatiotemporal graph convolutional network, which obtains its output through multiple layers of iterated graph convolution, and $e_t$ denotes the spatiotemporal features preliminarily extracted from the spatial dimension by the graph neural network.

This operation is performed for the output of every time step; the actual output of the graph convolutional network is the stack of this time series:

$$e_g = \mathrm{Stack}(e_t) \qquad (1.7)$$

where $\mathrm{Stack}(\cdot)$ denotes stacking the inputs along the extended dimension and $e_g$ denotes the graph-convolution output. In practice, the extended dimensions are fed into the graph neural network in parallel.

A fully connected layer FC then applies an appropriate dimension transformation to the features:

$$V_{GNN} = \mathrm{FC}(e_g) \qquad (1.8)$$

which yields the preliminary feature extraction output of the spatiotemporal graph convolutional network.
Further, in step (3), the output of the spatiotemporal graph convolutional network is dimension-transformed, and a CNN-based temporal feature transformation module with a designed number of convolutions extracts each pedestrian's own historical trajectory features.

Further, step (3) is specifically as follows:

After the feature extraction output of the spatiotemporal graph convolutional network is obtained, it is fed into a temporal feature transformation network to extract the temporal features. Because step (2) already applied a fully connected layer to transform the feature dimensions appropriately, the network module in this step uses the obtained features directly. In the present invention, a multi-layer CNN is chosen to process the temporal feature information:

$$e_c = \mathrm{CNN}(V_{GNN}) \qquad (1.9)$$

where $V_{GNN}$ denotes the features extracted by the graph convolutional network and $e_c$ the output of the temporal feature transformation network. A multi-layer perceptron (MLP) follows, to increase the expressive capacity of the network:

$$V_{CNN} = \mathrm{MLP}(e_c) \qquad (1.10)$$

The transformation and processing above yield the output $V_{CNN}$ of the temporal feature transformation network.
Further, the main construction and computation of step (4) are as follows. To strengthen the spatial relations among pedestrian features, a spatial Transformer network is designed to further aggregate the extracted features spatially. In particular, the temporal feature vector of a single pedestrian is used as one input vector, and the extracted features of the different pedestrians are fed in one after another.

The spatial aggregation Transformer network adopts the encoder layer of the Transformer architecture. Position encoding is first added to the input:

$$V_{in} = V_{CNN} + PE_{pos,i}(V_{CNN}) \qquad (1.11)$$

where $pos$ denotes the relative position of an input feature and $i$ its dimension index. A multi-head attention layer follows: the three attention inputs Query ($Q$), Key ($K$), and Value ($V$) are obtained from the input by matrix transformations, the input features are partitioned according to the configured number of heads, and the attention scores are computed as:

$$\mathrm{Attention}(Q,K,V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \qquad (1.12)$$

$$\mathrm{head}_i = \mathrm{Attention}(Q_i, K_i, V_i) \qquad (1.13)$$

where $i = 1, \dots, n_{head}$, with $n_{head}$ the number of heads. The final multi-head output completes the feature extraction by concatenation:

$$V_{Multi} = \mathrm{ConCat}(\mathrm{head}_1, \dots, \mathrm{head}_h)W^{o} \qquad (1.14)$$

where ConCat denotes the concatenation operation and $W^{o}$ the output parameter matrix of the attention layer.

The final output of the spatial Transformer is then obtained through a feed-forward network and layer normalization:

$$V_{out} = \mathrm{LN}(\mathrm{FFN}(V_{Multi})) \qquad (1.15)$$

In this way, the architecture aggregates the pedestrian spatial interaction features from the preliminarily extracted spatiotemporal features, so that the output trajectories better reflect the pedestrian association and interaction in the scene.
For the loss function, the sum of the negative log-likelihoods of every point on the predicted pedestrian trajectory is used. The loss of the $i$-th pedestrian is:

$$L^{i} = -\sum_{t=T_{obs}+1}^{T_{pred}} \log \mathbb{P}\!\left(p_t^{i} \,\middle|\, \hat{\mu}_t^{i}, \hat{\sigma}_t^{i}, \hat{\rho}_t^{i}\right) \qquad (1.16)$$

where $\hat{\mu}_t^{i}, \hat{\sigma}_t^{i}, \hat{\rho}_t^{i}$ are the unknown trajectory distribution parameters to be predicted, and $T_{obs}, T_{pred}$ denote the observation and prediction end times, respectively. The sum of the losses of all pedestrians gives the final loss:

$$L = \sum_{i=1}^{N} L^{i} \qquad (1.17)$$

Training the proposed model architecture with this forward loss computation and backward parameter updates completes the training of the model and yields reasonable predicted pedestrian trajectories.
Beneficial Effects
Existing pedestrian trajectory prediction outputs suffer from insufficient interaction feature extraction, which blurs the spatial character of pedestrian motion. This shows up in two ways: the predicted trajectories carry large inertia and cannot execute the wide turns needed to avoid high-speed or sudden situations, and the motion consistency of pedestrian groups is poorly preserved, so closely associated pedestrians in the scene fail to keep a common motion trend over time. To solve these problems, the present invention proposes a completely new network model architecture: spatiotemporal graph convolution, a temporal feature transformation network, and related transformation operations extract the pedestrian features of the scene effectively and accurately, while a new spatial aggregation Transformer architecture transforms and exploits the pedestrians' temporal features; the predicted trajectories are finally output in the form of probability distributions. The method avoids sudden hazards sensibly, keeps group pedestrian motion consistent, and predicts pedestrian spatial interaction more accurately and reasonably. It offers new ideas for further in-depth research on pedestrian trajectory prediction, has deep significance for more accurate and timely prediction and application in real scenarios, and supports the development of autonomous driving, intelligent transportation, and related fields.
Brief Description of the Drawings
Fig. 1 is an overall schematic diagram of the network framework based on the spatiotemporal graph and the spatial aggregation Transformer in an embodiment of the present invention;
Fig. 2 is a schematic diagram of trajectory prediction in the present invention, in which the temporally transformed features are fed into the spatial aggregation Transformer network.
Detailed Description of the Embodiments
The present invention relates to a pedestrian trajectory prediction method based on a spatiotemporal graph and a spatial aggregation Transformer network. An embodiment mainly comprises the following steps:
The pedestrian trajectory prediction problem in a given scene is defined by the coordinates of $N$ pedestrians in the scene at every observation time. The coordinates of the $i$-th pedestrian at time $t$ are denoted $p_t^{i} = (x_t^{i}, y_t^{i})$. With this definition, the general statement of the problem is: for each known observed pedestrian trajectory sequence

$$X^{i} = \left\{ p_t^{i} \mid t = 1, \dots, T_{obs} \right\}, \quad i = 1, \dots, N \qquad (1.1)$$

the constructed network framework extracts and models the trajectory characteristics from the input data, obtains suitable trajectory features, and produces a reasonable trajectory prediction for the scene:

$$\hat{Y}^{i} = \left\{ \hat{p}_t^{i} \mid t = T_{obs}+1, \dots, T_{pred} \right\}, \quad i = 1, \dots, N \qquad (1.2)$$

where $T_{obs}$ and $T_{pred}$ denote the pedestrian observation time span and prediction time span respectively, $p_t^{i}$ denotes the ground-truth trajectory, and $\hat{p}_t^{i}$ denotes the trajectory predicted by the model.
Fig. 1 shows the overall schematic of the network framework based on the spatiotemporal graph and the spatial aggregation Transformer in this embodiment of the present invention.
Step 1: Give the data an appropriate graph representation and preprocessing, providing accurate and efficient information about the pedestrians in the scene.
In the present invention, an appropriate graph representation method first converts and preprocesses the raw input pedestrian trajectory data, which facilitates the later extraction and efficient use of the input features.
For each time step $t$, a spatial graph $G_t$ is introduced to represent the interaction relations among pedestrians at that time. $G_t$ is defined as $G_t = (V_t, E_t)$, where $V_t$ denotes the node information of the spatial graph; in this model, $V_t$ represents the coordinate information of the pedestrians in the scene at time $t$, i.e. $V_t = \{v_t^i \mid i = 1, \dots, N\}$. For this model, the feature of each node $v_t^i$ is described by the observed relative coordinate change:

$$v_t^i = \left(x_t^i - x_{t-1}^i,\; y_t^i - y_{t-1}^i\right) \qquad (1.3)$$

where $i = 1, \dots, N$ and $t = 2, \dots, T_{obs}$; at the initial time step the relative position offset is defined to be zero, i.e. $v_1^i = (0, 0)$.

$E_t$ denotes the edge information of the spatial graph $G_t$ and is an $n \times n$ matrix, commonly defined as $E_t = \{e_t^{ij} \mid i, j = 1, \dots, N\}$, with entries:

$$e_t^{ij} = \begin{cases} 1, & v_t^i \text{ and } v_t^j \text{ are connected} \\ 0, & \text{otherwise} \end{cases} \qquad (1.4)$$

For this prediction task, one wants not only to know whether pedestrians are related, but also to measure the relative strength of their mutual influence in space. A weighted adjacency matrix $A_t$ is therefore introduced to weight the node information of the pedestrian spatial graph; the influence strengths are obtained through a kernel-function transformation and stored in $A_t$. In the present invention, the reciprocal of the Euclidean distance between two nodes is chosen as the kernel function and, to avoid the divergence that occurs when two nodes come too close, a small constant $\varepsilon$ is added, which also accelerates model convergence:

$$a_t^{ij} = \frac{1}{\left\lVert v_t^i - v_t^j \right\rVert_2 + \varepsilon} \qquad (1.5)$$

Stacking the spatial graphs $G_t$ over the time dimension yields the spatiotemporal graph sequence $G = \{G_1, \dots, G_T\}$ representing the pedestrian trajectories. This definition and transformation complete the graph representation and preprocessing of the data for the pedestrian trajectory prediction problem.
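As a concrete illustration of this preprocessing step, the sketch below builds the relative-displacement node features of Eq. (1.3) and the inverse-distance weighted adjacency matrices of Eq. (1.5) in NumPy. The array layout (N pedestrians × T time steps × 2 coordinates), the value of ε, and the choice to measure distances between the node feature vectors are assumptions made for illustration, not values fixed by the description above.

```python
import numpy as np

def build_graph_sequence(coords, eps=1e-6):
    """coords: (N, T, 2) array of absolute pedestrian positions.

    Returns:
      V: (T, N, 2) relative-displacement node features (Eq. 1.3).
      A: (T, N, N) weighted adjacency matrices (Eq. 1.5).
    """
    N, T, _ = coords.shape
    V = np.zeros((T, N, 2))
    # v_t^i = p_t^i - p_{t-1}^i for t >= 2; zero at the initial step.
    V[1:] = np.transpose(coords[:, 1:] - coords[:, :-1], (1, 0, 2))

    A = np.zeros((T, N, N))
    for t in range(T):
        diff = V[t][:, None, :] - V[t][None, :, :]   # pairwise feature differences
        dist = np.linalg.norm(diff, axis=-1)         # Euclidean distances
        A[t] = 1.0 / (dist + eps)                    # inverse-distance kernel + epsilon
        np.fill_diagonal(A[t], 0.0)                  # no self-influence
    return V, A
```

With ε > 0 the kernel stays bounded even when two pedestrians coincide, which is exactly the divergence problem the constant is introduced to avoid.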
Step 2: Build the spatiotemporal graph convolutional network for the preliminary extraction of the feature information.
In the present invention, the data given the graph representation in step 1 are fed into a spatiotemporal graph convolutional network for the preliminary extraction of the features.
In this model architecture, a graph convolutional network is used, and an appropriate number of convolution layers is determined so that the features are iterated a suitable number of times and the spatial trajectory features are extracted well.
For the input sequence of feature graphs, the output of the constructed spatiotemporal graph convolutional network is:

$$e_t = \mathrm{GNN}(G_t) \qquad (1.6)$$

where GNN denotes the constructed spatiotemporal graph convolutional network, which obtains its output through multiple layers of iterated graph convolution, and $e_t$ denotes the spatiotemporal features preliminarily extracted from the spatial dimension by the graph neural network.

This operation is performed for the output of every time step. The actual output of the graph convolutional network is the stack of this time series:

$$e_g = \mathrm{Stack}(e_t) \qquad (1.7)$$

where $\mathrm{Stack}(\cdot)$ denotes stacking the inputs along the extended dimension and $e_g$ denotes the graph-convolution output. In practice, the extended dimensions are fed into the graph neural network in parallel.

A fully connected layer FC then applies an appropriate dimension transformation to the features:

$$V_{GNN} = \mathrm{FC}(e_g) \qquad (1.8)$$

which yields the preliminary feature extraction output of the spatiotemporal graph convolutional network.
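A minimal PyTorch sketch of this module follows. It applies a per-time-step graph-convolution update of the form H' = σ(Â H W) with the weighted adjacency from step 1, then the fully connected dimension transform of Eq. (1.8). The symmetric normalization of A_t, the layer widths, and the two-layer depth are illustrative assumptions rather than values specified above.

```python
import torch
import torch.nn as nn

def normalize_adj(A):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2}, applied per time step."""
    A_hat = A + torch.eye(A.size(-1), device=A.device)
    d_inv_sqrt = A_hat.sum(-1).pow(-0.5)
    return d_inv_sqrt.unsqueeze(-1) * A_hat * d_inv_sqrt.unsqueeze(-2)

class SpatioTemporalGCN(nn.Module):
    def __init__(self, in_dim=2, hid_dim=32, out_dim=64, n_layers=2):
        super().__init__()
        dims = [in_dim] + [hid_dim] * (n_layers - 1)
        self.weights = nn.ModuleList(nn.Linear(d, hid_dim, bias=False) for d in dims)
        self.fc = nn.Linear(hid_dim, out_dim)        # dimension transform, Eq. (1.8)

    def forward(self, V, A):
        """V: (T, N, in_dim) node features; A: (T, N, N) weighted adjacency."""
        A_hat = normalize_adj(A)
        h = V
        for W in self.weights:                       # multi-layer graph convolution
            h = torch.relu(A_hat @ W(h))             # e_t = GNN(G_t), Eqs. (1.6)-(1.7)
        return self.fc(h)                            # V_GNN, shape (T, N, out_dim)
```

Because every time step shares the same weights, all T graphs are processed in one batched matrix multiplication, matching the remark above that the extended dimensions are fed in parallel.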
Step 3: Build the temporal feature transformation network, using a designed convolutional neural network to extract the temporal features and transform the feature dimensions.
After the feature extraction output of the spatiotemporal graph convolutional network is obtained, it is fed into a temporal feature transformation network to extract the temporal features. Because step 2 already applied a fully connected layer to transform the feature dimensions appropriately, the network module in this step uses the obtained features directly. In the present invention, a multi-layer CNN is chosen to process the temporal feature information:

$$e_c = \mathrm{CNN}(V_{GNN}) \qquad (1.9)$$

where $V_{GNN}$ denotes the features extracted by the graph convolutional network and $e_c$ the output of the temporal feature transformation network. A multi-layer perceptron (MLP) follows, to increase the expressive capacity of the network:

$$V_{CNN} = \mathrm{MLP}(e_c) \qquad (1.10)$$

The transformation and processing above yield the output $V_{CNN}$ of the temporal feature transformation network.
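One plausible realization of this temporal module in PyTorch is sketched below: 1-D convolutions along the time axis for each pedestrian, followed by a small MLP (Eqs. 1.9-1.10). The kernel size, channel count, and layer depth are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TemporalTransform(nn.Module):
    def __init__(self, feat_dim=64, n_layers=3, kernel_size=3):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(feat_dim, feat_dim, kernel_size, padding=kernel_size // 2)
            for _ in range(n_layers))
        self.mlp = nn.Sequential(                    # expressive capacity, Eq. (1.10)
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim))

    def forward(self, V_gnn):
        """V_gnn: (T, N, F) features from the graph convolutional network."""
        h = V_gnn.permute(1, 2, 0)                   # (N, F, T): convolve over time
        for conv in self.convs:
            h = torch.relu(conv(h))                  # e_c = CNN(V_GNN), Eq. (1.9)
        return self.mlp(h.permute(2, 0, 1))          # V_CNN, back to (T, N, F)
```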
Step 4: Build the spatial aggregation Transformer network for the further aggregation of the spatial features and the output of the predicted pedestrian trajectory sequence.
Existing pedestrian trajectory prediction outputs extract interaction features insufficiently, which blurs pedestrians' spatial characteristics. On one hand, the predicted trajectories carry large inertia and cannot execute the wide turns needed to avoid high-speed or sudden situations; on the other hand, the motion consistency of pedestrian group behavior is poorly preserved, so closely associated pedestrians in the scene fail to keep a common motion trend over time.
In the present invention, to strengthen the spatial relations among pedestrian features, a spatial Transformer network is designed to further aggregate the extracted features spatially. In particular, the temporal feature vector of a single pedestrian is used as one input vector, and the extracted features of the different pedestrians are fed in one after another.
The spatial aggregation Transformer network adopts the encoder layer of the Transformer architecture. Position encoding is first added to the input:

$$V_{in} = V_{CNN} + PE_{pos,i}(V_{CNN}) \qquad (1.11)$$

where $pos$ denotes the relative position of an input feature and $i$ its dimension index. A multi-head attention layer follows: the three attention inputs Query ($Q$), Key ($K$), and Value ($V$) are obtained from the input by matrix transformations, the input features are partitioned according to the configured number of heads, and the attention scores are computed as:

$$\mathrm{Attention}(Q,K,V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \qquad (1.12)$$

$$\mathrm{head}_i = \mathrm{Attention}(Q_i, K_i, V_i) \qquad (1.13)$$

where $i = 1, \dots, n_{head}$, with $n_{head}$ the number of heads. The final multi-head output completes the feature extraction by concatenation:

$$V_{Multi} = \mathrm{ConCat}(\mathrm{head}_1, \dots, \mathrm{head}_h)W^{o} \qquad (1.14)$$

where ConCat denotes the concatenation operation and $W^{o}$ the output parameter matrix of the attention layer.

The final output of the spatial Transformer is then obtained through a feed-forward network and layer normalization:

$$V_{out} = \mathrm{LN}(\mathrm{FFN}(V_{Multi})) \qquad (1.15)$$

In this way, the architecture aggregates the pedestrian spatial interaction features from the preliminarily extracted spatiotemporal features, so that the output trajectories better reflect the pedestrian association and interaction in the scene.
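The sketch below mirrors this spatial aggregation step with PyTorch's built-in multi-head attention: each pedestrian's temporal feature vector becomes one token, sinusoidal position encoding is added (Eq. 1.11), and a residual feed-forward block with layer normalization produces the output (Eqs. 1.12-1.15). The head count, hidden sizes, and the exact residual placement are illustrative assumptions.

```python
import torch
import torch.nn as nn

def positional_encoding(n_tokens, d_model, device):
    """Standard sinusoidal position encoding (d_model assumed even), Eq. (1.11)."""
    pos = torch.arange(n_tokens, device=device, dtype=torch.float32).unsqueeze(1)
    i = torch.arange(0, d_model, 2, device=device, dtype=torch.float32)
    angle = pos / (10000.0 ** (i / d_model))
    pe = torch.zeros(n_tokens, d_model, device=device)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe

class SpatialAggregationTransformer(nn.Module):
    def __init__(self, d_model=128, n_head=8, d_ff=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, V_cnn):
        """V_cnn: (1, N, d_model) - one token per pedestrian in the scene."""
        x = V_cnn + positional_encoding(V_cnn.size(1), V_cnn.size(2), V_cnn.device)
        h, _ = self.attn(x, x, x)        # multi-head attention, Eqs. (1.12)-(1.14)
        x = self.ln1(x + h)              # residual connection + layer norm
        return self.ln2(x + self.ff(x))  # feed-forward output, Eq. (1.15)
```

Treating the pedestrians, rather than the time steps, as the token dimension is what lets attention weights express pairwise pedestrian influence directly, which is the aggregation effect described above.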
For the loss function, the sum of the negative log-likelihoods of every point on the predicted pedestrian trajectory is used. The loss of the $i$-th pedestrian is:

$$L^{i} = -\sum_{t=T_{obs}+1}^{T_{pred}} \log \mathbb{P}\!\left(p_t^{i} \,\middle|\, \hat{\mu}_t^{i}, \hat{\sigma}_t^{i}, \hat{\rho}_t^{i}\right) \qquad (1.16)$$

where $\hat{\mu}_t^{i}, \hat{\sigma}_t^{i}, \hat{\rho}_t^{i}$ are the unknown trajectory distribution parameters to be predicted, and $T_{obs}, T_{pred}$ denote the observation and prediction end times, respectively. The sum of the losses of all pedestrians gives the final loss:

$$L = \sum_{i=1}^{N} L^{i} \qquad (1.17)$$

Training the proposed model architecture with this forward loss computation and backward parameter updates completes the training of the model and yields reasonable predicted pedestrian trajectories.
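Assuming the network outputs, for each pedestrian and each predicted step, the five parameters of a bivariate Gaussian (μx, μy, σx, σy, ρ) — consistent with the distributional output described above — the loss of Eqs. (1.16)-(1.17) can be computed as in this sketch. The exp and tanh activations that keep σ positive and ρ in (-1, 1) are common practice and an assumption here.

```python
import math
import torch

def bivariate_gaussian_nll(pred_params, target):
    """pred_params: (N, T_p, 5) raw outputs (mu_x, mu_y, s_x, s_y, r).
    target: (N, T_p, 2) ground-truth points. Returns the summed NLL (Eq. 1.17)."""
    mu_x, mu_y = pred_params[..., 0], pred_params[..., 1]
    sx = torch.exp(pred_params[..., 2])       # sigma_x > 0
    sy = torch.exp(pred_params[..., 3])       # sigma_y > 0
    rho = torch.tanh(pred_params[..., 4])     # correlation in (-1, 1)

    dx = (target[..., 0] - mu_x) / sx
    dy = (target[..., 1] - mu_y) / sy
    one_m_r2 = 1.0 - rho ** 2

    # log of the bivariate normal density at the ground-truth point
    log_p = (-(dx ** 2 + dy ** 2 - 2.0 * rho * dx * dy) / (2.0 * one_m_r2)
             - torch.log(2.0 * math.pi * sx * sy * torch.sqrt(one_m_r2)))
    return -log_p.sum()                       # sum over pedestrians and steps
```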
To assess model accuracy and validity, and in line with common trajectory prediction evaluation practice, the Average Displacement Error (ADE) and the Final Displacement Error (FDE) are used as the metrics describing prediction accuracy. The average displacement error is the mean, over every pedestrian in the scene and every prediction time step, of the L2 norm of the error between predicted and true displacement; the final displacement error is the mean, over every pedestrian, of the L2 norm of the error between predicted and true displacement at the final time step:

$$\mathrm{ADE} = \frac{1}{N \, T_p} \sum_{i=1}^{N} \sum_{t=T_{obs}+1}^{T_{pred}} \left\lVert \hat{p}_t^{i} - p_t^{i} \right\rVert_2, \qquad \mathrm{FDE} = \frac{1}{N} \sum_{i=1}^{N} \left\lVert \hat{p}_{T_{pred}}^{i} - p_{T_{pred}}^{i} \right\rVert_2$$

where $p_t^{i}$ denotes the ground-truth trajectory to be predicted, $\hat{p}_t^{i}$ the trajectory output by the model, $T_{pred}$ the final prediction time, and $T_p$ the prediction horizon. The FDE metric averages only the endpoint coordinate errors of each pedestrian in the scene and places no strong requirement on the chosen walking route, whereas the ADE metric sums and averages the coordinate errors at every time step. For both metrics, smaller values indicate trajectories closer to the ground truth and better prediction performance.
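The two metrics, combined with the best-of-K sampling protocol described in the next paragraph, can be computed as in this sketch; the tensor shapes and the per-pedestrian best-of-K convention are assumptions for illustration.

```python
import torch

def ade_fde(samples, gt):
    """samples: (K, N, T_p, 2) sampled predicted trajectories.
    gt: (N, T_p, 2) ground-truth future trajectory.
    Returns best-of-K ADE and FDE (lower is better)."""
    err = torch.linalg.norm(samples - gt.unsqueeze(0), dim=-1)  # (K, N, T_p)
    ade = err.mean(dim=-1).min(dim=0).values.mean()   # average over steps, best sample
    fde = err[..., -1].min(dim=0).values.mean()       # endpoint error, best sample
    return ade.item(), fde.item()
```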
Because the actual output is a probability distribution over trajectories in the 2-D plane, evaluation of trajectory prediction performance commonly samples multiple predictions (e.g. 20) to preserve trajectory diversity and generalization, and takes the predicted trajectory closest to the ground truth as the output trajectory when computing ADE/FDE. Specifically, for the five datasets of ETH and UCY, pedestrian trajectory data are sampled every 0.4 s, and every 20 frames form one data sample: the model is trained and validated by taking the past 8 frames (3.2 s in total) of pedestrian trajectories as input and predicting the pedestrian trajectories of the next 12 frames (4.8 s in total). The model of the present invention is compared with two other algorithms that likewise use graph network models; the comparison results are shown in Table 1, with the best performance marked in red:
Table 1. Comparison of prediction results between this model and mainstream graph network models
As Table 1 shows, the proposed framework achieves a major breakthrough on the endpoint prediction problem: the FDE metric is the best on almost all datasets, and the average ADE and FDE are also the best. Compared with the better of the two graph network algorithms, this model improves FDE by 17%, 21%, 5%, and 12% on ETH, UNIV, ZARA1, and ZARA2 respectively, and improves the average FDE by 16%. These figures show that, by feeding the pedestrian temporal feature vectors into a spatial aggregation Transformer architecture and concentrating on exploiting the features extracted by the spatiotemporal graph neural network and the temporal feature transformation network, the model aggregates the spatial pedestrian interaction features better, achieves a better prediction effect, makes a large gain on FDE, and perceives and expresses pedestrians' spatial interaction more strongly.
Claims (8)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210767796.8A CN114997067B (en) | 2022-06-30 | 2022-06-30 | Track prediction method based on space-time diagram and airspace aggregation transducer network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210767796.8A CN114997067B (en) | 2022-06-30 | 2022-06-30 | Track prediction method based on space-time diagram and airspace aggregation transducer network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114997067A true CN114997067A (en) | 2022-09-02 |
CN114997067B CN114997067B (en) | 2024-07-19 |
Family
ID=83019465
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210767796.8A Active CN114997067B (en) | 2022-06-30 | 2022-06-30 | Track prediction method based on space-time diagram and airspace aggregation transducer network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114997067B (en) |
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113255597A (en) * | 2021-06-29 | 2021-08-13 | 南京视察者智能科技有限公司 | Transformer-based behavior analysis method and device and terminal equipment thereof |
CN113762595A (en) * | 2021-07-26 | 2021-12-07 | 清华大学 | Traffic time prediction model training method, traffic time prediction method and equipment |
CN113837148A (en) * | 2021-11-04 | 2021-12-24 | 昆明理工大学 | Pedestrian trajectory prediction method based on self-adjusting sparse graph transform |
CN114117892A (en) * | 2021-11-04 | 2022-03-01 | 中通服咨询设计研究院有限公司 | Method for predicting road traffic flow under distributed system |
CN114267084A (en) * | 2021-12-17 | 2022-04-01 | 北京沃东天骏信息技术有限公司 | Video recognition method, device, electronic device and storage medium |
CN114638408A (en) * | 2022-03-03 | 2022-06-17 | 南京航空航天大学 | A Pedestrian Trajectory Prediction Method Based on Spatio-temporal Information |
CN114626598A (en) * | 2022-03-08 | 2022-06-14 | 南京航空航天大学 | A Multimodal Trajectory Prediction Method Based on Semantic Environment Modeling |
CN114757975A (en) * | 2022-04-29 | 2022-07-15 | 华南理工大学 | Pedestrian Trajectory Prediction Method Based on Transformer and Graph Convolutional Network |
Non-Patent Citations (2)
Title
---|
ZHE HUANG: "Learning Sparse Interaction Graphs of Partially Detected Pedestrians for Trajectory Prediction", IEEE Robotics and Automation Letters, vol. 7, no. 2, 28 December 2021, pages 1198-1205 *
成星橙: "Research on Pedestrian Trajectory Prediction Based on Transformer and Graph Convolutional Networks" ("基于Transformer与图卷积网络的行人轨迹预测研究"), China Master's Theses Full-text Database, Information Science & Technology, no. 2023, 15 December 2023, pages 138-34 *
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115392595B (en) * | 2022-10-31 | 2022-12-27 | 北京科技大学 | Time-space short-term wind speed prediction method and system based on graph convolution neural network and Transformer |
CN115392595A (en) * | 2022-10-31 | 2022-11-25 | 北京科技大学 | Spatio-temporal short-term wind speed prediction method and system based on graph convolutional neural network and Transformer |
WO2024119489A1 (en) * | 2022-12-09 | 2024-06-13 | 中国科学院深圳先进技术研究院 | Pedestrian trajectory prediction method, system, device, and storage medium |
CN115881286A (en) * | 2023-02-21 | 2023-03-31 | 创意信息技术股份有限公司 | Epidemic prevention management scheduling system |
CN115881286B (en) * | 2023-02-21 | 2023-06-16 | 创意信息技术股份有限公司 | Epidemic prevention management scheduling system |
CN115966313A (en) * | 2023-03-09 | 2023-04-14 | 创意信息技术股份有限公司 | Integrated management platform based on face recognition |
CN115966313B (en) * | 2023-03-09 | 2023-06-09 | 创意信息技术股份有限公司 | Integrated management platform based on face recognition |
WO2024193334A1 (en) * | 2023-03-22 | 2024-09-26 | 重庆邮电大学 | Automatic trajectory prediction method based on graph spatial-temporal pyramid |
CN117523821A (en) * | 2023-10-09 | 2024-02-06 | 苏州大学 | Vehicle multi-modal driving behavior trajectory prediction system and method based on GAT-CS-LSTM |
CN117493424A (en) * | 2024-01-03 | 2024-02-02 | 湖南工程学院 | Vehicle track prediction method independent of map information |
CN117493424B (en) * | 2024-01-03 | 2024-03-22 | 湖南工程学院 | A vehicle trajectory prediction method that does not rely on map information |
CN117933492B (en) * | 2024-03-21 | 2024-06-11 | 中国人民解放军海军航空大学 | Long-term prediction method of ship track based on spatiotemporal feature fusion |
CN117933492A (en) * | 2024-03-21 | 2024-04-26 | 中国人民解放军海军航空大学 | Ship track long-term prediction method based on space-time feature fusion |
CN118629006A (en) * | 2024-06-11 | 2024-09-10 | 南通大学 | A Pedestrian Trajectory Prediction Method Based on Sparse Spatiotemporal Graph Transformer Network |
CN119131893A (en) * | 2024-08-21 | 2024-12-13 | 国家电网有限公司华东分部 | Method, device and equipment for human behavior recognition in power production based on skeleton |
Also Published As
Publication number | Publication date |
---|---|
CN114997067B (en) | 2024-07-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114997067B (en) | Track prediction method based on space-time diagram and airspace aggregation transducer network | |
CN114898293B (en) | A multimodal trajectory prediction method for pedestrian groups crossing the street for autonomous vehicles | |
CN111432015B (en) | A full-coverage task assignment method for dynamic noise environments | |
CN115829171B (en) | Pedestrian track prediction method combining space-time information and social interaction characteristics | |
CN111339867A (en) | A Pedestrian Trajectory Prediction Method Based on Generative Adversarial Networks | |
Yang et al. | Long-short term spatio-temporal aggregation for trajectory prediction | |
CN115331460B (en) | A large-scale traffic signal control method and device based on deep reinforcement learning | |
CN114626598B (en) | A multimodal trajectory prediction method based on semantic environment modeling | |
He et al. | IRLSOT: Inverse reinforcement learning for scene‐oriented trajectory prediction | |
CN118296090A (en) | A trajectory prediction method based on multi-dimensional spatiotemporal feature fusion for autonomous driving | |
CN115659275A (en) | Real-time accurate trajectory prediction method and system in unstructured human-computer interaction environment | |
CN118261051A (en) | A method for constructing a pedestrian and vehicle trajectory prediction model at intersections based on heterogeneous graph networks | |
Liu et al. | Multi-agent trajectory prediction with graph attention isomorphism neural network | |
CN117389333A (en) | Unmanned aerial vehicle cluster autonomous cooperation method under communication refusing environment | |
CN117522920A (en) | Pedestrian track prediction method based on improved space-time diagram attention network | |
CN114723782A (en) | A moving target perception method in traffic scene based on heterogeneous graph learning | |
CN117314956A (en) | Interactive pedestrian track prediction method based on graphic neural network | |
CN116629116A (en) | Two-layer data-driven ship trajectory prediction method and system based on GRU network | |
Wang et al. | Human trajectory prediction using stacked temporal convolutional network | |
Song et al. | Multimodal Model Prediction of Pedestrian Trajectories Based on Graph Convolutional Neural Networks | |
Lin et al. | OST-HGCN: Optimized Spatial–Temporal Hypergraph Convolution Network for Trajectory Prediction | |
Zhou¹ et al. | REGION: Relevant Entropy Graph spatIO-temporal convolutional Network for Pedestrian Trajectory Prediction | |
Wu et al. | Traffic Speed Forecasting using GCN and BiGRU | |
CN116738814A (en) | Mobile robot navigation method based on space-time transducer | |
Liu et al. | Toward Efficient Self-Motion-Based Memory Representation for Visuomotor Navigation of Embodied Robot |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |