CN115527272A - Construction method of pedestrian trajectory prediction model - Google Patents

Construction method of pedestrian trajectory prediction model

Info

Publication number
CN115527272A
CN115527272A (application CN202211253854.1A)
Authority
CN
China
Prior art keywords
pedestrian
spatial
time
matrix
interaction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211253854.1A
Other languages
Chinese (zh)
Inventor
王斌 (Wang Bin)
段安盛 (Duan Ansheng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Normal University
Original Assignee
Shanghai Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Normal University filed Critical Shanghai Normal University
Priority to CN202211253854.1A priority Critical patent/CN115527272A/en
Publication of CN115527272A publication Critical patent/CN115527272A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G06V10/225 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition based on a marking or identifier characterising the area
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a construction method of a pedestrian trajectory prediction model and belongs to the field of pedestrian trajectory prediction. The construction method of the pedestrian trajectory prediction model comprises the following steps: constructing a spatio-temporal attention graph convolutional network model; inputting the temporal embedding and time-stamping it by adding a temporal position encoding vector; inputting the spatial embedding and marking the position of each pedestrian with a spatial position encoding vector; computing attention matrices using an attention mechanism; obtaining a temporal interaction graph representing temporal interactions from the temporal graph input, and a spatial interaction graph representing spatial interactions from the spatial graph input; aggregating the final temporal interaction matrix and the final spatial interaction matrix through a graph convolution network and learning the trajectory representation; and training the model on a dataset to obtain the final pedestrian trajectory prediction model.

Description

Construction method of pedestrian trajectory prediction model
Technical Field
The invention relates to the field of pedestrian trajectory prediction, in particular to a construction method of a pedestrian trajectory prediction model.
Background
Predicting pedestrian trajectories requires modeling the two key dimensions shown in fig. 1: (1) the temporal dimension, in which valid information such as position and speed in a pedestrian's past trajectory is modeled to capture temporal correlations and then predict the pedestrian's next position; (2) the spatial dimension, in which a spatial directed graph is constructed for pedestrians in the same scene at the same time to obtain the spatial interactions between them. Collisions can be avoided by predicting pedestrian trajectories through spatial interactions.
Pedestrian trajectory prediction is a key technology in autonomous driving, and it remains difficult due to the complex interactions between pedestrians and the uncertainty of each pedestrian's future actions. Past work has relied primarily on pedestrian positional relationships to model temporal dependencies or spatial interactions independently, which is not sufficient to represent real-world complexity. One of the main challenges of pedestrian trajectory prediction is the coupled modeling of temporal dependencies and spatial interactions, since the spatial and temporal dynamics of pedestrians depend closely on each other. Specifically, the prior art has either used a temporal model to summarize each pedestrian's time-varying features independently to predict the pedestrian's next position, or a spatial model to model each pedestrian's spatial interactions to predict the walking trajectory. These methods are suboptimal because the information about pedestrians in the temporal and spatial dimensions is not considered jointly.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a construction method of a pedestrian trajectory prediction model.
The purpose of the invention can be realized by the following technical scheme:
a construction method of a pedestrian trajectory prediction model comprises the following steps:
constructing a spatio-temporal attention graph convolutional network model;
inputting the temporal embedding, and time-stamping it by adding a temporal position encoding vector; inputting the spatial embedding, and marking the position of each pedestrian with a spatial position encoding vector;
computing attention matrices using an attention mechanism;
obtaining a temporal interaction graph representing temporal interactions from the temporal graph input, and a spatial interaction graph representing spatial interactions from the spatial graph input;
aggregating the final temporal interaction matrix and the final spatial interaction matrix through a graph convolution network, and learning the trajectory representation;
and training the model on the dataset to obtain the final pedestrian trajectory prediction model.
Optionally, the data set comprises an ETH data set and a UCY data set.
Optionally, the spatial graph represents the spatial interactions of all pedestrians in the scene, and the temporal graph represents the complete trajectory of each pedestrian.
Optionally, a position encoder layer gives the attention mechanism a notion of order in the sequential input/output data.
A pedestrian trajectory prediction model is constructed by the above construction method.
The invention has the beneficial effects that:
the invention relates to a pedestrian prediction model time graph embedding and space graph embedding combined with a position coding mechanism, which is constructed by the invention, so as to solve the problem that an attention module is insensitive to the position. The GCN is used to couple temporal and spatial features to prevent loss of critical spatio-temporal information.
2 the pedestrian prediction model constructed by the invention provides a new decoder, the number of layers of the decoder is less than that of TCNs convolution layers, and the defects that the traditional RNN is affected by gradient elimination and high calculation cost can be avoided.
3 embodiments of the invention the method of the invention was evaluated on a complete pedestrian data set ETH and UCY. On ETH/UCY, this example verifies that the method constructed by the present invention performed better than the most advanced pedestrian trajectory prediction method in the last 4 years, and achieved significant performance improvement (average displacement error of 10%, final displacement error of 21%). Extensive ablation studies were further conducted in the examples of the present invention to demonstrate the superiority of the STAGCN over various temporal and spatial model combinations.
Drawings
The invention will be further described with reference to the accompanying drawings.
FIG. 1 illustrates two key dimensions of pedestrian trajectory prediction in the prior art;
FIG. 2 is the spatio-temporal attention graph convolutional network of the present application;
FIG. 3 is a visualization of trajectories in the present application for a pedestrian side-by-side walking, encountering and turning scenario;
FIG. 4 is a visualization of the trajectory in the pedestrian turning and stationary scenes of the present application.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In an embodiment of the invention, a spatio-temporal attention graph convolutional network (STAGCN) for pedestrian trajectory prediction is disclosed. As shown in fig. 2, an attention mechanism is used to obtain temporal dependencies and spatial interactions. Before applying the attention mechanism, position encodings are added to the temporal graph embedding and the spatial graph embedding to solve the problem that the attention mechanism is insensitive to the order of the input elements. A two-layer graph convolution network then aggregates the temporal dependencies and spatial interactions and learns the trajectory representation. The final trajectory representation is obtained by adding Gaussian noise to the trajectory representation. Given the final trajectory representation, a decoder predicts the parameters of a bivariate Gaussian distribution along the time dimension for future trajectory point prediction.
The embodiment of the invention discloses a construction method of a pedestrian trajectory prediction model, which comprises the following steps:
the formula of the model is preliminarily constructed, and N pedestrians T epsilon {1, \ 8230;, T ∈ in a scene in a period of time are assumed obs ,…,T pred }. At time step t, the position of pedestrian i is represented by a pair of two-dimensional Cartesian coordinates
Figure BDA0003888780560000041
The pedestrian position i =1,2, \ 8230;, N, at time step T =1, \ 8230;, T obs Interest in the problem of predicting future trajectories from time steps T = T obs +1tot=T pred
Given an input trajectory
Figure BDA0003888780560000042
And
Figure BDA0003888780560000043
where D represents the dimension of the 2D Cartesian coordinates, N represents all pedestrians at time step T, T obs Representing the first 8 time steps. A series of time charts are constructed to represent the complete trajectory of each pedestrian. From time step T =1 to T = T obs Each coordinate being connected to form a time graph G tem (V i ,U i ),
Figure BDA0003888780560000044
Represents G tem A node of, and
Figure BDA0003888780560000045
is the coordinate of the ith pedestrian at time step t
Figure BDA00038887805600000414
Figure BDA0003888780560000048
Represents G tem Wherein
Figure BDA0003888780560000049
Representing nodes
Figure BDA00038887805600000410
If connected, they are represented as 1, otherwise they are represented as 0. Due to time dependence, U i The element in (1) is initialized to 1 (Shi et al, 2021). For attention-driven processing of inputs, they are embedded in a higher D-dimensional space through a simple fully connected layer:
e tem =φ(G tem ,W e tem ) (1)
Where phi (.) represents a linear transformation,
Figure BDA00038887805600000411
is the embedding of the time map(s),
Figure BDA00038887805600000412
is the time map embedding weight.
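As an illustration of the temporal graph construction and equation (1), the following numpy sketch builds the all-ones edge set U_i and applies the fully connected embedding. The sizes (T_obs = 8, N = 3, embedding width 64) and the random weight matrix are assumptions for illustration, not trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
T_obs, N, D_in, D_emb = 8, 3, 2, 64   # assumed sizes: 8 observed steps, 3 pedestrians

# Node features of the temporal graph G_tem: the coordinates of each
# pedestrian over the observed time steps, shaped (T_obs, N, D_in).
coords = rng.normal(size=(T_obs, N, D_in))

# Edge set U_i: every element initialized to 1 (full temporal connectivity).
U = np.ones((N, T_obs, T_obs))

# Equation (1): e_tem = phi(G_tem, W_e_tem) via a simple fully connected layer.
W_e_tem = rng.normal(scale=0.1, size=(D_in, D_emb))   # random stand-in weight
e_tem = coords @ W_e_tem                              # (T_obs, N, D_emb)
```

In the patent's model this linear layer is a learned parameter; here it only demonstrates the shape of the embedding.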
The spatial graph represents the spatial interactions of all pedestrians in the scene. At time step t, the coordinates of all pedestrians are connected to form a spatial graph G_spa(V_t, U_t). V_t = {v_t^i | i = 1, …, N} is the node set of G_spa and represents all pedestrians at time step t, where v_t^i is the observed coordinate position (x_t^i, y_t^i). U_t = {u_t^(i,j) | i, j = 1, …, N} is the edge set of G_spa, where u_t^(i,j) denotes whether nodes v_t^i and v_t^j are connected (denoted 1) or disconnected (denoted 0). U_t is initialized as an upper triangular matrix filled with 1s, i.e. the current state is independent of future states (Shi et al., 2021). The spatial graph is likewise embedded into a higher D-dimensional space through a simple fully connected layer:
e_spa = φ(G_spa, W_e^spa)    (2)
where φ(·) denotes a linear transformation, e_spa is the spatial graph embedding, and W_e^spa is the spatial graph embedding weight.
Likewise, the output of the model of the invention for pedestrian i at time t is a D-dimensional final trajectory representation, which is projected back to Cartesian coordinates (x̂_t^i, ŷ_t^i).
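A minimal sketch of the stated edge-set initialization, assuming N = 4 pedestrians; the upper triangular all-ones matrix is the initialization of U_t described above:

```python
import numpy as np

N = 4  # assumed number of pedestrians in the scene
# U_t is initialized as an upper triangular matrix filled with 1s, so entries
# below the diagonal ("future" states in the ordering) start disconnected.
U_t = np.triu(np.ones((N, N)))
```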
The position encoder layer gives the attention mechanism a notion of order in the sequential input/output data. In the model of the invention, the input embedding e_tem is time-stamped by adding a temporal position encoding vector p_tem of the same dimension D at time t, and the embedding e_spa is position-marked for pedestrian i by adding a spatial position encoding vector p_spa of the same dimension D:
ê_tem = e_tem + p_tem    (3)
ê_spa = e_spa + p_spa    (4)
The position encoding vector p is defined using a broad spectrum of sine/cosine functions as follows:
p(k, 2d) = sin(k / 1000^(2d/D))    (5)
p(k, 2d+1) = cos(k / 1000^(2d/D))    (6)
where k represents the position and d the dimension. For each position k of the p-vector there is thus a corresponding sinusoid, with wavelengths spanning a range from 2π to 1000 · 2π. In other words, this allows the model to attend to the order of the sequential data using unique relative positions. In the temporal position encoding vector p_tem, k denotes the position of the pedestrian at time step k in the complete observed trajectory, and the position code encodes the order of the sequential position information using unique relative positions. In the spatial position encoding vector p_spa, k represents the position information of pedestrian k in the spatial graph, and the position code preserves the identity of each pedestrian in the scene using a unique relative position.
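Equations (5)-(6) can be sketched directly in numpy. The base 1000 follows the formulas above (the classic Transformer encoding uses 10000; the value here is taken from the text), and an even dimension D is assumed:

```python
import numpy as np

def position_encoding(num_positions: int, dim: int, base: float = 1000.0) -> np.ndarray:
    """Sinusoidal position encoding: p(k, 2d) = sin(k / base^(2d/D)) and
    p(k, 2d+1) = cos(k / base^(2d/D)) for positions k = 0..num_positions-1."""
    pe = np.zeros((num_positions, dim))
    k = np.arange(num_positions)[:, None]     # positions, shape (K, 1)
    d = np.arange(0, dim, 2)[None, :]         # even dimension indices 2d
    angle = k / np.power(base, d / dim)       # k / base^(2d / D)
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe
```

The same function can produce p_tem (k indexing time steps) and p_spa (k indexing pedestrians), matching the two uses described above.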
To extract temporal dependencies and spatial interactions, an attention mechanism is first employed to compute a temporal attention matrix Att_tem and a spatial attention matrix Att_spa as follows:
Q_tem = φ(ê_tem, W_Q^tem)    (7)
K_tem = φ(ê_tem, W_K^tem)    (8)
Att_tem = softmax(Q_tem K_tem^T / √d_tem)    (9)
Q_spa = φ(ê_spa, W_Q^spa)    (10)
K_spa = φ(ê_spa, W_K^spa)    (11)
Att_spa = softmax(Q_spa K_spa^T / √d_spa)    (12)
where Q_tem, Q_spa and K_tem, K_spa are respectively the queries and keys of the self-attention mechanism, W_Q^tem, W_K^tem, W_Q^spa, W_K^spa are the weights of the linear transformations, and d_tem and d_spa are the dimensions of each query. The factors 1/√d_tem and 1/√d_spa implement a scaled dot product for numerical stability. GPU acceleration may be used to compute the temporal dependencies and spatial interactions of all pedestrians in parallel.
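A minimal numpy sketch of the scaled dot-product attention in equations (7)-(12); the weight matrices here are random placeholders standing in for the learned W_Q and W_K:

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)   # shift for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_matrix(embed: np.ndarray, W_q: np.ndarray, W_k: np.ndarray) -> np.ndarray:
    """Att = softmax(Q K^T / sqrt(d)), with Q = embed @ W_q and K = embed @ W_k."""
    Q, K = embed @ W_q, embed @ W_k
    d = Q.shape[-1]                           # dimension of each query
    return softmax(Q @ K.T / np.sqrt(d))
```

Applied to ê_tem this yields a T_obs × T_obs temporal attention matrix per pedestrian, and applied to ê_spa an N × N spatial attention matrix per time step.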
The temporal attention matrices Att_tem of all pedestrians are stacked into Â_tem, and the spatial attention matrices Att_spa from time step t = 1 to t = T_obs are stacked into Â_spa. Because the number of pedestrians varies between scenes, spatial interactions cannot be learned by convolution along the spatial channel. The stacked temporal attention matrix Â_tem is taken directly as the temporal interaction matrix A_tem, while the stacked spatial attention matrices are fused with a 1 × 1 convolution along the temporal channel to obtain the spatial interaction matrix A_spa.
From A_tem and A_spa, a temporal mask M_tem and a spatial mask M_spa are generated by element-wise thresholding with a hyperparameter α ∈ [0, 1]:
M_tem = 1(σ(A_tem) > α)    (13)
M_spa = 1(σ(A_spa) > α)    (14)
where 1(·) is the indicator function, which outputs 1 if the condition is satisfied and 0 otherwise, and σ is the Sigmoid activation function. An identity matrix D is added to the temporal mask M_tem and to the spatial mask M_spa, respectively, to ensure that the nodes are self-connected. The temporal interaction matrix A_tem is then multiplied element-wise with the self-connection-augmented temporal mask to obtain the final temporal interaction matrix F_tem, and the final spatial interaction matrix F_spa is obtained in the same way:
F_tem = (M_tem + D) ⊙ A_tem    (15)
F_spa = (M_spa + D) ⊙ A_spa    (16)
where ⊙ denotes element-wise multiplication. Thus, a temporal interaction graph G_tem(V_i, F_tem) representing temporal interactions is finally obtained from the temporal graph input, and a spatial interaction graph G_spa(V_t, F_spa) representing spatial interactions is obtained from the spatial graph input.
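Equations (13)-(16) reduce to a threshold, an identity matrix for self-connections, and an element-wise product. A minimal numpy sketch, assuming the interaction matrix is square in its last two dimensions:

```python
import numpy as np

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

def final_interaction(att: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Threshold the sigmoid-activated attention to get a binary mask,
    add the identity for self-connections, then gate element-wise:
    F = (M + D) * A with M = 1(sigma(A) > alpha)."""
    mask = (sigmoid(att) > alpha).astype(att.dtype)
    return (mask + np.eye(att.shape[-1])) * att
```

Entries whose sigmoid-activated attention falls below α are zeroed out, while self-connections (the diagonal) are always retained.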
The graph convolution network (GCN) takes as input a feature matrix representing the attributes of each node and effectively aggregates features in the neighborhood defined by the adjacency matrix. A static binary adjacency matrix is typically used for training a GCN. The entries in the adjacency matrix may, however, be continuous real-valued functions, allowing adaptive and dynamic aggregation of neighbor information. When a GCN is used to encode the states, the interactions between the nodes can easily be modulated by changing the adjacency matrix. The final temporal interaction matrix F_tem and the final spatial interaction matrix F_spa are aggregated by GCNs to learn the trajectory representation. Two GCNs are used: in one branch, F_tem is fed to the network before F_spa, and in the other branch they are fed in the reverse order. The first branch thus generates a temporal trajectory representation Traj_tem, while the other branch produces a spatial trajectory representation Traj_spa. The trajectory representation I is the sum of the final GCN outputs I_temporal and I_spatial:
I_temporal = δ(δ(F_tem × Traj_tem) × F_spa)    (17)
I_spatial = δ(δ(F_spa × Traj_spa) × F_tem)    (18)
I = I_temporal + I_spatial    (19)
where δ is the PReLU activation function.
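A shape-level sketch of equations (17)-(19) in numpy. The axis layout (trajectory embedding as (T, N, D), F_tem as (N, T, T), F_spa as (T, N, N)), the choice of contraction axes, and the use of a single shared embedding `traj` in place of Traj_tem and Traj_spa are assumptions made to make the two aggregation orders concrete; the patent does not fix the exact tensor layout:

```python
import numpy as np

def prelu(x: np.ndarray, a: float = 0.25) -> np.ndarray:
    """PReLU activation delta(.) with slope a on the negative side."""
    return np.where(x > 0, x, a * x)

def stagcn_aggregate(traj, F_tem, F_spa, a=0.25):
    """traj: (T, N, D) embedding; F_tem: (N, T, T) temporal interaction;
    F_spa: (T, N, N) spatial interaction. One branch aggregates over the time
    axis first, the other over the pedestrian axis first; their sum is I."""
    # Branch 1 (eq. 17): temporal aggregation, then spatial aggregation.
    h = prelu(np.einsum('ntu,und->tnd', F_tem, traj), a)
    I_temporal = prelu(np.einsum('tnm,tmd->tnd', F_spa, h), a)
    # Branch 2 (eq. 18): spatial aggregation, then temporal aggregation.
    g = prelu(np.einsum('tnm,tmd->tnd', F_spa, traj), a)
    I_spatial = prelu(np.einsum('ntu,und->tnd', F_tem, g), a)
    return I_temporal + I_spatial      # eq. (19)
```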
In addition, the real motion patterns of pedestrians in changing environments are learned from a large amount of real pedestrian trajectory data. Due to the uncertainty of pedestrian motion, the model of the invention is expected to be able to generate multiple reasonable and realistic trajectories. Yu et al. propose various losses to encourage the network to produce diverse future trajectories, and their method has proven effective (Yu et al., 2020). The multimodal nature of pedestrian motion is modeled following this approach (Mohamed et al., 2020). The input to the decoder is the final trajectory representation I_final, which consists of two parts: the trajectory representation I and added noise Z (as shown in fig. 2). The random Gaussian noise regularizes the model well and improves its robustness. The final trajectory representation is
I_final = Concat(I, Z)    (20)
where Concat(·) is a concatenation operation and Z is random Gaussian noise.
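Equation (20) is plain concatenation along the feature dimension. A sketch with assumed sizes (embedding width 64 and noise dimension 32, matching the configuration described later):

```python
import numpy as np

rng = np.random.default_rng(0)
T_obs, N, D = 8, 3, 64                       # assumed sizes
I = rng.normal(size=(T_obs, N, D))           # trajectory representation from the GCNs
Z = rng.normal(size=(T_obs, N, 32))          # random Gaussian noise, dimension 32
I_final = np.concatenate([I, Z], axis=-1)    # equation (20): Concat(I, Z)
```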
The future trajectory (x_t^i, y_t^i) is assumed to follow a bivariate Gaussian distribution N(μ_t^i, σ_t^i, ρ_t^i). Further, the predicted trajectory is defined as (x̂_t^i, ŷ_t^i), which follows the estimated bivariate distribution N(μ̂_t^i, σ̂_t^i, ρ̂_t^i). Given the final trajectory representation I_final, the decoder predicts the parameters of the bivariate Gaussian distribution along the time dimension. The decoder consists of a convolutional layer and a simple fully connected layer. The decoder is designed this way because it does not suffer from the gradient vanishing and high computational cost of a conventional RNN, and it is smaller than a TCN. The model of the invention is trained to minimize the negative log-likelihood, defined as follows:
L(W) = −Σ_(t=T_obs+1)^(T_pred) log P(x_t^i, y_t^i | μ̂_t^i, σ̂_t^i, ρ̂_t^i)    (21)
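A numpy sketch of the per-point bivariate Gaussian negative log-likelihood that equation (21) sums; the density is the standard bivariate normal with means μ, standard deviations σ, and correlation ρ:

```python
import numpy as np

def bivariate_nll(xy, mu, sigma, rho):
    """Negative log-likelihood of ground-truth points under the predicted
    bivariate Gaussian. xy, mu: (..., 2); sigma: (..., 2), positive;
    rho: (...,), correlation in (-1, 1)."""
    dx = (xy[..., 0] - mu[..., 0]) / sigma[..., 0]
    dy = (xy[..., 1] - mu[..., 1]) / sigma[..., 1]
    one_m_rho2 = 1.0 - rho ** 2
    z = dx ** 2 + dy ** 2 - 2.0 * rho * dx * dy
    log_p = -z / (2.0 * one_m_rho2) - np.log(
        2.0 * np.pi * sigma[..., 0] * sigma[..., 1] * np.sqrt(one_m_rho2))
    return -log_p.sum()
```

In training, the gradient of this loss with respect to the decoder's five outputs per point (μ_x, μ_y, σ_x, σ_y, ρ) would drive the optimization; here the function only evaluates the loss.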
In embodiments of the present invention, the ETH and UCY datasets are used for training and testing. The ETH and HOTEL scenes are contained in the ETH dataset, while the UCY dataset has three different scenes: UNIV, ZARA1 and ZARA2. The data attributes are the frame number, the pedestrian number, and the 2D position of the trajectory coordinates. The model is trained on four of the datasets using a leave-one-out method and tested on the remaining dataset. Similar to Social LSTM (Alahi et al., 2016), a trajectory of 8 time steps (3.2 seconds) is input and the next 12 time steps are predicted.
The standard metrics are the Average Displacement Error (ADE) and the Final Displacement Error (FDE) (Choi and Dariush, 2019; Mangalam et al.; Chai et al., 2020; Kothari et al., 2021). In short, ADE measures the average L2 distance between the predicted trajectory points and all ground-truth future trajectory points, while FDE measures the L2 distance between the final predicted destination and the final ground-truth destination.
ADE = (1 / (N · T_pred)) Σ_(i=1)^N Σ_(t=T_obs+1)^(T_pred) ||(x̂_t^i, ŷ_t^i) − (x_t^i, y_t^i)||_2    (22)
FDE = (1 / N) Σ_(i=1)^N ||(x̂_(T_pred)^i, ŷ_(T_pred)^i) − (x_(T_pred)^i, y_(T_pred)^i)||_2    (23)
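Equations (22)-(23) computed in numpy; the predicted and ground-truth trajectories are assumed laid out as (T_pred, N, 2):

```python
import numpy as np

def ade_fde(pred: np.ndarray, gt: np.ndarray):
    """pred, gt: (T_pred, N, 2) predicted and ground-truth future positions.
    ADE: mean L2 error over all pedestrians and time steps;
    FDE: mean L2 error at the final time step only."""
    dist = np.linalg.norm(pred - gt, axis=-1)   # (T_pred, N) per-point errors
    return dist.mean(), dist[-1].mean()
```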
Specifically, the coordinate dimension is 2, and the graph embedding and attention embedding dimensions are 64. The number of self-attention layers is 1. The convolutional network of the spatial encoder consists of 7 convolutional layers with a kernel size of s = 3. The structure of the graph convolution network used in the model is similar to that of Social-STGCNN (Mohamed et al., 2020); the spatial GCN and the temporal GCN each cascade 1 layer. The threshold α is empirically set to 0.5. The dimension of the random Gaussian noise Z is set to 32, and PReLU is used as the nonlinear activation δ(·). The input dimension of the decoder is 64 and its output dimension is 5. The model of the invention is trained on a GeForce RTX 3090 with the Adam optimizer for 300 epochs with a batch size of 128; the initial learning rate is set to 0.01 and decays by a factor of 0.1 every 50 epochs. The method of the invention is implemented in PyTorch.
In this example, the following existing methods are selected as baselines.
SGAN: a pedestrian trajectory prediction method that combines sequence prediction with generative adversarial networks (Gupta et al., 2018).
Sophie: a GAN-based prediction method that considers both the context information of the scene and the path histories of all agents using an attention mechanism (Sadeghian et al., 2019).
Social-BiGAT: a method that models the social interactions of pedestrians in a scene using a recurrent-adversarial architecture and introduces a graph attention network (Kosaraju et al., 2019).
RSBG: presents a new insight into group-based social interaction models, combined with a GCN, for predicting pedestrian trajectories (Sun et al., 2020).
PITF: proposes an end-to-end multi-task learning system that utilizes rich visual features about human behavioral information and its interaction with the surrounding environment (Liang et al., 2020).
Social-STGCNN: a method that models spatial interactions as a graph and uses a time-extrapolator convolutional neural network to predict future steps (Mohamed et al., 2020).
STAR: introduces a new spatial transformer and temporal transformer to capture the spatio-temporal interactions between pedestrians (Yu et al., 2020).
NMMP: presents neural motion message passing for interaction modeling, which can predict future trajectories in a variety of scenarios (Hu et al., 2020).
CARPe: Mendieta and Tabkhi propose a convolutional method for real-time pedestrian path prediction using graph isomorphism networks, combined with an agile convolutional neural network design (Mendieta and Tabkhi, 2021).
AVGCN: a new method for trajectory prediction using a graph convolution network (GCN) based on human attention (Liu et al., 2021).
The embodiments of the present invention report results on the pedestrian trajectory datasets, which are the most widely used benchmarks for the trajectory prediction task, and compare them with other state-of-the-art methods. The results are shown in Table 1 and evaluated using the ADE and FDE metrics. Compared with existing methods, the prediction model constructed by the invention is superior on the ETH and UCY datasets. Especially for the FDE metric, the prediction model constructed by the invention is 20% better on the ETH and UCY datasets than the previous best method, NMMP. For the ADE metric, its mean value over the ETH and UCY datasets exceeds NMMP by 9%.
It can be seen that models that learn spatial interactions using graph representations, such as Social-BiGAT, Social-STGCNN, and STAR-D, are superior to other methods on the UNIV sequence, which mainly contains dense crowd scenes. Interestingly, the method of the present invention outperforms all of the methods described above. STAR-D was the first model to propose coupling temporal dependencies and spatial interactions, and it is superior to methods that model spatial interactions independently, such as Social-BiGAT and Social-STGCNN. Coupling temporal dependencies with spatial interactions through a GCN achieves better performance than STAR-D. The results show that the prediction model constructed by the present invention couples temporal dependencies and spatial interactions more successfully than STAR-D.
TABLE 1 comparison of quantitative methods with baseline methods
In addition, the results of the present invention were further evaluated using Sophie's near-collision percentage (two pedestrians are judged to collide with each other if the distance between them is less than 0.1 m). Table 2 shows the average percentage of pedestrian near collisions across all frames in each UCY/ETH scene. Under this evaluation method (Sadeghian et al., 2019), the prediction model of the invention is far superior to other methods, which means that it can produce more socially and physically acceptable trajectories for each pedestrian.
TABLE 2 average percentage of human collisions in each scene
As shown in Table 3, five different variants were evaluated: (1) STAGCN w/o PE indicates that position encoding is removed from the method of the invention; (2) STAGCN w/o SD indicates that the spatial encoder is removed, so the model captures only temporal dependencies; (3) STAGCN w/o TD indicates that the temporal encoder is removed, so the model captures only spatial interactions; (4) STAGCN w/o Z indicates that the random Gaussian noise is removed; (5) STAGCN is the complete model. The results show that removing the temporal or spatial encoder from the model leads to a significant performance degradation. In particular, the results of STAGCN w/o SD show degradations of 73% in ADE and 72% in FDE, which verifies the contribution of the spatial encoder to the final performance of pedestrian trajectory prediction. Furthermore, the results of STAGCN w/o TD show decreases of 62% in ADE and 61% in FDE, indicating that the temporal encoder is also important for pedestrian trajectory prediction. The results of STAGCN w/o PE show that ADE and FDE degrade by 3% and 9%, respectively, which shows that position encoding removes permutation-type ambiguities and improves the performance of the model. The results of STAGCN w/o Z show that ADE and FDE degrade by 8% and 14%, respectively, which shows that random Gaussian noise regularizes the model better and improves its performance.
TABLE 3 ablation study
Several common interaction scenarios are visualized in fig. 3, where the point at the end of each trajectory represents the start. The method of the present invention is compared to the socialized STGCNN because it learns spatial interactions using graphical representations and learns the parameterized distribution of future trajectories.
The parallel, encounter, stop, turn, and mixed cases in Figs. 3 and 4 were chosen. For scenes in which pedestrians walk side by side, meet, or turn, the trajectory distributions predicted by the two models are visualized. Differently colored regions represent the future trajectory distributions of different pedestrians. The yellow line is the pedestrian's historical trajectory (8 frames) and the red line is the ground truth (12 frames). The blue line is the trajectory predicted by STAGCN (12 frames), and the green line is the trajectory predicted by Social-STGCNN (12 frames). The visualization shows that STAGCN's predicted trajectory (12 frames) stays close to the ground truth (12 frames), while Social-STGCNN's prediction deviates from it considerably, meaning the prediction model of the present invention is more accurate.
Scenes (a) and (e) show two people walking side by side; Social-STGCNN exhibits a deviation problem, while the trajectory predicted by the present invention is consistent with the ground truth. Scene (b) is a complex, crowded scene; Social-STGCNN suffers from overlapping trajectories, while the present model still performs well. Scene (c) shows one person moving toward two people walking side by side; Social-STGCNN exhibits trajectory bias, whereas the present model's predicted trajectory shifts to the right to avoid a collision. Scene (d) shows that the present model handles a pedestrian's turn well, while the Social-STGCNN trajectory deviates. In scene (f), one pedestrian passes a group of people standing still at the bottom; the predicted trajectory shows almost no deviation, indicating that the model captures the fact that the static pedestrians are not influenced by the other pedestrian.
In Fig. 4(a), two pedestrians walk toward each other after turning; Social-STGCNN deviates severely, while STAGCN's predicted trajectory deviates only slightly. Fig. 4(b) shows two people slowing down and stopping; Social-STGCNN produces a large trajectory error, which the present model handles well. In Fig. 4(c), Social-STGCNN is insensitive to a pedestrian who remains stationary for a period of time, whereas the present model accurately recognizes the pedestrian's static state and produces a well-predicted trajectory.
In summary, Social-STGCNN predicts overlapping trajectories that deviate from the ground truth, while STAGCN's predicted trajectories follow the ground truth more closely. In the parallel, encounter, stop, turn, and mixed cases, Social-STGCNN predicts off-track because it models spatial interactions independently of time. In contrast, STAGCN combines spatial interaction with temporal dependency when predicting trajectories, and thereby produces better predicted trajectories.
In the description herein, references to the description of "one embodiment," "an example," "a specific example" or the like are intended to mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The foregoing shows and describes the general principles, essential features, and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are presented in the specification only to illustrate the principles of the present invention; various changes and modifications may be made without departing from the spirit and scope of the present invention, and such changes and modifications fall within the scope of the invention as claimed.

Claims (6)

1. A construction method of a pedestrian trajectory prediction model is characterized by comprising the following steps:
constructing a spatio-temporal attention graph convolutional network model;
inputting a temporal embedding and marking time steps by adding a temporal position encoding vector; inputting a spatial embedding and marking pedestrian positions by adding a spatial position encoding vector;
calculating an attention matrix by means of an attention mechanism;
obtaining a temporal interaction graph representing temporal interactions from the temporal graph input, and obtaining a spatial interaction graph representing spatial interactions from the spatial graph input;
aggregating the final temporal interaction matrix and the final spatial interaction matrix through a graph convolutional network to learn a trajectory representation;
and training the model on a data set to obtain the final pedestrian trajectory prediction model.
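As an illustrative aid (not part of the claims), the steps of claim 1 can be sketched roughly as follows. The sinusoidal encoding, the identity query/key projections, and the dimensions are assumptions made for brevity, not details given in the patent:

```python
import numpy as np

def sinusoidal_pe(length, d_model):
    """One common choice of position encoding (assumed, not specified)."""
    pos = np.arange(length)[:, None]
    i = np.arange(d_model)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

def attention_matrix(x):
    """Self-attention weights over the rows of x: softmax(Q K^T / sqrt(d)),
    with identity Q/K projections for brevity."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)

# Temporal branch: embed one pedestrian's trajectory over T frames,
# add the temporal position encoding, compute a T x T attention matrix.
rng = np.random.default_rng(0)
T, d = 8, 16
temporal_emb = rng.standard_normal((T, d)) + sinusoidal_pe(T, d)
A_time = attention_matrix(temporal_emb)   # temporal interaction graph

# Spatial branch: embed all N pedestrians at one frame, add the spatial
# position encoding, compute an N x N attention matrix.
N = 5
spatial_emb = rng.standard_normal((N, d)) + sinusoidal_pe(N, d)
A_space = attention_matrix(spatial_emb)   # spatial interaction graph

# Graph convolution step: aggregate features with the interaction matrix
# (one GCN-style propagation; learned weights omitted).
H = A_space @ spatial_emb
```

In the claimed method these interaction matrices would be further masked and fused before aggregation, and the whole network would be trained end to end on the trajectory data set.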
2. The method of constructing a pedestrian trajectory prediction model according to claim 1, wherein the data set includes an ETH data set and a UCY data set.
3. The method of constructing a pedestrian trajectory prediction model according to claim 1, wherein the spatial map represents spatial interactions of all pedestrians in the scene, and the temporal map represents a complete trajectory of each pedestrian.
4. The method of claim 1, wherein the step of calculating the attention matrix using an attention mechanism comprises the steps of:
fusing the spatial attention matrices along the time channel with a 1 × 1 convolution to obtain a spatial interaction matrix;
element-wise multiplying the temporal interaction matrix by a temporal mask derived from the self-connection matrix, thereby obtaining the final temporal interaction matrix; the final spatial interaction matrix is obtained in the same way.
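As an illustrative aid (not part of the claims), the two operations of claim 4 can be sketched as follows, under the assumptions that the temporal mask is a lower-triangular (causal) mask derived from the self-connection matrix and that the 1 × 1 convolution over the time channel amounts to a learned weighted sum of per-frame attention matrices; neither assumption is spelled out in the patent:

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 8, 5

# Temporal side: element-wise (Hadamard) product of the raw temporal
# attention matrix with a mask derived from the self-connection matrix.
attn_time = rng.random((T, T))           # raw T x T temporal attention
mask = np.tril(np.ones((T, T)))          # assumed causal mask
A_time_final = attn_time * mask          # final temporal interaction matrix

# Spatial side: per-frame N x N attention matrices fused along the time
# channel by a 1x1 convolution, i.e. a learned weighted sum over frames.
attn_space = rng.random((T, N, N))       # one attention matrix per frame
w = rng.random(T)
w /= w.sum()                             # 1x1 conv weights over time
A_space_final = np.tensordot(w, attn_space, axes=1)  # final N x N matrix
```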
5. A prediction model constructed by the method of constructing a pedestrian trajectory prediction model according to any one of claims 1 to 4.
6. Use of the predictive model of claim 5 in an autonomous driving system.
CN202211253854.1A 2022-10-13 2022-10-13 Construction method of pedestrian trajectory prediction model Pending CN115527272A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211253854.1A CN115527272A (en) 2022-10-13 2022-10-13 Construction method of pedestrian trajectory prediction model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211253854.1A CN115527272A (en) 2022-10-13 2022-10-13 Construction method of pedestrian trajectory prediction model

Publications (1)

Publication Number Publication Date
CN115527272A true CN115527272A (en) 2022-12-27

Family

ID=84701141

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211253854.1A Pending CN115527272A (en) 2022-10-13 2022-10-13 Construction method of pedestrian trajectory prediction model

Country Status (1)

Country Link
CN (1) CN115527272A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115829171A (en) * 2023-02-24 2023-03-21 山东科技大学 Pedestrian trajectory prediction method combining space information and social interaction characteristics



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination