CN114580715A - Pedestrian trajectory prediction method based on a generative adversarial network and a long short-term memory model - Google Patents


Info

Publication number
CN114580715A
CN114580715A
Authority
CN
China
Prior art keywords
layer
data set
network
pedestrian
eth
Prior art date
Legal status
Pending
Application number
CN202210149868.2A
Other languages
Chinese (zh)
Inventor
王翔辰
杨欣
樊江锋
Current Assignee
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics
Priority to CN202210149868.2A
Publication of CN114580715A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 10/00 Administration; Management
    • G06Q 10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent


Abstract

The invention discloses a pedestrian trajectory prediction method based on a generative adversarial network (GAN) and a long short-term memory (LSTM) model. The method selects the ETH pedestrian trajectory data set as the data source, adopts the Social GAN approach, and models each pedestrian's historical trajectory and current position with an LSTM network to predict future trajectories. The generator analyses the features of each pedestrian's historical trajectory with an LSTM; the discriminator extracts input features through several fully connected layers and likewise memorises historical-trajectory features through an LSTM. Because the official ETH data set does not contain person ID labels, the model is trained on the ETH data set together with its supplementary data set, and the common trajectory prediction metrics ADE and FDE are selected as performance evaluation indices.

Description

Pedestrian trajectory prediction method based on a generative adversarial network and a long short-term memory model
Technical Field
The invention relates to the technical field of pedestrian trajectory prediction, and in particular to a pedestrian trajectory prediction method based on a generative adversarial network and a long short-term memory model.
Background
Among moving objects in traffic scenes such as cars and cyclists, pedestrians are the most challenging to predict because their motion is largely unconstrained: they neither travel within prescribed lanes like motor vehicles nor stay within lane boundaries like non-motorised vehicles. Cars and bicycles follow a set of well-defined "road rules", so their motion states are constrained as long as their operators obey those rules. Pedestrians, by contrast, are subject to no comparable laws or regulations governing what trajectory they should follow, and their motion becomes even more complex at intersections without traffic lights or in dense crowds. An effective pedestrian trajectory prediction algorithm is therefore needed to address these challenges.
Predicting a pedestrian's trajectory requires accounting for the many factors that may influence it. In recent years most research has started from the perspective of pedestrian behaviour: for example, studying the interaction mechanism between vehicles and pedestrians by considering how pedestrians react to oncoming cars at unsignalised intersections, or predicting when a pedestrian will cross the street. In addition, online prediction of pedestrian behaviour requires acquiring data from sensors and extracting various cues, for example using machine vision techniques to obtain different types of contextual cues.
(1) Prediction based on static environmental cues:
Some researchers proposed Behavior-CNN, using a neural network to model pedestrian behaviour in crowded scenes and demonstrating its effectiveness. Others learned a weighted sum of ordinary differential equations from historical trajectory information in a fixed scene, proposing a new pedestrian position prediction method and verifying its good performance.
(2) Prediction based on dynamic environmental cues:
Pedestrian behaviour is also affected by other dynamic objects in the scene. Some researchers built a microscopic simulation model to analyse cyclist behaviour at unsignalised intersections. Beyond merely noting the presence of other road users, traffic participants must negotiate right of way to coordinate their behaviour; some researchers provided a new data set for studying traffic-participant behaviour and examined the ways in which drivers communicate with pedestrians to avoid collisions.
(3) Target cue based prediction:
Pedestrians are not attentive to their surroundings at all times, and pedestrian inattention is a frequent cause of traffic accidents. Whether a pedestrian has noticed an approaching vehicle can be judged from the pedestrian's head orientation; a typical method combines the outputs of several discrete orientation classifiers and adds physical constraints and temporal filtering for robustness, yielding a continuous head-orientation estimate. Neural networks can also perform real-time 2D estimation of the full-body skeleton and thereby detect multiple human poses in an image. A pedestrian's whole-body appearance can likewise be used for trajectory prediction, e.g., classifying objects and predicting the trajectory or pose of a particular object class; dense optical-flow features around the pedestrian bounding box can also be used to estimate whether a crossing pedestrian will stop at the roadside.
Disclosure of Invention
In view of the above, the present invention provides a pedestrian trajectory prediction method based on a generative adversarial network and a long short-term memory model, so as to train a new neural network model and improve prediction performance.
In order to achieve the purpose, the invention adopts the following technical scheme:
A pedestrian trajectory prediction method based on a generative adversarial network and a long short-term memory model, comprising the following steps:
step S1, acquiring the ETH data set and the ETH supplementary data set, and matching the two to obtain an explicit trajectory for each person object in the ETH data set, the explicit trajectories serving as the data set for training and testing;
step S2, dividing the training data set obtained in step S1 into a training set and a test set according to a certain proportion, and preprocessing;
step S3, constructing a pedestrian trajectory prediction network comprising a generator network and a discriminator network, the two together forming a generative adversarial network;
step S4, inputting the training set obtained in step S2 into the pedestrian trajectory prediction network constructed in step S3, training the model through multiple rounds of iteration until the loss function converges, and fixing the network parameters;
step S5, performing pedestrian trajectory prediction: inputting the test set obtained in step S2 into the pedestrian trajectory prediction model obtained in step S4 to obtain the prediction result.
Further, the step S1 specifically includes:
step S101, first acquiring the ETH data set, whose annotations consist of a time label, a person ID label and person position coordinates (x, y); then reading the idl label files in the ETH data set as character strings; then extracting the effective information from the idl label files through Python's standard re module; and finally exporting the effective information as a csv table;
step S102, first acquiring the ETH supplementary data set, which contains no additional label data file but only segmented person pictures; then naming each picture in the ETH supplementary data set according to the label information in the ETH data set and sorting all pictures into different folders by person; and finally obtaining each folder's file directory through Python's standard os module;
step S103, matching the csv table obtained in the step S101 and each file directory obtained in the step S102 in a data list in Python one by one to obtain an explicit track of each person object in the ETH data set;
and step S104, dividing the track data into observation track data, prediction track data, a time list and an ID list, and constructing a data set for model training and testing.
Further, the step S2 includes:
step S201, splitting the data set constructed in step S104 into a training set and a test set at a ratio of 4:1;
step S202, normalising all data to between 0 and 1.
Further, the generator network further comprises: the device comprises an encoder, a decoder, a social characteristic embedding layer and a social pooling layer, wherein the encoder and the decoder both comprise a full connection layer and an LSTM layer;
in the generator network, the encoder and the social feature embedding layer are arranged at the front and take the trajectory data as input; the encoder output is passed to the social feature embedding layer, whose output is combined with noise and fed to the decoder, and the decoder output is passed to the discriminator network. Within the encoder, a fully connected layer comes first, followed by several LSTM layers; correspondingly, within the decoder, several LSTM layers come first, followed by a fully connected layer.
Further, in the discriminator network, one fully connected layer is arranged at the front and another at the rear, with several LSTM layers between the two fully connected layers.
Further, the calculation formula of the discriminator network is as follows:
e_i^t = δ(X_i^t, Y_i^t; W_δ)    (1)

h_di^t = LSTM(h_di^{t-1}, e_i^t; W_di)    (2)

h_di = h_di^{t_obs+t_pred}    (3)

s_i = ρ(h_di; W_ρ)    (4)

in formulas (1) to (4), t = 1, …, t_obs, …, t_obs+t_pred; T_i denotes the union of real and fake trajectories; t_obs is the time length of the past trajectory and t_pred the time length of the future trajectory; (X_i, Y_i) are position coordinates; δ is a fully connected layer converting the two-dimensional coordinates into a feature vector, with parameters W_δ; the LSTM layer encodes the feature vector at each time step until t = t_obs+t_pred, yielding the sequence encoding vector h_di, where W_di are the LSTM layer parameters; ρ is a multilayer perceptron with parameters W_ρ, producing the score s_i for the trajectory.
Further, the calculation formula of the encoder is as follows:
e_i^t = μ(x_i^t, y_i^t; W_μe)    (5)

h_ei^t = LSTM(h_ei^{t-1}, e_i^t; W_e)    (6)

in formulas (5) and (6), t = 1, …, t_obs, where t_obs is the time length of the past trajectory; (x_i^t, y_i^t) is the position of pedestrian i at time t; μ is a fully connected layer converting the two-dimensional coordinates into the position feature vector e_i^t, with parameters W_μe; the LSTM layer encodes the feature vector at each time step until t = t_obs, yielding the sequence context vector h_ei^{t_obs}, where W_e are the LSTM layer parameters.
Further, the calculation formula of the decoder is as follows:
c_i = Attention(h_e1^{t_obs}, …, h_eN^{t_obs})    (7)

a_i = φ(c_i; W_φ)    (8)

h_di^{t_obs} = [a_i; z]    (9)

e_i^t = μ(x̂_i^{t-1}, ŷ_i^{t-1}; W_μd)    (10)

(x̂_i^t, ŷ_i^t) = γ(LSTM(h_di^{t-1}, e_i^t; W_d); W_γ)    (11)

in formulas (7) to (11), t = t_obs+1, …, t_obs+t_pred; c_i is the tensor for pedestrian i obtained from the attention module, representing the influence of the other pedestrians in the scene on pedestrian i; φ is a multilayer perceptron with parameters W_φ, and a_i is the resulting influence of the other pedestrians on pedestrian i at time t; concatenating this result with the noise z yields the initial hidden state h_di^{t_obs} of the decoder LSTM; μ is a fully connected layer converting the two-dimensional coordinates (x̂_i^{t-1}, ŷ_i^{t-1}) of the previous time step into a feature vector, with parameters W_μd; the LSTM layer encodes each time step, and the hidden state h_di^t at each step is passed through the multilayer perceptron γ to obtain the predicted coordinates (x̂_i^t, ŷ_i^t) at time t, where W_d are the LSTM decoder parameters and W_γ the parameters of γ.
Further, the total loss function used is:
L = L_GAN(G, D) + λ·L_L2(G)    (12)

in formula (12), L_GAN(G, D) denotes the adversarial loss and L_L2(G) the L2 loss, where

L_GAN(G, D) = E[log D(T_i)] + E[log(1 − D(G(X_i, z)))]    (13)

L_L2(G) = min_k ‖Y_i − Ŷ_i^{(k)}‖_2    (14)

in formulas (13) and (14), λ is a hyperparameter balancing the adversarial loss and the L2 loss; k is a hyperparameter giving the number of samples drawn from the generator's output, with Ŷ_i^{(k)} the k-th sampled predicted trajectory; T_i denotes the union of real and fake trajectories; (X_i, Y_i) are position coordinates; z denotes the noise; and p denotes the number of training rounds.
The invention has the beneficial effects that:
in the invention, the situation that the official ETH data set does not contain the person ID label information is considered, the model is successfully trained by using the ETH data set and the supplementary data set thereof, and common track prediction indexes ADE and FDE are selected as performance evaluation indexes. Experimental results show that the model provided by the invention has better performance in pedestrian trajectory prediction.
Drawings
Fig. 1 is a schematic structural diagram of a pedestrian trajectory prediction network provided in embodiment 1;
FIG. 2 is a schematic flow chart of a pedestrian trajectory prediction process;
fig. 3 is a schematic structural diagram of the generation countermeasure network provided in embodiment 1.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1
Referring to fig. 1-3, the present embodiment provides a pedestrian trajectory prediction method based on a generative adversarial network and a long short-term memory model, which specifically includes the following steps:
step S1, an ETH data set and an ETH supplementary data set are obtained and are matched to obtain the clear track of each person object in the ETH data set.
Specifically, in this embodiment, step S1 is implemented in Python: the parameter import module is first called; then, according to the obtained parameters, the data file path, the save directory and the flag bits of the execution code are determined, and the training hyperparameters are initialised. The imported parameters consist of a time label, a person ID label and person position coordinates (x, y). The data-set processing file contains three functions. The first reads the idl label file as a character string, extracts the useful information with the regular expression module, and exports it as a csv table. The second reads the csv file generated by the first function, obtains the label information in the ETH supplementary data set, matches the two lists entry by entry, and arranges the data into the required format, yielding the explicit trajectory of each person object in the ETH data set. The third stores the training data, dividing the trajectory data into observed trajectory data, predicted trajectory data, a time list and an ID list.
More specifically, in this embodiment, the step S1 specifically includes the following sub-steps:
Step S101: first, the ETH data set is acquired; its annotations consist of a time label, a person ID label and person position coordinates (x, y). The idl label files in the ETH data set are then read as character strings, the effective information in the idl files is extracted through Python's standard re module, and finally the effective information is exported as a csv table.
Specifically, the re module mentioned above is Python's regular expression module, which checks whether a character string matches a given pattern; in this embodiment it is used to read the irregular data in the ETH data set in the required form.
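As an illustration of how the re module can extract the effective information from such a label line, a minimal sketch follows. The line format shown is hypothetical (the patent does not reproduce the exact idl layout), so the patterns would need adapting to the real file:

```python
import re

# Hypothetical idl-style annotation line; the real ETH file layout may differ.
line = '"frame_00001.png": (10.5, 3.2), (12.0, 4.1);'

# Pull out the frame name and every (x, y) coordinate pair.
frame = re.search(r'"([^"]+)"', line).group(1)
points = [(float(x), float(y))
          for x, y in re.findall(r'\(([-\d.]+),\s*([-\d.]+)\)', line)]
```

Each parsed row could then be appended to a csv table with Python's csv module, as step S101 describes.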
Step S102, firstly, an ETH supplementary data set is obtained, wherein no additional tag data file exists in the data set, and only all the figure pictures are segmented;
then, naming each picture in the ETH supplementary data set by referring to label information in the ETH data set, and dividing all the pictures into different folders according to the classification of people;
and finally, acquiring the file directory of each folder through an os module.
Specifically, in this embodiment, the os module is part of Python's standard library.
Step S103, the csv table obtained in step S101 and each file directory obtained in step S102 are matched one by one in a data list in Python to obtain an explicit track of each person object in the ETH data set.
And step S104, dividing the track data into observation track data, prediction track data, a time list and an ID list, and constructing a data set for model training and testing.
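Step S104's windowing of each trajectory into observed and predicted segments, plus time and ID lists, can be sketched as follows. The window lengths (8 observed and 12 predicted frames) are common choices in the trajectory-prediction literature and are assumptions here, not values fixed by the patent:

```python
def split_trajectory(track, t_obs=8, t_pred=12):
    """Split one pedestrian's track into observed/predicted windows.

    `track` is a list of (time, person_id, x, y) tuples sorted by time.
    The defaults t_obs=8, t_pred=12 are illustrative only.
    """
    windows = []
    for start in range(0, len(track) - t_obs - t_pred + 1):
        chunk = track[start:start + t_obs + t_pred]
        obs = [(x, y) for (_, _, x, y) in chunk[:t_obs]]    # observed part
        pred = [(x, y) for (_, _, x, y) in chunk[t_obs:]]   # ground-truth future
        times = [t for (t, _, _, _) in chunk]
        ids = [pid for (_, pid, _, _) in chunk]
        windows.append({"obs": obs, "pred": pred, "times": times, "ids": ids})
    return windows
```

A sliding window like this turns one long track into many training samples, which is why the third function stores four parallel lists.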
Step S2, dividing and preprocessing the data set, which specifically includes the following sub-steps:
Step S201, splitting the data set constructed in step S104 into a training set and a test set at a ratio of 4:1;
Step S202, normalising all data. The normalisation is implemented as a class that reads the maximum and minimum values in the data and, using them as the normalisation basis, scales all data to between 0 and 1. To use the class, the data boundaries are defined and the maximum and minimum identified; the data to be processed can then be normalised by calling the internal normalisation function. When inspecting the final output, denormalisation can be invoked to restore the data to its original scale.
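A minimal sketch of such a min-max normalisation class, with the inverse operation described above (method names are illustrative, not taken from the patent's code):

```python
class MinMaxScaler:
    """Min-max normalisation to [0, 1], with an inverse for final output."""

    def fit(self, data):
        # Record the data boundaries used as the normalisation basis.
        self.lo = min(data)
        self.hi = max(data)
        return self

    def normalize(self, data):
        span = self.hi - self.lo
        return [(v - self.lo) / span for v in data]

    def denormalize(self, data):
        # Restore normalised values to the original scale.
        return [v * (self.hi - self.lo) + self.lo for v in data]
```

In practice the scaler would be fitted on the training set only, so that test-set statistics do not leak into training.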
Step S3, constructing a pedestrian trajectory prediction network, which specifically comprises the following steps:
the pedestrian trajectory prediction network comprises a generator network and a discriminator network, wherein the generator network and the discriminator network form a generation countermeasure network. The generator network further comprises: the device comprises an encoder, a decoder, a social characteristic embedding layer and a social pooling layer, wherein the encoder and the decoder both comprise a full connection layer and an LSTM layer;
specifically, in the generator network, a decoder and a social characteristic embedding layer are arranged in front, the input of the decoder and the output of the social characteristic embedding layer are track data, the output of the decoder and the output of the social characteristic embedding layer are transmitted to the social characteristic embedding layer, the output of the social characteristic embedding layer is combined with noise and input to the decoder, the output of the decoder is transmitted to a discriminator network, wherein in the encoder, a fully connected layer is arranged in front, and then a plurality of LSTM networks are connected, correspondingly, in the decoder, a plurality of LSTM networks are arranged in front, and then a fully connected layer is connected.
Specifically, in the discriminator network, one fully connected layer is arranged at the front and another at the rear, with several LSTM layers between the two.
The above is the general structure of the pedestrian trajectory prediction network provided in the present embodiment.
More specifically, in this embodiment the model is built in the Python main function. The first network model is the encoding LSTM layer, composed of a fully connected layer and an LSTM layer; the fully connected layer takes a four-dimensional input, so that a person object's coordinates and velocity can be fed in conveniently, and converts it to the hidden-layer size before passing it to the LSTM layer. Besides the forward function, an internal LSTM initialisation function is needed to initialise the memory cells and output their initial values; every subsequent LSTM layer carries such an internal function.
The second network model is the social feature embedding layer, composed of three fully connected layers joined by activation functions to prevent vanishing or exploding gradients. The activation function chosen here is ReLU().
The third network is the attention layer. It starts with a fully connected layer, then determines the number of data items in the current data block, and processes the acquired information by matrix multiplication to obtain the attention result.
The fourth network is the decoding layer, used to interpret the velocity prediction produced by the network so that it can then conveniently be used for trajectory prediction.
The fifth network is the decoding LSTM layer, which is not actually used in the code; it is an alternative to the fully connected decoding layer, and if selected it replaces the decoding network.
The sixth network is the discriminator network, the most complex one, serving as the key network that completes the model. It consists of an LSTM layer, an observation-path fully connected layer, a prediction-path fully connected layer, an implicit encoder and a classifier. After entering the forward network, the generator output first passes through the observation LSTM layer and then through the observation fully connected layer, while the input predicted trajectory passes through the prediction fully connected layer. The two parts are then concatenated and fed to the classifier, which must distinguish the generator's data from the validation set's prediction data. The implicit parameters are back-propagated, and a numerical value is output.
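The FC → LSTM → FC discriminator layout described above can be sketched as a plain NumPy forward pass. This is an illustrative sketch only, not the patent's code: the embedding size, hidden size, random initialisation, and single LSTM cell are all assumptions.

```python
import numpy as np

def lstm_step(x, h, c, W, U, b):
    """One LSTM cell step; gates stacked as [i, f, g, o] in W, U, b."""
    H = h.shape[0]
    z = W @ x + U @ h + b
    i = 1.0 / (1.0 + np.exp(-z[:H]))         # input gate
    f = 1.0 / (1.0 + np.exp(-z[H:2 * H]))    # forget gate
    g = np.tanh(z[2 * H:3 * H])              # candidate cell state
    o = 1.0 / (1.0 + np.exp(-z[3 * H:]))     # output gate
    c = f * c + i * g
    return o * np.tanh(c), c

def init_params(rng, embed=8, hidden=16):
    """Random small weights; all sizes are illustrative assumptions."""
    s = 0.1
    return (s * rng.standard_normal((embed, 2)), np.zeros(embed),   # front FC
            s * rng.standard_normal((4 * hidden, embed)),           # LSTM W
            s * rng.standard_normal((4 * hidden, hidden)),          # LSTM U
            np.zeros(4 * hidden),                                   # LSTM b
            s * rng.standard_normal((1, hidden)), np.zeros(1))      # rear FC

def discriminator_score(traj, params):
    """Front FC -> LSTM over time -> rear FC, then a sigmoid score."""
    W_in, b_in, W, U, b, W_out, b_out = params
    h = np.zeros(U.shape[1])
    c = np.zeros(U.shape[1])
    for xy in traj:                      # traj: array of shape (T, 2)
        e = W_in @ xy + b_in             # embed the 2-D point
        h, c = lstm_step(e, h, c, W, U, b)
    s = W_out @ h + b_out                # scalar real/fake logit
    return 1.0 / (1.0 + np.exp(-s))      # probability-like score
```

A real implementation would use a deep-learning framework with trainable parameters; this sketch only mirrors the order of the layers.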
More specifically, in the present embodiment, the calculation formula of the discriminator network is as follows:
e_i^t = δ(X_i^t, Y_i^t; W_δ)    (1)

h_di^t = LSTM(h_di^{t-1}, e_i^t; W_di)    (2)

h_di = h_di^{t_obs+t_pred}    (3)

s_i = ρ(h_di; W_ρ)    (4)

In formulas (1) to (4), t = 1, …, t_obs, …, t_obs+t_pred; T_i denotes the union of real and fake trajectories; t_obs is the time length of the past trajectory and t_pred the time length of the future trajectory; (X_i, Y_i) are position coordinates; δ is a fully connected layer converting the two-dimensional coordinates into a feature vector, with parameters W_δ. The LSTM layer encodes the feature vector at each time step until t = t_obs+t_pred, yielding the sequence encoding vector h_di, where W_di are the LSTM layer parameters. ρ is a multilayer perceptron with parameters W_ρ, producing the score s_i for the trajectory.
More specifically, in the present embodiment, the calculation formula of the encoder is as follows:
e_i^t = μ(x_i^t, y_i^t; W_μe)    (5)

h_ei^t = LSTM(h_ei^{t-1}, e_i^t; W_e)    (6)

In formulas (5) and (6), t = 1, …, t_obs, where t_obs is the time length of the past trajectory; (x_i^t, y_i^t) is the position of pedestrian i at time t; μ is a fully connected layer converting the two-dimensional coordinates into the position feature vector e_i^t, with parameters W_μe. The LSTM layer encodes the feature vector at each time step until t = t_obs, yielding the sequence context vector h_ei^{t_obs}, where W_e are the LSTM layer parameters.
More specifically, in this embodiment, the calculation formula of the decoder is as follows:
c_i = Attention(h_e1^{t_obs}, …, h_eN^{t_obs})    (7)

a_i = φ(c_i; W_φ)    (8)

h_di^{t_obs} = [a_i; z]    (9)

e_i^t = μ(x̂_i^{t-1}, ŷ_i^{t-1}; W_μd)    (10)

(x̂_i^t, ŷ_i^t) = γ(LSTM(h_di^{t-1}, e_i^t; W_d); W_γ)    (11)

In formulas (7) to (11), t = t_obs+1, …, t_obs+t_pred; c_i is the tensor for pedestrian i obtained from the attention module, representing the influence of the other pedestrians in the scene on pedestrian i. φ is a multilayer perceptron with parameters W_φ, and a_i is the resulting influence of the other pedestrians on pedestrian i at time t. Concatenating this result with the noise z yields the initial hidden state h_di^{t_obs} of the decoder LSTM. μ is a fully connected layer converting the two-dimensional coordinates (x̂_i^{t-1}, ŷ_i^{t-1}) of the previous time step into a feature vector, with parameters W_μd. The LSTM layer encodes each time step, and the hidden state h_di^t at each step is passed through the multilayer perceptron γ to obtain the predicted coordinates (x̂_i^t, ŷ_i^t) at time t, where W_d are the LSTM decoder parameters and W_γ the parameters of γ.
Step S4, inputting the training set obtained in step S2 into the pedestrian trajectory prediction network constructed in step S3 and training the model through multiple rounds of iteration until the loss function converges, then fixing the network parameters.
Specifically, in this embodiment the pedestrian trajectory prediction network is trained in the conventional GAN manner, i.e. iteratively with the back-propagation algorithm. More specifically, after the discriminator outputs its result, the discriminator back-propagates first to update its own network parameters; the generator then back-propagates with the discriminator's parameters locked, correcting the generator's network parameters.
Multiple rounds of iterative training are performed until the total loss function converges, and the network parameters are fixed to obtain the pedestrian trajectory prediction model.
Specifically, in this embodiment, the total loss function is used as follows:
L=LGAN(G,D)+λLL2(G) (12)
In formula (12), L_GAN(G, D) denotes the adversarial loss and L_L2(G) denotes the L2 loss, where

L_GAN(G, D) = E_{T_i ~ p_data}[log D(T_i)] + E_{z ~ p(z)}[log(1 − D(G(X_i, z)))]    (13)

L_L2(G) = min_k || Y_i − G(X_i, z)^(k) ||_2    (14)

In formula (13) and formula (14), λ is a hyperparameter that balances the adversarial loss against the L2 loss, and k is a hyperparameter giving the number of samples drawn from the generator. T_i denotes the union of the true and generated trajectories, X_i and Y_i are the position coordinates, z denotes the noise, and p denotes the number of training iterations. L_GAN is in fact a cross-entropy: the discriminator should drive D(T_i) as close to 1 and D(G(X_i, z)) as close to 0 as possible, so the discriminator maximizes L_GAN while the generator, conversely, minimizes it. L_L2 measures the difference between the predicted value and the true value: for each scene, k trajectories are sampled from the generator and the one with the smallest L_L2 is selected as the prediction Ŷ_i. Since the loss is back-propagated only for the prediction that differs least from the ground truth, the L_L2 loss encourages the model to generate as many satisfactory predicted trajectories as possible.
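The k-sample selection of formula (14) can be sketched as follows in NumPy; the random stand-ins for the generator's k samples are illustrative assumptions (in the real model each sample would use a fresh noise z):

```python
import numpy as np

rng = np.random.default_rng(0)
t_prep, k = 12, 20

# Ground-truth future trajectory Y_i and k sampled predictions (random stand-ins).
Y = rng.normal(size=(t_prep, 2))
samples = Y + rng.normal(0, 0.5, size=(k, t_prep, 2))

# L2 distance of every sample to the ground truth, then keep only the best one:
# the loss is back-propagated solely through this minimum-error trajectory.
l2 = np.linalg.norm(samples - Y, axis=-1).sum(axis=-1)   # shape (k,)
best = np.argmin(l2)
variety_loss = l2[best]

assert variety_loss <= l2.mean()   # the selected sample is never worse than average
```

Penalizing only the best of k samples leaves the remaining samples unconstrained, which is what lets the generator keep producing diverse plausible futures.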
Specifically, in prediction mode, the model parameters are first loaded; the previous observation data and prediction data are then combined as the raw data and fed into the prediction code for trajectory prediction. After prediction, the results are restored to the original coordinate range and written to a text file for convenient viewing.
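The restore-and-save step can be sketched as below, assuming min-max normalization statistics were saved during preprocessing (the statistic values, file name, and scene extent are illustrative assumptions):

```python
import numpy as np
import tempfile, os

# Min-max statistics saved during preprocessing (illustrative values).
xy_min, xy_max = np.array([0.0, 0.0]), np.array([640.0, 480.0])

def denormalize(pred):
    """Map predictions from the [0, 1] training range back to the original range."""
    return pred * (xy_max - xy_min) + xy_min

pred = np.array([[0.5, 0.5], [0.6, 0.55]])      # normalized predicted coordinates
restored = denormalize(pred)

# Write the restored trajectory to a text file for convenient viewing.
out_path = os.path.join(tempfile.gettempdir(), "predicted_trajectory.txt")
np.savetxt(out_path, restored, fmt="%.2f", header="x y")
print(restored[0])   # first restored point
```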
Detection indices:
Average displacement error (ADE):

ADE = (1 / (N · t_prep)) · Σ_{i=1}^{N} Σ_{t=t_obs+1}^{t_obs+t_prep} || (x̂_i^t, ŷ_i^t) − (x_i^t, y_i^t) ||_2

Final displacement error (FDE):

FDE = (1 / N) · Σ_{i=1}^{N} || (x̂_i^{t_obs+t_prep}, ŷ_i^{t_obs+t_prep}) − (x_i^{t_obs+t_prep}, y_i^{t_obs+t_prep}) ||_2

where N is the number of pedestrians.
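These two indices follow their standard definitions and are straightforward to compute; a minimal NumPy sketch (the array shapes are assumptions):

```python
import numpy as np

def ade_fde(pred, gt):
    """ADE: mean L2 error over all pedestrians and all predicted time steps.
    FDE: mean L2 error at the final predicted time step only.
    pred, gt: arrays of shape (num_peds, t_prep, 2)."""
    err = np.linalg.norm(pred - gt, axis=-1)     # (num_peds, t_prep)
    return err.mean(), err[:, -1].mean()

gt = np.zeros((3, 12, 2))
pred = np.ones((3, 12, 2))                       # every point off by (1, 1)
ade, fde = ade_fde(pred, gt)
print(ade, fde)  # both equal sqrt(2) here, since the error is constant
```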
In summary, this patent discloses a pedestrian trajectory prediction method based on a generative adversarial network and a long short-term memory model. The method builds a model consisting of a generator and a discriminator, selects the ETH pedestrian trajectory data set as the data source, adopts the Social GAN approach, and models the historical trajectories and current positions of pedestrians with a long short-term memory network (LSTM) to realize pedestrian trajectory prediction. The generator analyzes the features of each pedestrian's historical trajectory using an LSTM; the discriminator extracts features from its input through several fully connected layers and likewise memorizes the features of the historical trajectory through an LSTM network. Because the official ETH data set does not contain person ID tag information, the model is trained using the ETH data set together with its supplementary data set, and the common trajectory prediction indices ADE and FDE are selected as performance evaluation indices. Experimental results show that the model performs well in pedestrian trajectory prediction.
Details not described in the present invention are well known to those skilled in the art.
The foregoing detailed description of the preferred embodiments of the invention has been presented. It should be understood that numerous modifications and variations can be devised by those skilled in the art in light of the above teachings. Therefore, the technical solutions available to those skilled in the art through logic analysis, reasoning and limited experiments based on the prior art according to the concept of the present invention should be within the scope of protection defined by the claims.

Claims (9)

1. A pedestrian trajectory prediction method based on a generative adversarial network and a long short-term memory model, characterized by comprising the following steps:
step S1, acquiring an ETH data set and an ETH supplementary data set, and matching the two to obtain a definite trajectory of each person object in the ETH data set, the definite trajectories serving as the data set for training and testing;
step S2, dividing the data set obtained in step S1 into a training set and a test set according to a certain proportion, and preprocessing the data;
step S3, constructing a pedestrian trajectory prediction network comprising a generator network and a discriminator network, wherein the generator network and the discriminator network form a generative adversarial network;
step S4, inputting the training set obtained in step S2 into the pedestrian trajectory prediction network constructed in step S3 and performing model training, with multiple rounds of iterative training until the loss function converges, and then fixing the network parameters;
step S5, performing pedestrian trajectory prediction: inputting the test set obtained in step S2 into the pedestrian trajectory prediction model obtained in step S4 for prediction, thereby obtaining the prediction result.
2. The method for predicting pedestrian trajectories according to claim 1, wherein the step S1 specifically comprises:
step S101, first acquiring the ETH data set, in which each record consists of a time tag, a person ID tag, and person position point coordinates (x, y); then reading the idl tag file in the ETH data set as character strings, extracting the effective information in the idl tag file with the re (regular expression) module in Python, and finally exporting the effective information as a csv table;
step S102, first acquiring the ETH supplementary data set, which carries no additional tag data file and contains only the segmented pictures of all persons; then naming each picture in the ETH supplementary data set with reference to the tag information in the ETH data set, and sorting all pictures into different folders according to person; finally obtaining the file directory of each folder with the os module in Python;
step S103, matching the csv table obtained in step S101 with each file directory obtained in step S102 one by one in a data list in Python to obtain the definite trajectory of each person object in the ETH data set;
step S104, dividing the trajectory data into observation trajectory data, prediction trajectory data, a time list, and an ID list, and constructing the data set for model training and testing.
3. The method for predicting pedestrian trajectories according to claim 2, wherein the step S2 comprises:
step S201, dividing the data set constructed in step S104 into a training set and a test set according to a 4:1 ratio;
step S202, normalizing all data to values between 0 and 1.
4. The pedestrian trajectory prediction method based on a generative adversarial network and a long short-term memory model according to claim 3, wherein the generator network comprises an encoder, a decoder, a social feature embedding layer, and a social pooling layer, the encoder and the decoder each comprising a fully connected layer and an LSTM layer;
in the generator network, the encoder and the social feature embedding layer are arranged at the front and take the trajectory data as input; the output of the encoder and the output of the social feature embedding layer are transmitted to the social pooling layer, the output of the social pooling layer is combined with noise and input to the decoder, and the output of the decoder is transmitted to the discriminator network; within the encoder, a fully connected layer is arranged first, followed by a plurality of LSTM networks; correspondingly, within the decoder, a plurality of LSTM networks are arranged first, followed by a fully connected layer.
5. The pedestrian trajectory prediction method based on a generative adversarial network and a long short-term memory model according to claim 4, wherein in the discriminator network one fully connected layer is arranged at the front and another at the rear, with a plurality of LSTM networks arranged between the two fully connected layers.
6. The pedestrian trajectory prediction method based on a generative adversarial network and a long short-term memory model according to claim 5, wherein the discriminator network is computed as follows:

T_i = [X_i; Y_i]    (1)

e_i^t = δ(x_i^t, y_i^t; W_δ)    (2)

h_di^t = LSTM(h_di^{t-1}, e_i^t; W_di)    (3)

s_i = ρ(h_di^{t_obs+t_prep}; W_ρ)    (4)

In formula (1) through formula (4), t = 1, ..., t_obs + t_prep; T_i denotes the union of the true and generated trajectories, t_obs is the time length of the past trajectory, t_prep is the time length of the future trajectory, and X_i, Y_i are position coordinates; δ is a fully connected layer that converts the two-dimensional coordinates into a feature vector, and W_δ are the fully connected layer parameters; the LSTM layer encodes the feature vector at each moment until t = t_obs + t_prep, yielding the sequence encoding vector h_di^{t_obs+t_prep}, where W_di are the parameters of the LSTM layer; ρ is a multilayer perceptron with parameters W_ρ, from which the score s_i of the trajectory is obtained.
7. The pedestrian trajectory prediction method based on a generative adversarial network and a long short-term memory model according to claim 5, wherein the encoder is computed as follows:

e_i^t = μ(x_i^t, y_i^t; W_μe)    (5)

h_ei^t = LSTM(h_ei^{t-1}, e_i^t; W_e)    (6)

In formula (5) and formula (6), t = 1, ..., t_obs, where t_obs is the time length of the past trajectory and (x_i^t, y_i^t) is the position of pedestrian i at time t; μ is a fully connected layer that converts the two-dimensional coordinates into a feature vector; e_i^t is the position feature vector and W_μe are the fully connected layer parameters; the LSTM layer encodes the feature vector at each moment until t = t_obs, yielding the sequence context vector h_ei^{t_obs}, where W_e are the parameters of the LSTM layer.
8. The pedestrian trajectory prediction method based on a generative adversarial network and a long short-term memory model according to claim 5, wherein the decoder is computed as follows:

s_i = φ(c_i, h_ei^{t_obs}; W_φ)    (7)

h_di^{t_obs} = [s_i; z]    (8)

e_i^t = μ(x_i^{t-1}, y_i^{t-1}; W_μd)    (9)

h_di^t = LSTM(h_di^{t-1}, e_i^t; W_d)    (10)

(x̂_i^t, ŷ_i^t) = γ(h_di^t; W_γ)    (11)

In formula (7) through formula (11), t = t_obs+1, ..., t_prep; c_i is the tensor for pedestrian i obtained from the attention module, representing the influence of the other pedestrians in the scene on pedestrian i; φ is a multilayer perceptron and W_φ are its parameters; s_i represents the influence of the other pedestrians on pedestrian i at time t, and concatenating this result with the noise z yields the initial hidden state h_di^{t_obs} of the decoder LSTM; μ is a fully connected layer that converts the two-dimensional coordinates (x_i^{t-1}, y_i^{t-1}) of the previous moment into a feature vector, and W_μd are the fully connected layer parameters; the LSTM layer encodes each time step, and the hidden state h_di^t of each step is passed through the multilayer perceptron γ to obtain the predicted coordinates (x̂_i^t, ŷ_i^t) at time t; W_d are the LSTM decoder parameters and W_γ are the parameters of the multilayer perceptron γ.
9. The pedestrian trajectory prediction method based on a generative adversarial network and a long short-term memory model according to claim 5, wherein the total loss function adopted is:

L = L_GAN(G, D) + λ·L_L2(G)    (12)

In formula (12), L_GAN(G, D) denotes the adversarial loss and L_L2(G) denotes the L2 loss, where

L_GAN(G, D) = E_{T_i ~ p_data}[log D(T_i)] + E_{z ~ p(z)}[log(1 − D(G(X_i, z)))]    (13)

L_L2(G) = min_k || Y_i − G(X_i, z)^(k) ||_2    (14)

In formulas (13) and (14), λ is a hyperparameter balancing the adversarial loss against the L2 loss, k is a hyperparameter giving the number of samples drawn from the generator, T_i denotes the union of the true and generated trajectories, X_i and Y_i are the position coordinates, z denotes the noise, and p denotes the number of training iterations.
CN202210149868.2A 2022-02-18 2022-02-18 Pedestrian trajectory prediction method based on generation of confrontation network and long-short term memory model Pending CN114580715A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210149868.2A CN114580715A (en) 2022-02-18 2022-02-18 Pedestrian trajectory prediction method based on generation of confrontation network and long-short term memory model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210149868.2A CN114580715A (en) 2022-02-18 2022-02-18 Pedestrian trajectory prediction method based on generation of confrontation network and long-short term memory model

Publications (1)

Publication Number Publication Date
CN114580715A true CN114580715A (en) 2022-06-03

Family

ID=81773194

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210149868.2A Pending CN114580715A (en) 2022-02-18 2022-02-18 Pedestrian trajectory prediction method based on generation of confrontation network and long-short term memory model

Country Status (1)

Country Link
CN (1) CN114580715A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115456073A (en) * 2022-09-14 2022-12-09 杭州电子科技大学 Generation type confrontation network model modeling analysis method based on long-term and short-term memory


Similar Documents

Publication Publication Date Title
Cai et al. YOLOv4-5D: An effective and efficient object detector for autonomous driving
Li et al. Humanlike driving: Empirical decision-making system for autonomous vehicles
CN106874597B (en) highway overtaking behavior decision method applied to automatic driving vehicle
Ni et al. An improved deep network-based scene classification method for self-driving cars
Ohn-Bar et al. Learning to detect vehicles by clustering appearance patterns
US20190026917A1 (en) Learning geometric differentials for matching 3d models to objects in a 2d image
Yang et al. Crossing or not? Context-based recognition of pedestrian crossing intention in the urban environment
CN106845430A (en) Pedestrian detection and tracking based on acceleration region convolutional neural networks
JP2022537636A (en) Methods, systems and computer program products for media processing and display
CN110991523A (en) Interpretability evaluation method for unmanned vehicle detection algorithm performance
Li et al. The traffic scene understanding and prediction based on image captioning
Abdi et al. In-vehicle augmented reality system to provide driving safety information
CN114030485A (en) Automatic driving automobile man lane change decision planning method considering attachment coefficient
Cai et al. A driving fingerprint map method of driving characteristic representation for driver identification
Oeljeklaus An integrated approach for traffic scene understanding from monocular cameras
Wang et al. FPT: Fine-grained detection of driver distraction based on the feature pyramid vision transformer
CN114580715A (en) Pedestrian trajectory prediction method based on generation of confrontation network and long-short term memory model
Yuan et al. Category-level adversaries for outdoor lidar point clouds cross-domain semantic segmentation
Mirus et al. The importance of balanced data sets: Analyzing a vehicle trajectory prediction model based on neural networks and distributed representations
CN112668662A (en) Outdoor mountain forest environment target detection method based on improved YOLOv3 network
CN117217314A (en) Driving situation reasoning method based on metadata driving and causal analysis theory
Foszner et al. Development of a realistic crowd simulation environment for fine-grained validation of people tracking methods
CN115544232A (en) Vehicle-mounted intelligent question answering and information recommending method and device
CN113688864B (en) Human-object interaction relation classification method based on split attention
CN116863260A (en) Data processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination