CN114445465A - Track prediction method based on fusion inverse reinforcement learning - Google Patents

Track prediction method based on fusion inverse reinforcement learning

Info

Publication number
CN114445465A
CN114445465A (application CN202210189127.7A)
Authority
CN
China
Prior art keywords
scene
path
track
reinforcement learning
strategy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210189127.7A
Other languages
Chinese (zh)
Inventor
杨彪
王姝媛
杨长春
徐黎明
陈阳
吕继东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changzhou University
Original Assignee
Changzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changzhou University filed Critical Changzhou University
Priority to CN202210189127.7A priority Critical patent/CN114445465A/en
Publication of CN114445465A publication Critical patent/CN114445465A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00: Image analysis
    • G06T 7/20: Analysis of motion
    • G06T 7/246: Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T 7/251: Analysis of motion using feature-based methods, e.g. the tracking of corners or segments, involving models
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/044: Recurrent networks, e.g. Hopfield networks
    • G06N 3/045: Combinations of networks
    • G06N 3/048: Activation functions
    • G06T 2207/00: Indexing scheme for image analysis or image enhancement
    • G06T 2207/30: Subject of image; Context of image processing
    • G06T 2207/30241: Trajectory

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of pedestrian trajectory prediction and analysis, and in particular to a trajectory prediction method based on fusion inverse reinforcement learning, which comprises: S1, generating a path reward map and an end-point reward map based on an input observation trajectory and a scene graph; S2, obtaining a path by sampling the strategy learned with an inverse reinforcement learning algorithm; S3, encoding path positions with a fully convolutional network, encoding the scene path with a bidirectional gated recurrent unit, and fusing the scene path with the pedestrian observation trajectory. By introducing the lightweight ENet feature-extraction network, the method reduces the number of algorithm parameters and improves the algorithm's ability to generalize when understanding scenes; by using a scene-based attention mechanism module, scene information and pedestrian observation trajectories are fused more effectively, and the scene-oriented pedestrian trajectory prediction network S2Tirl achieves better results than mainstream algorithms on public datasets and real-world data.

Description

Track prediction method based on fusion inverse reinforcement learning
Technical Field
The invention relates to the technical field of pedestrian trajectory prediction and analysis, in particular to a trajectory prediction method based on fusion inverse reinforcement learning.
Background
With people's growing travel demands, the increasing need for intelligent transportation systems, and the continued development of computer vision, sensor technology, and control theory, pedestrian trajectory prediction has attracted increasing attention. Traditional methods rely only on observed trajectories; as the number of trajectories grows, many trajectories coexist in the same scene and the problem of path selection arises. Unlike traditional pedestrian trajectory prediction methods, the invention encodes the scene with an inverse reinforcement learning method, fuses it with observation trajectory information, and proposes a scene-oriented trajectory prediction network. Facing urban and other complex scenes, traditional methods encode only the observation trajectory and make insufficient use of scene information, so their predicted trajectories generalize poorly to new complex scenes. The scene-information fusion method proposed by the invention trains the ability to handle different elements within a scene, and this ability transfers to new scenes, so the inferred paths generalize better;
Using computer vision techniques, researchers can predict pedestrian trajectories from kinematic features of the observed trajectory such as position and velocity. Researchers have introduced scene features to improve the generalization of trajectory prediction, and the inverse reinforcement learning framework lays a foundation for trajectory prediction across a large number of complex scenes.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: combining inverse reinforcement learning, a semantic segmentation network, and gated recurrent units to encode the scene-path and pedestrian-trajectory information separately, and to output a final predicted trajectory after fused encoding.
The technical scheme adopted by the invention is as follows: a track prediction method based on fusion inverse reinforcement learning comprises the following steps:
s1, generating a path reward map and an end point reward map based on the input observation track and the scene graph;
further, the step S1 includes:
s11, introducing an observation track and extracting scene features;
the method comprises the steps that an asymmetric design of an ENet semantic segmentation network structure is utilized, five bottleneck modules including initialization, normal operation, down sampling, up sampling, expansion and asymmetry are adopted, and a scene graph is subjected to coding and decoding processing, so that a scene path is generated, and scene semantic understanding is achieved; meanwhile, the observation track is introduced to calculate the speed v of the intelligent agent and the distance r between the intelligent agent and the scene center to form the motion characteristic phi of the intelligent agentM(ii) a Fusing the image characteristics and the track characteristic information to generate scene characteristics;
s12, generating a reward map;
the multilayer perceptron neural network (MLP) is still perceptron in nature, also called artificial neural network, which may have a plurality of hidden layers in the middle except input and output layers, the simplest neural network only has one hidden layer, namely a three-layer structure, but the complexity is enhanced by the layer design, and there are three types of layers in the neural network, respectively: an input layer, a hidden layer, an output layer; the method comprises two layers of 2D convolution layers and a ReLU activation function, wherein the last layer is a Log-Sigmoid activation function, and scene characteristics and motion characteristics are input into an MLP (Multi-level layer processing) after being connectedpathAnd MLPgoalNetwork, finally generating path reward map and terminal reward map;
s2, taking the path reward map of S12 as an environment, introducing an inverse reinforcement learning algorithm to obtain an optimal strategy, and introducing Gumbel-softmax Trick to sample the strategy;
further, the S2 includes:
s21, calculating an optimal strategy by using inverse reinforcement learning;
after the path reward map and the end point reward map are obtained in step S1, a strategy pi is found and obtained in a grid environment of 21 × 21 size by using an inverse reinforcement learning algorithmθ(as), inverse reinforcement learning can be represented by the markov decision process MDPs, which is essentially a quadruple M ═ { S, a, T, r }, where S is the state space, a represents the action taken, T represents the state transition function T: sxa → S, r represents the reward function, and the state-action sequence τ { (S)1,a1),(s2,a2),...,(sN,aN) The probability of being proportional to the reward value index at maximum entropy, expressed in particular as
Figure BDA0003523911310000031
Where K is a normalization constant, N represents the length of the sequence, r (τ) represents the prize value of the sequence, r(s)i) Expressing a single state-action pair reward value, obtaining a reward function according to an expert example, and generating a strategy algorithm, wherein the aim is to maximize an expert example set T ═ (tau)12,...,τN) The log-likelihood value of (a) is,
Figure BDA0003523911310000032
solving the log-likelihood value L for using a gradient descent methodθIs simplified to
Figure BDA0003523911310000033
Wherein r isθ(τ) sequential reward value, Z, for exploration strategy GenerationθTo explore the strategically learned constants, FπIs the state access frequency, F, of the expert example τθIs strategy piθThe state access frequency of the generation path,
Figure BDA0003523911310000034
and
Figure BDA0003523911310000035
respectively represent the pair likelihood values LθAnd a reward function rθDerivation is carried out;
Figure BDA0003523911310000036
the probability of the action a is selected given a grid position s obtained by back propagation, wherein all grid positions s in the inverse reinforcement learning environment are spThe sum of the path points sgThe sum of the target points, action a is five actions of up, down, left, right and termination;
s22, using GST strategy to sample;
introduction of GST (Gumbel-s)Soft max check) generates a scene path according to the optimal strategy; different from random sampling in the traditional method, firstly sampling is carried out to make discrete probability distribution meaningful, but not only the value with the maximum probability is taken, secondly gradient is calculated, and strategy pi is achieved by introducing GSTθ *Perform a sampling generation action xπThe specific process is xπ=softmax(logπi+Gi) Wherein G isiRepresenting the standard Gumbel distribution of independent and same distribution, the Gumbel distribution is a kind of extreme value distribution, and the cumulative distribution function is
Figure BDA0003523911310000041
GiCan be obtained from a uniform distribution by inverting the Gumbel distribution;
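A minimal sketch of this sampling step (PyTorch and the five toy action probabilities are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def gst_sample(log_pi: torch.Tensor) -> torch.Tensor:
    """x_pi = softmax(log pi_i + G_i), where G_i = -log(-log(U_i)) is a
    standard Gumbel sample obtained by inverting the Gumbel CDF."""
    u = torch.rand_like(log_pi).clamp_min(1e-9)  # U_i ~ Uniform(0, 1), guarded against 0
    g = -torch.log(-torch.log(u))                # standard Gumbel via inverse CDF
    return F.softmax(log_pi + g, dim=-1)         # differentiable, unlike hard argmax

# Five actions: up, down, left, right, stop.
pi = torch.tensor([0.1, 0.2, 0.3, 0.3, 0.1])
x_pi = gst_sample(torch.log(pi))
```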
s3, path position coding is carried out by using a full convolution network, a scene path is coded by fusing a bidirectional gating circulation unit, the scene path and the pedestrian observation track are fused, and the track prediction effect is improved;
further, the step S3 includes:
s31, a full convolution network;
sending the position code into a fusion bidirectional gate control circulation unit (BiGRU) through a full convolution network for time sequence coding, receiving an input scene graph with any size, adopting an anti-convolution layer to up-sample a feature graph of the last convolution layer, and restoring the feature graph to the same size of an input image, thereby generating a prediction for each pixel, simultaneously reserving spatial information in an original input image, and finally performing pixel classification on the up-sampled feature graph with odd and even numbers; after a strategy sampling scene path and an observation track are input, the strategy sampling scene path and the observation track are respectively coded and then sent to a scene-based attention mechanism (SBA) module, a prediction track is output through a decoder, a scene path scene _ path is input, a position code is sent to a fusion bidirectional gating circulation unit by using a full convolution network FC1 to carry out time sequence coding to generate a scene path hidden state code h _ scene, for the observation track obs _ traj, the position code is sent to a GRUenc by using an FC2 to carry out time sequence coding to generate an observation track hidden state code h _ obs, and h _ scene can be obtained as BiGRU (FC)1(scenepath),wBi) And h _ obs ═ GRUenc(FC2(obstraj),wGRU) Wherein w isBiAnd wGRULearnable parameters of BiGRU () and GRUenc () are respectively represented, the full convolution networks FC1 and FC2 are responsible for position coding, and a scene path is coded by using a fusion bidirectional gating circulation unit, so that the priori performance of the scene path is better exerted, and the track prediction effect is improved;
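A minimal PyTorch sketch of these two encoders follows; the hidden size and the use of a per-step linear layer for the FC position encoding (equivalent to a 1 × 1 convolution over the sequence) are assumptions:

```python
import torch
import torch.nn as nn

class PathTrajEncoders(nn.Module):
    """Sketch of S31: FC1/FC2 embed (x, y) positions per time step;
    a BiGRU encodes the sampled scene path and a GRU encodes the
    observed trajectory:
    h_scene = BiGRU(FC1(scene_path), w_Bi),
    h_obs   = GRU_enc(FC2(obs_traj), w_GRU)."""
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.fc1 = nn.Linear(2, hidden)   # position encoding of the scene path
        self.fc2 = nn.Linear(2, hidden)   # position encoding of the observed track
        self.bigru = nn.GRU(hidden, hidden, bidirectional=True, batch_first=True)
        self.gru_enc = nn.GRU(hidden, hidden, batch_first=True)

    def forward(self, scene_path, obs_traj):
        h_scene, _ = self.bigru(self.fc1(scene_path))
        h_obs, _ = self.gru_enc(self.fc2(obs_traj))
        return h_scene, h_obs

enc = PathTrajEncoders()
h_scene, h_obs = enc(torch.randn(1, 20, 2), torch.randn(1, 8, 2))
```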
s32, performing time sequence coding on a threshold Recurrent Unit (GRU);
the threshold cycle unit is a variant of a Recurrent Neural Network (RNN), the calculated amount of the threshold cycle unit is smaller, each threshold cycle unit comprises an update gate for controlling information transmission, and the information of the current time and the previous time is processed, so that the current state is controlled, the reset gate is similar to the update gate, but the dependency of the reset gate on the previous time is controlled, therefore, the threshold cycle unit can effectively capture the long-short term relationship of the sequence, and is more suitable for solving the problem of dynamic identification;
s33, fusing a scene-based attention mechanism module;
after the scene path is obtained, introducing a better fusion track and scene information of the attention System (SBA) of the scene by using a multi-head attention system in a transform framework, on the premise of fully retaining the characteristics of the scene path, observing the correlation between the track and the scene path, inputting t-n moment observation track hidden state code h _ obs (t-n) and scene path hidden state code h _ scene, obtaining the code h _ obs (t-n +1) of the next moment,
Figure BDA0003523911310000051
get h _ obst=n+1Then, GRU is useddecEncoding timing information and decoding using a position decoder FC3
Figure BDA0003523911310000052
Output predicted position at that time:
Figure BDA0003523911310000053
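A minimal PyTorch sketch of this attention step; reducing the multi-head mechanism to a single head and the hidden size of 64 are simplifying assumptions:

```python
import torch
import torch.nn.functional as F

def sba_step(h_obs_t: torch.Tensor, h_scene: torch.Tensor) -> torch.Tensor:
    """h_obs_{t+1} = softmax(h_obs_t @ h_scene^T) @ h_scene:
    the observed-track state attends over all scene-path states."""
    attn = F.softmax(h_obs_t @ h_scene.transpose(-2, -1), dim=-1)
    return attn @ h_scene

h_scene = torch.randn(1, 20, 64)   # scene-path hidden states (20 path steps)
h_obs_t = torch.randn(1, 1, 64)    # current observed-track hidden state
h_obs_next = sba_step(h_obs_t, h_scene)
# GRU_dec followed by the position decoder FC3 would map h_obs_next to (x, y).
```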
the invention has the beneficial effects that:
1. by introducing the light-weight characteristic extraction ENet network, the algorithm parameter quantity is reduced, and the generalization capability of the algorithm understanding scene is improved;
2. GST sampling is used for strategies, so that the real probability distribution of the strategies can be correctly reflected, and the problem of insufficient robustness of random sampling strategies is solved;
3. by using the attention mechanism module of the scene, scene information and pedestrian observation tracks are better fused, and compared with a mainstream algorithm, the scene-oriented pedestrian track prediction network S2Tirl obtains better effects on a public data set and actual data.
Drawings
FIG. 1 is a flow chart of a pedestrian trajectory prediction system based on fusion inverse reinforcement learning according to the present invention;
FIG. 2 is a schematic diagram of the generation of a reward map using an ENet network architecture as proposed in the present invention;
FIG. 3 is a schematic diagram comparing random sampling with the GST strategy sampling proposed in the present invention;
FIG. 4 is a schematic diagram of a scene-based attention mechanism module fusing scene and pedestrian observation trajectory path generation as proposed in the present invention.
Detailed Description
The invention will be further described with reference to the accompanying drawings and examples, which are simplified schematic drawings and illustrate only the basic structure of the invention in a schematic manner, and therefore only show the structures relevant to the invention.
The method comprehensively considers scene-path and pedestrian-trajectory information, uses hidden scene information to increase prediction precision, and integrates the two to output a final predicted trajectory;
As shown in fig. 1, which is a flow chart of the pedestrian trajectory prediction system based on fusion inverse reinforcement learning, a trajectory prediction method based on fusion inverse reinforcement learning includes the following steps:
Through inverse reinforcement learning, the pedestrian's motion pattern is understood accurately from the pedestrian's motion trajectory, and scene features are introduced at the same time, so that the motion of pedestrians around the current vehicle is predicted more accurately and the generalization of the trajectory prediction algorithm is improved. The method samples with GST (Gumbel-softmax Trick) and can generate the scene path from the optimal strategy. Finally, the pedestrian trajectory prediction is fused with the scene path: after fusion by the scene-based attention mechanism module, the correlation between the pedestrian trajectory and the scene path is observed on the premise that the scene-path features are fully retained;
the method comprises the following specific operation steps:
FIG. 2 presents a schematic diagram of the generation of a reward map using an ENet network architecture:
s1, generating a path reward map and an end point reward map based on the input observation track and the scene graph;
s11, introducing an observation track and extracting scene features;
the motion characteristics phi of the intelligent agent are formed by introducing parameters such as the observation track calculation speed v of the intelligent agent, the distance r between the intelligent agent and the scene center and the likeMGenerating motion characteristics, after acquiring pedestrian motion characteristics, in order to improve accuracy, adding scene graph information at the same time, utilizing asymmetric design of a semantic segmentation network ENet network structure, comprising five bottleneck modules of initialization, convention, down sampling, upper adoption, expansion and asymmetry, enhancing the characteristic information of each target in an image by inputting a scene graph, improving the acquisition of effective information by a subsequent semantic segmentation network, and reducing the influence of image noise on the network, thereby improving the identification accuracy of the target, building a semantic segmentation network of a road scene, after accurately acquiring the global information and the local information of each target, distinguishing according to different semantic meanings expressed by each pixel in a scene image, classifying each pixel in the image into image blocks with the size of 168 x 168 pixels at the last moment position in an observation track, ENetfeatUsing the initialization module of ENet and the first three layers of bottleneck blocks, the image with 3 × 168 × 168 of dimension is converted into a feature map phi of 128 × 21 × 21BFusing the feature information of each stage to generate a scene feature phiB
S12, generating the reward map
The scene feature Φ_B generated in step S11 and the motion feature Φ_M are concatenated and input into the MLP_path and MLP_goal networks, which generate the path reward map and the end-point reward map respectively. A neural-network multilayer perceptron (MLP) is essentially a perceptron whose complexity is enhanced through the design of its layers; there are three types of layers in a neural network, namely the input layer, hidden layers, and the output layer. Each head comprises two 2D convolution layers and a ReLU activation function, and the last layer is a Log-Sigmoid activation function. The MLP_path network provides a reward for each path-generating action, and MLP_goal provides a reward for terminating path generation, used to determine the position at which the actions end. The invention pre-trains the ENet_feat network model parameters on the Aeroscapes dataset, which accelerates the convergence of scene reward-map training;
Fig. 3 shows a comparison between random sampling and Gumbel sampling:
s21, calculating an optimal strategy by using inverse reinforcement learning;
after the path reward map and the end point reward map are obtained in the step S1, a reverse reinforcement learning algorithm is used for collecting a batch of pedestrian tracks in a grid environment with the size of 21 x 21, the forms of return functions are deduced after the pedestrian tracks are obtained, then behavior strategies are optimized according to the return functions, and finally the strategy pi is obtained through explorationθ(as), in the method of reasoning the return function, the maximum entropy reverse reinforcement learning is a probabilistic thinking reasoning mode, a sampling method is adopted, a model-free reinforcement learning algorithm is used for learning an optimal strategy under the current reward setting, then the strategy is used for collecting { Tj } to carry out unbiased estimation, on the other hand, the learning of the model-free reinforcement learning algorithm is carried out, an inaccurate strategy is used for similar estimation gradient, importance sampling is used for overcoming the deviation problem, an approximation iterative algorithm is introduced to calculate the optimal strategy piθUsing the formula: piθ(a | s) ═ exp (Q (s, a) -v (s)), where the state value function v(s) represents the desired prize for state s, and the state-action pair (s, a) is the desired prize; the strategy represents the probability of selecting action a given a grid location s, where all nets in the inverse reinforcement learning environmentGrid position s is spThe sum of the path points sgThe sum of the target points, action a is five actions of up, down, left, right and termination, and the action value Q is the state of the five actions(n)(s, a) is assigned, and after operation, the next wheel state function V is assigned(n-1)(s), finally obtaining a given grid position s, adopting the probability of each action a, and obtaining a strategy pi after N cyclesθN is determined by the policy generation path length;
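The sketch below shows one way such an approximate iteration (soft value iteration) can be realized in NumPy; the wrap-around boundary handling via np.roll and the random toy reward are simplifying assumptions:

```python
import numpy as np

def soft_value_iteration(reward: np.ndarray, n_iters: int) -> np.ndarray:
    """Iterate Q(s,a) = r(s) + V(s'), V(s) = log sum_a exp(Q(s,a)) on the
    21 x 21 grid, then return pi_theta(a|s) = exp(Q(s,a) - V(s))."""
    actions = [(-1, 0), (1, 0), (0, -1), (0, 1), (0, 0)]  # up, down, left, right, stop
    V = np.zeros_like(reward)
    for _ in range(n_iters):
        # Q for each action: reward at s plus value of the cell the action moves to
        # (np.roll wraps at the border, a simplification of true grid dynamics).
        Q = np.stack([reward + np.roll(V, (dy, dx), axis=(0, 1)) for dy, dx in actions])
        m = Q.max(axis=0)
        V = m + np.log(np.exp(Q - m).sum(axis=0))         # stable log-sum-exp backup
    return np.exp(Q - V)                                  # pi_theta, shape (5, 21, 21)

pi_theta = soft_value_iteration(np.random.rand(21, 21) - 1.0, n_iters=40)
```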
s22, using GST strategy to sample;
introducing a GST (Gumbel-softmax Trick) method to sample from discontinuous probability distribution, generating a scene path according to an optimal strategy to improve sampling efficiency, and firstly, carrying out strategy piθSampling to obtain process xπ=softmax(logπi+Gi) Wherein G isiRepresenting a standard Gumbel distribution of independent equal distributions, the Gumbel distribution being a kind of extreme value distribution whose cumulative distribution function is
Figure BDA0003523911310000081
GiIt can be obtained from a uniform distribution by inverting the Gumbel distribution: gi=-log(-log(Ui)),UiThe discrete probability distribution is meaningful rather than only taking the value with the maximum probability, then the gradient needs to be calculated, and GST can correctly reflect the real probability distribution of the strategy, so that the problem of insufficient robustness of random sampling of the optimal strategy is solved;
FIG. 4 is a schematic diagram of the fusion of scene and pedestrian observation trajectory path generation by the SBA module;
s31 full convolution network
After a strategy sampling Scene path and an observation track are input, the strategy sampling Scene path and the observation track are respectively sent to a Scene-Based Attention mechanism (SBA) module after being coded, a prediction track is output through a decoder, a Scene path Scene _ path is input, the position code is sent to a bidirectional BiGRU by utilizing a full convolution network FC1 to carry out time sequence coding to generate a Scene path hidden state h _ Scene, and the observation path hidden state h _ Scene is obtained by observingThe trajectory obs _ traj is encoded by FC2, and then sent to GRUenc for time-series encoding to generate the motion hidden state h _ obs, so as to obtain h _ scene ═ BiGRU (FC)1(scenepath),wBi) And h _ obs ═ GRUenc(FC2(obstraj),wGRU) Wherein w isBiAnd wGRULearnable parameters respectively representing BiGRU () and GRUenc (), the full convolution networks FC1 and FC2 are responsible for position coding, and the BiGRU is used for coding the scene path, so that the priori performance of the scene path can be better exerted, and the track prediction effect can be improved;
s32, threshold cycle unit time sequence coding
The threshold cycle Unit (GRU) is a variant of Recurrent Neural Network (RNN) and has smaller calculation amount, each threshold cycle Unit comprises an update gate for controlling information transmission to process the information at the current moment and the previous moment so as to control the current state, the reset gate is similar to the update gate, but the side of the reset gate controls the dependency degree of the previous moment, therefore, the threshold cycle Unit can effectively capture the long-short term relation of the sequence and is more suitable for solving the problem of dynamic identification, a formula is generated after the time sequence coding, and wBiAnd wGRULearnable parameters of BiGRU () and GRUenc () are respectively represented, the full convolution networks FC1 and FC2 are responsible for position coding, and a bidirectional threshold cycle unit is used for coding a scene path, so that the priori performance of the scene path is better exerted, and the track prediction effect is improved;
s33, fusing a scene-based attention mechanism module;
the method comprises the steps of utilizing a multi-head attention mechanism in a transform framework, introducing a scene-based attention mechanism module to better obtain a fusion track and scene information after obtaining a scene path, inputting a strategy sampling scene path and an observation track, respectively sending the scene path and the observation track into the scene fusion attention mechanism module after encoding, outputting a prediction track through a decoder, and on the premise of fully retaining scene path characteristics, inputting observation track codes h _ obs (t is n) and scene path codes h _ scene to obtain codes h _ obs (t is n +1) of the next moment, wherein h _ obs is softmax (h _ obs)t=n*h_scene) h _ scene, yielding h _ obst=n+1Then, GRU is useddecEncoding timing information and using a position decoder FC3Decoding is carried out
Figure BDA0003523911310000101
Output predicted position at that time:
Figure BDA0003523911310000102
in light of the foregoing description of the preferred embodiment of the present invention, many modifications and variations will be apparent to those skilled in the art without departing from the spirit and scope of the invention. The technical scope of the present invention is not limited to the content of the specification, and must be determined according to the scope of the claims.

Claims (4)

1. A track prediction method based on fusion inverse reinforcement learning is characterized by comprising the following steps:
s1, generating a path reward map and an end point reward map based on the input observation track and the scene graph;
s2, obtaining a path by sampling the strategy by using an inverse reinforcement learning algorithm;
s3, path position coding is carried out by using a full convolution network, a scene path is coded by fusing a bidirectional gating circulation unit, and the scene path and a pedestrian observation track are fused.
2. The trajectory prediction method based on the fusion inverse reinforcement learning of claim 1, wherein the generating of the path reward map and the end point reward map based on the input observation trajectory and the scene graph comprises the following steps:
s11, introducing an observation track and extracting scene features;
encoding and decoding the scene graph by utilizing an ENet semantic segmentation network to generate a scene path; meanwhile, the observation track is introduced to calculate the speed v of the intelligent agent and the distance r between the intelligent agent and the scene center to form the motion characteristic phi of the intelligent agentM(ii) a Image processing methodFusing the information of the features and the track features to generate scene features;
s12, generating a reward map;
connecting scene characteristics and motion characteristics and inputting the scene characteristics and the motion characteristics into an MLPpathAnd MLPgoalAnd the network generates a path reward map and an end point reward map respectively.
3. The method for predicting the track based on the fusion inverse reinforcement learning as claimed in claim 2, wherein the step of obtaining the path by sampling the strategy with the inverse reinforcement learning algorithm comprises the following steps:
s21, calculating an optimal strategy by using inverse reinforcement learning;
after a path reward map and an end point reward map are obtained, a strategy pi is obtained by utilizing an inverse reinforcement learning algorithmθ(as), the sampling probability of the sequence is in direct proportion to the reward of the sequence, which is expressed as
Figure FDA0003523911300000011
Where K is a normalization constant, N represents the length of the sequence, r (τ) represents the prize value for the sequence, r(s)i) Representing a single state-action pair reward value; deriving reward functions from expert examples, generating a policy algorithm with the goal of maximizing the set of expert examples, T ═ τ12,...,τN) The log-likelihood value of (a) is,
Figure FDA0003523911300000021
solving the log-likelihood value L for using a gradient descent methodθIs simplified to
Figure FDA0003523911300000022
Wherein r isθ(τ) sequential reward value, Z, for exploration strategy GenerationθTo explore the strategically learned constants, FπIs the state access frequency, F, of the expert example τθIs strategy piθThe state access frequency of the generation path,
Figure FDA0003523911300000023
and
Figure FDA0003523911300000024
respectively represent the pair likelihood values LθAnd a reward function rθDerivation is carried out;
Figure FDA0003523911300000025
the probability of the action a is selected given a grid position s obtained by back propagation, wherein all grid positions s in the inverse reinforcement learning environment are spThe sum of the path points sgThe sum of the target points, action a is five actions of up, down, left, right and termination;
s22, using GST strategy to sample;
generating a scene path according to the optimal strategy, and introducing GST to strategy piθ *Perform a sampling generation action xπThe specific process is xπ=soft max(logπi+Gi) Wherein, GiA standard Gumbel distribution representing the same distribution independently, the cumulative distribution function of the Gumbel distribution being
Figure FDA0003523911300000026
GiObtained from the uniform distribution by inverting the Gumbel distribution.
4. The method for predicting the trajectory based on fusion inverse reinforcement learning according to claim 1, wherein encoding path positions with a fully convolutional network, encoding the scene path with a bidirectional gated recurrent unit, and fusing the scene path with the pedestrian observation trajectory comprises the following steps:
s31, a full convolution network;
the position coding is sent to a fusion bidirectional gating circulating unit through a full convolution network for time sequence coding, after a strategy sampling scene path and an observation track are input, the position coding is sent to a scene-based attention mechanism module through coding, a prediction track is output through a decoder, a scene path scene _ path is input, and the position coding is carried out by utilizing the full convolution network FC1The codes are sent into a fusion bidirectional gating circulation unit to carry out time sequence coding to generate a scene path hidden state code h _ scene, for an observation track obs _ traj, position codes are sent into GRUenc to carry out time sequence coding after FC2 is used to generate an observation track hidden state code h _ obs, and h _ scene is obtained as biGRU (FC)1(scenepath),wBi) And h _ obs ═ GRUenc(FC2(obstraj),wGRU) Wherein w isBiAnd wGRULearnable parameters representing BiGRU () and GRUenc () respectively;
s32, a Gate Recycling Unit (GRU) time sequence coding;
each threshold cycle unit comprises an update gate for controlling information transmission, and processes information at the current time and the previous time so as to control the current state; resetting the long-short term relation of the sequence effectively captured by the threshold cycle unit at the moment before the gate control;
s33, fusing a scene-based attention mechanism module;
inputting t-n time observation track hidden state code h _ obs (t-n) and scene path hidden state code h _ scene, obtaining code h _ obs (t-n +1) of next time,
Figure FDA0003523911300000031
get h _ obst=n+1Then, GRU is useddecEncoding timing information and decoding using a position decoder FC3
Figure FDA0003523911300000032
Output predicted position at that time:
Figure FDA0003523911300000033
CN202210189127.7A 2022-02-28 2022-02-28 Track prediction method based on fusion inverse reinforcement learning Pending CN114445465A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202210189127.7A | 2022-02-28 | 2022-02-28 | Track prediction method based on fusion inverse reinforcement learning (CN114445465A)

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202210189127.7A | 2022-02-28 | 2022-02-28 | Track prediction method based on fusion inverse reinforcement learning (CN114445465A)

Publications (1)

Publication Number Publication Date
CN114445465A true CN114445465A (en) 2022-05-06

Family

ID=81373523

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210189127.7A Pending CN114445465A (en) 2022-02-28 2022-02-28 Track prediction method based on fusion inverse reinforcement learning

Country Status (1)

Country Link
CN (1) CN114445465A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117273225A (en) * 2023-09-26 2023-12-22 西安理工大学 Pedestrian path prediction method based on space-time characteristics
CN117273225B (en) * 2023-09-26 2024-05-03 西安理工大学 Pedestrian path prediction method based on space-time characteristics
CN117473961A (en) * 2023-12-27 2024-01-30 卓世科技(海南)有限公司 Market document generation method and system based on large language model
CN117473961B (en) * 2023-12-27 2024-04-05 卓世科技(海南)有限公司 Market document generation method and system based on large language model
CN117808846A (en) * 2024-02-01 2024-04-02 中国科学院空天信息创新研究院 Target motion trail prediction method and device based on lightweight remote sensing basic model
CN117875535A (en) * 2024-03-13 2024-04-12 中南大学 Method and system for planning picking and delivering paths based on historical information embedding
CN117875535B (en) * 2024-03-13 2024-06-04 中南大学 Method and system for planning picking and delivering paths based on historical information embedding

Similar Documents

Publication Publication Date Title
CN112418409B (en) Improved convolution long-short-term memory network space-time sequence prediction method by using attention mechanism
CN114445465A (en) Track prediction method based on fusion inverse reinforcement learning
CN110458844B (en) Semantic segmentation method for low-illumination scene
CN113688723B (en) Infrared image pedestrian target detection method based on improved YOLOv5
CN108596958B (en) Target tracking method based on difficult positive sample generation
CN110309732B (en) Behavior identification method based on skeleton video
Saxena et al. D-GAN: Deep generative adversarial nets for spatio-temporal prediction
CN112863180A (en) Traffic speed prediction method, device, electronic equipment and computer readable medium
CN113139446B (en) End-to-end automatic driving behavior decision method, system and terminal equipment
CN111652903A (en) Pedestrian target tracking method based on convolution correlation network in automatic driving scene
CN113378775B (en) Video shadow detection and elimination method based on deep learning
CN110599443A (en) Visual saliency detection method using bidirectional long-term and short-term memory network
CN116246338B (en) Behavior recognition method based on graph convolution and transducer composite neural network
CN115331460B (en) Large-scale traffic signal control method and device based on deep reinforcement learning
CN113313123A (en) Semantic inference based glance path prediction method
CN114116944A (en) Trajectory prediction method and device based on time attention convolution network
CN115457081A (en) Hierarchical fusion prediction method based on graph neural network
CN115376103A (en) Pedestrian trajectory prediction method based on space-time diagram attention network
CN115457657A (en) Method for identifying channel characteristic interaction time modeling behaviors based on BERT model
CN115113165A (en) Radar echo extrapolation method, device and system
Li et al. Video prediction for driving scenes with a memory differential motion network model
CN117456449B (en) Efficient cross-modal crowd counting method based on specific information
CN117237411A (en) Pedestrian multi-target tracking method based on deep learning
CN115861664A (en) Feature matching method and system based on local feature fusion and self-attention mechanism
Šarić et al. Dense semantic forecasting in video by joint regression of features and feature motion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination