CN117636651A - Ramp confluence region mixed traffic flow control method based on space-time diagram neural network reinforcement learning

Info

Publication number
CN117636651A
CN117636651A
Authority
CN
China
Prior art keywords
cav
space
network
time
ramp
Prior art date
Legal status
Pending
Application number
CN202311690874.XA
Other languages
Chinese (zh)
Inventor
徐东伟
邱庆伟
高光燕
郭海锋
Current Assignee
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN202311690874.XA
Publication of CN117636651A

Landscapes

  • Traffic Control Systems (AREA)

Abstract

A ramp confluence region mixed traffic flow control method based on space-time diagram neural network reinforcement learning comprises the following steps. Step 1: set up the scene of ramp-merging mixed traffic flow control and define the matrix representation of the environment and of the state information perceived by the agent. Step 2: construct a vehicle graph network for each time step in a given time period to represent the spatial and temporal correlations of the vehicles in the scene. Step 3: define the action space and reward function of the reinforcement learning agent; based on the scene setup, the action space is defined as a hybrid action space of lane changing and acceleration/deceleration, and a target reward function is set to guide the agent to learn the optimal action policy. Step 4: train and test the model; the model is trained using experience replay and a target network, after which the performance of the reinforcement learning agent model is tested under different traffic conditions. The invention realizes efficient control of CAVs in the mixed traffic flow of the ramp-merging traffic bottleneck area.

Description

Ramp confluence region mixed traffic flow control method based on space-time diagram neural network reinforcement learning
Technical Field
The invention belongs to the intersection of intelligent traffic control and graph neural network reinforcement learning, and relates to a ramp confluence region mixed traffic flow control method based on space-time diagram neural network reinforcement learning.
Background
With the continuous development of artificial intelligence and deep learning technology, reinforcement-learning-based control methods for connected and automated vehicles (CAVs) have shown a remarkable effect on improving traffic efficiency and driving experience in real traffic scenes. The CAV is an important component of the Intelligent Transportation System (ITS) and is expected to eventually replace today's human-driven vehicles (HDVs) completely; however, CAVs still have many limitations and can only be deployed on a small scale, so for a long time to come the traffic flow in ITS will be a mixed scene of CAVs and HDVs. How to realize cooperative sensing and driving control of CAVs in mixed traffic flow is therefore an important problem that the intelligent transportation system needs to solve.
Existing CAV control strategies are mainly divided into two categories: rule-based methods and learning-based methods. Traditional rule-based methods include the IDM car-following model and the MOBIL lane-change model, which achieve vehicle control by making the necessary adjustments to their parameters; however, such methods are inflexible, generalize poorly, and cannot be applied to complex, dynamic traffic scenes. With the development of big data and deep learning, rule-based methods are gradually being replaced by learning-based methods. Deep neural networks (DNN) have strong feature-extraction and nonlinear-fitting capability and are widely used in CAV control models, but the performance of a DNN model is easily affected by the quality of the training data. In recent years, deep reinforcement learning (DRL), which learns by trial and error, has become a mainstream CAV control method owing to its good performance and robustness.
Disclosure of Invention
In order to overcome the shortcomings of the prior art in efficiency and robustness, the invention provides a ramp merging area mixed traffic flow control method based on spatio-temporal graph neural network reinforcement learning, which realizes efficient control of CAVs in the mixed traffic flow of the ramp-merging traffic bottleneck area. With the ramp merging area as the application scene, CAV decisions, including lane-change and acceleration/deceleration control, can be generated efficiently, further improving the efficiency of the traffic bottleneck area.
The technical solution adopted by the invention to solve this technical problem is as follows:
A ramp confluence region mixed traffic flow control method based on space-time diagram neural network reinforcement learning comprises the following steps:
Step 1: setting a scene of ramp-merging mixed traffic flow control, and defining the matrix representation of the environment and the state information perceived by the agent; building a ramp merging area traffic scene based on the SUMO simulation platform, setting behavior models and control strategies for HDVs and CAVs respectively, and, according to the input defined by the model, setting the state space as the time series of the vehicles' state information, adjacency matrices, and mask matrices over a certain time period;
Step 2: constructing a vehicle graph network for each time step in a given time period to represent the spatial and temporal correlations of the vehicles in the scene; based on the spatio-temporal graph data of the vehicles, constructing a spatial graph neural network, a temporal graph neural network, and a policy network, and combining them into an overall network architecture;
Step 3: defining the action space and reward function of the reinforcement learning agent; based on the scene setup, defining the action space as a hybrid action space of lane changing and acceleration/deceleration, and setting a target reward function to guide the agent to learn the optimal action policy;
Step 4: training and testing the model; the model is trained using experience replay and a target network, after which the performance of the reinforcement learning agent model is tested under different traffic conditions.
The technical conception of the invention is as follows: complex, dynamically changing traffic scene data are expressed in spatio-temporal graph form to address the interaction and coordination of multiple agents; the spatial correlation features of the vehicles are extracted by a graph convolutional network (GCN), the correlation features of the vehicles in the time dimension are obtained by a gated recurrent unit (GRU), and the aggregated spatio-temporal features serve as the input of the reinforcement learning controller. The controller outputs hybrid CAV lane-change and acceleration/deceleration actions, realizing efficient and safe CAV control in the ramp merging scene.
The beneficial effects of the invention are as follows: a complex traffic scene is constructed as time-series graph data, deep spatio-temporal features among vehicles are fully exploited by combining GCN and GRU networks, and a reinforcement learning framework is used to obtain policies for the actions the vehicles execute in different states. The hybrid action space proposed by the invention is more flexible, more efficient, and generalizes better, so it adapts well to complex tasks and traffic scenes and meets the requirements of practical applications.
Drawings
Fig. 1 is a diagram of a ramp merge scenario.
FIG. 2 illustrates the construction of the graph data in temporal and spatial form.
FIG. 3 is a block diagram of the spatio-temporal graph neural network reinforcement learning model.
FIG. 4 compares the training rewards of different methods, where (a) compares the method of the invention using the hybrid action space with using a single action space, and (b) compares the method of the invention with other methods, all using the hybrid action space.
Fig. 5 compares the rewards in the testing stage under different traffic conditions.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
Referring to fig. 1 to 5, a ramp confluence region mixed traffic flow control method based on space-time diagram neural network reinforcement learning includes the following steps:
Step 1: setting a scene of ramp-merging mixed traffic flow control, and defining the matrix representation of the environment and the state information perceived by the agent, the process being as follows:
step 1.1: setting a task scene.
The traffic scene designed on the SUMO simulation platform is shown in fig. 1; it consists of a main road with three lanes and a ramp with a single lane. The total length of the main road is L, and the merge entrance is located at distance L1 along the main road. The main-road segment before the merge entrance is PreMerge, the ramp segment is OnRamp with length L2, the merging segment Mainline has length L3, and the segment after merging, PostMerge, has length L4. The road speed limit is v_max, the maximum speed of a CAV is v_c, and the maximum speed of an HDV is v_h. Traffic flow is injected into the road network from the main road and the ramp: the main road carries only HDVs, while the ramp carries a mixed flow of HDVs and CAVs; the initial states of the vehicles, including initial speed, initial lane, and initial position, are all random values. The flow rates of the main road and the ramp are denoted f_h and f_m respectively, and the CAV penetration rate is set to a fixed value Rate_cav. The longitudinal and lateral actions of the HDVs are controlled by the IDM (Intelligent Driver Model) car-following model and the LC2013 lane-change model built into SUMO, while the CAVs are controlled by the reinforcement-learning-based agent.
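As an illustration only, the sketch below steps an assumed SUMO configuration of this scene through the TraCI Python interface and reads the raw per-vehicle observations from which the agent state of step 1.2 is built; the configuration file name and the simulation horizon are assumptions, not part of the claimed method.

```python
# Hedged sketch: stepping an assumed SUMO ramp-merging scenario via TraCI and
# reading the raw per-vehicle observations used to build the agent state.
import traci

traci.start(["sumo", "-c", "ramp_merge.sumocfg"])    # hypothetical SUMO configuration file
for step in range(3600):                             # assumed simulation horizon
    traci.simulationStep()
    for veh_id in traci.vehicle.getIDList():
        pos = traci.vehicle.getLanePosition(veh_id)  # longitudinal position on the current lane
        speed = traci.vehicle.getSpeed(veh_id)
        lane = traci.vehicle.getLaneIndex(veh_id)
        # these raw values are normalized and assembled into the state defined in step 1.2
traci.close()
```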
Step 1.2: a state space of the agent is defined.
At each time step, the driving information of vehicle i (HDV or CAV) in the traffic scene is expressed as a vehicle state that includes the current longitudinal position, speed, and lane of vehicle i. The observed longitudinal position and speed of the vehicle are normalized.
For a vehicle on the road segment OnRamp, its position and speed are represented by projecting their actual values onto the corresponding values on the main road.
Lane information is represented using one-hot encoding, whose four categories represent the ramp, the right main-road lane, the middle main-road lane, and the left main-road lane.
A time-series state space is then constructed. With the time period set to T, the state space at time t is represented as S_t = {s_{t-T+1}, ..., s_t}, and the state at each moment is denoted s_t = (X_t, A_t, M_t), where X_t ∈ R^{N×F}, A_t ∈ R^{N×N}, and M_t respectively represent the vehicle state set, the adjacency matrix, and the mask matrix, N represents the maximum number of vehicles in the scene, and F represents the feature dimension of each vehicle. The adjacency matrix A_t is defined as an N×N matrix that models the vehicle interconnection network as an undirected graph; matrix element a_jk = 1 indicates that vehicle j and vehicle k are connected. CAVs are always connected to each other, while a CAV and an HDV are connected only within a certain sensing range.
Step 2: Constructing the spatio-temporal graph network of the traffic scene and extracting spatio-temporal features, the process being as follows:
Step 2.1: A time-series graph network is constructed.
Setting the period to T, the graph data at time t are denoted G_t = {g_{t-T+1}, ..., g_t}, and the graph of one time slice is denoted g_t = {V_t, E_t}, where V_t is the set of vehicle nodes at that time and E_t is the edge matrix describing the correlation between vehicles. For vehicles i and j, the connection relationship is defined from their current positions p_i^t and p_j^t as e_ij^t = 1 if |p_i^t − p_j^t| ≤ ρ and e_ij^t = 0 otherwise,
where ρ represents the sensing range of the vehicle sensors. When vehicles i and j are both CAVs, they can communicate with each other to obtain each other's information without relying on sensing, so e_ij^t is always 1 regardless of the sensing range. The adjacency matrix of the vehicle network graph at each time slice is assembled from these edge indicators e_ij^t.
Finally, the spatio-temporal graph network data of the traffic scene shown in fig. 2 are obtained.
step 2.2: extracting space-time characteristics and constructing a strategy network;
first, the number of layers Z is used m Dimension d m The multi-layer perceptron MLP is used as an encoder, each layer of fully connected network of the MLP uses a ReLU activation function, and a vehicle state set X of each time slice in one period t Coding to obtain node embedded X' t
A graph convolutional network (GCN) containing z_G graph convolutional layers of dimension d_G then extracts the spatial correlation features of the vehicles from the node embeddings, yielding node representations X''_t that contain deep spatial correlation features:
X''_t = σ( D̃^{-1/2} Ã_t D̃^{-1/2} X'_t W + b )
where Ã_t = A_t + I_N is the adjacency matrix with self-loops added, D̃ is the degree matrix of Ã_t, σ denotes the activation function ReLU, W is the weight of the neural network, and b is the bias.
A gated recurrent unit (GRU) network is constructed to extract the temporal features and capture the temporal correlations. The GRU is formulated as:
z_t = σ(W_z X''_t + U_z h_{t-1} + b_z)
r_t = σ(W_r X''_t + U_r h_{t-1} + b_r)
h̃_t = tanh(W_h X''_t + U_h (r_t ⊙ h_{t-1}) + b_h)
h_t = (1 − z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t
where W_z, W_r, W_h, U_z, U_r, U_h are learnable parameters, b_z, b_r, b_h are bias parameters, σ denotes the activation function, X''_t is the GCN output of the current time step, h_{t-1} is the hidden-layer output of the previous time step, and ⊙ denotes the element-wise product.
A policy network P consisting of fully connected layers is constructed to compute the action values Q of the agent; it uses ReLU as the activation function and its output dimension equals the size of the agent's action space. The overall network comprising the encoder, the spatio-temporal feature extraction networks, and the policy network P can be regarded as a Q-value network Q_θ, where θ denotes the set of all learnable weights.
The input of the policy network is the vector formed by concatenating the output of the last GRU step with the node embedding X'_t of the current time step; the optimal action decision a_{t+1} is then obtained by selecting the action with the maximum Q value. The framework of the overall model is shown in fig. 3.
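As an illustration only, a compact PyTorch / PyTorch Geometric sketch of such a Q-value network Q_θ (MLP encoder, GCN, GRU, and policy head) is given below; the class name, the batching scheme, and the simplified policy head are assumptions, while the layer widths follow the embodiment described later where they are stated.

```python
# Hedged sketch of the Q-value network: MLP encoder -> GCN -> GRU -> policy head.
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv

class STGraphQNet(nn.Module):
    def __init__(self, feat_dim=6, action_dim=9):
        super().__init__()
        # embedding-layer encoder: two fully connected layers (32 and 64 units)
        self.encoder = nn.Sequential(nn.Linear(feat_dim, 32), nn.ReLU(),
                                     nn.Linear(32, 64), nn.ReLU())
        self.gcn = GCNConv(64, 64)                           # spatial feature extraction (64 units)
        self.gru = nn.GRU(64, 64, batch_first=True)          # temporal feature extraction (64 units)
        # policy head, simplified; the embodiment lists 3 layers with 128, 63 and 32 hidden units
        self.policy = nn.Sequential(nn.Linear(64 + 64, 128), nn.ReLU(),
                                    nn.Linear(128, 32), nn.ReLU(),
                                    nn.Linear(32, action_dim))

    def forward(self, X_seq, edge_index_seq):
        """X_seq: list of T node-feature tensors [N, F]; edge_index_seq: list of T edge indices."""
        spatial = []
        for X_t, ei_t in zip(X_seq, edge_index_seq):
            X_emb = self.encoder(X_t)                        # node embedding X'_t
            spatial.append(torch.relu(self.gcn(X_emb, ei_t)))  # spatial features X''_t
        H = torch.stack(spatial, dim=1)                      # [N, T, 64]
        _, h_last = self.gru(H)                              # last GRU hidden state [1, N, 64]
        fused = torch.cat([h_last.squeeze(0), self.encoder(X_seq[-1])], dim=-1)
        return self.policy(fused)                            # per-vehicle Q values [N, action_dim]
```

In practice the per-vehicle Q values would typically be masked with M_t so that padded slots are ignored before the maximum-Q action is selected.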
Step 3: Defining the action space and reward function of the reinforcement learning agent, the process is as follows:
step 3.1: defining an action space of the agent.
At each time step, the agent computes the optimal action for the current state based on the current observation. According to the requirements of the task, the CAV action is defined over a hybrid action space combining lane changing and acceleration, expressed as a_i = {R_1, R_2, R_3, K_1, K_2, K_3, L_1, L_2, L_3}, where R, K, and L respectively indicate changing to the right lane, keeping the lane, and changing to the left lane, and subscripts 1, 2, 3 respectively denote accelerating, decelerating, and maintaining speed. Acceleration and deceleration are achieved by changing the acceleration or deceleration value of the vehicle by a fixed step k each time. The action output by the agent is the set {a_1, ..., a_n} of all CAV actions, where n is the number of CAVs at the current moment;
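As an illustration only, the sketch below decodes one of the nine hybrid actions and applies it to a CAV through the TraCI interface; the index ordering, the fixed step k = 1.0 m/s, and the lane-index handling (SUMO numbers lanes from the rightmost lane) are assumptions.

```python
# Hedged sketch: decoding a hybrid lane-change / speed action and applying it via TraCI.
import traci

LANE_CMDS = ["right", "keep", "left"]                # assumed ordering of R, K, L
SPEED_CMDS = ["accelerate", "decelerate", "maintain"]  # assumed ordering of subscripts 1, 2, 3
K = 1.0                                              # assumed fixed speed-change step in m/s

def apply_action(veh_id, action_idx):
    lane_cmd = LANE_CMDS[action_idx // 3]
    speed_cmd = SPEED_CMDS[action_idx % 3]
    lane = traci.vehicle.getLaneIndex(veh_id)
    if lane_cmd == "right" and lane > 0:             # lane 0 is the rightmost lane in SUMO
        traci.vehicle.changeLane(veh_id, lane - 1, duration=1.0)
    elif lane_cmd == "left":
        traci.vehicle.changeLane(veh_id, lane + 1, duration=1.0)
    speed = traci.vehicle.getSpeed(veh_id)
    if speed_cmd == "accelerate":
        traci.vehicle.setSpeed(veh_id, speed + K)
    elif speed_cmd == "decelerate":
        traci.vehicle.setSpeed(veh_id, max(0.0, speed - K))
```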
step 3.2: a reward function of the agent is defined.
In reinforcement learning, the definition of the reward function has an important influence on the quality of the policy learned by the agent. The design of the reward function considers rewards and penalties of different categories, including efficiency R_e, task objective R_g, driving comfort R_c, and safety R_s. The efficiency reward R_e mainly takes the CAV speed as its measure:
R_e = w_0 R_speed + w_1 P_st    (11)
where R_speed and P_st respectively represent the speed reward and the stopping penalty, N represents the number of CAVs in the scene, v_i represents the speed of the CAV numbered i, v_max is the maximum speed set for the CAVs, N_st represents the number of CAVs with a speed below 5 m/s, N_obs represents the number of currently observed CAVs, and w_0, w_1 are weight parameters.
the task in the ramp converging scene is to control the ramp CAV to smoothly converge toMain road, task target R g The formula is as follows:
the index of the driving comfort evaluation is CAV acceleration, and the formula is as follows:
where acc represents the acceleration value at the current time of CAV.
The safety reward R_s takes the CAV lane-change frequency as its evaluation criterion: frequent lane changes are one of the main causes of safety accidents, so CAVs that change lanes frequently are penalized,
where N_dra is the number of CAVs executing a lane-change action and N_obs is the number of currently observed CAVs. At time t, the total reward R_t of the agent is expressed as:
R_t = w_e R_e + w_g R_g + w_c R_c + w_s R_s
where w_e, w_g, w_c, w_s are weight parameters whose values can be set according to the requirements of the task scene;
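As an illustration only, a possible reward computation consistent with the four categories above is sketched below; since the sub-formulas for R_speed, P_st, R_g, R_c, and R_s are not reproduced here, the expressions used are plausible stand-ins and all weight values are assumptions.

```python
# Hedged sketch of a total-reward computation; the sub-expressions are stand-ins.
import numpy as np

W = dict(w0=1.0, w1=0.5, we=1.0, wg=1.0, wc=0.1, ws=0.2)       # assumed weights

def total_reward(cav_speeds, v_max, n_merged, n_obs, accs, n_lane_changes):
    r_speed = np.mean(cav_speeds) / v_max                       # higher average CAV speed -> higher reward
    p_st = -sum(v < 5.0 for v in cav_speeds) / max(n_obs, 1)    # stopping penalty (speed below 5 m/s)
    r_e = W["w0"] * r_speed + W["w1"] * p_st                    # efficiency, cf. eq. (11)
    r_g = n_merged / max(n_obs, 1)                              # task objective: successful merges
    r_c = -np.mean(np.abs(accs))                                # comfort: penalize large accelerations
    r_s = -n_lane_changes / max(n_obs, 1)                       # safety: penalize frequent lane changes
    return W["we"] * r_e + W["wg"] * r_g + W["wc"] * r_c + W["ws"] * r_s
```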
step 4: model training and testing, training a model using Q-Learning based on experience playback (Experience replay) and a target network, generating experience(s) in agent-environment simulation interactions t ,a t ,r t ,s t+1 ) After that, the batch is randomly fetched from the experience playback pool at intervals of u as m b Learning of empirical data of (a), the training is aimed at minimizing the loss function L θ
After the trained model is obtained, its performance is tested under different traffic conditions and evaluated using the cumulative reward of each task episode.
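As an illustration only, the sketch below shows an experience-replay / target-network training step in a simplified per-CAV view, reusing the STGraphQNet sketched for step 2.2; the replay capacity, batch size, learning rate, and soft-update rate follow the embodiment described below, while the discount factor and the interfaces are assumptions.

```python
# Hedged sketch of the Q-Learning update with experience replay and a target network.
import copy, random
from collections import deque
import torch

q_net = STGraphQNet()                       # network sketched in step 2.2
target_net = copy.deepcopy(q_net)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-4)
replay = deque(maxlen=1_000_000)            # experience replay pool
GAMMA, BATCH, TAU = 0.99, 32, 0.1           # GAMMA is an assumption; the others follow the text

def train_step():
    if len(replay) < BATCH:
        return
    losses = []
    for s, cav, a, r, s_next in random.sample(replay, BATCH):
        q_sa = q_net(*s)[cav, a]                               # Q_theta(s_t, a_t) of one CAV
        with torch.no_grad():
            q_next = target_net(*s_next)[cav].max()            # max_a' Q_theta'(s_{t+1}, a')
        losses.append((q_sa - (r + GAMMA * q_next)) ** 2)      # squared TD error
    loss = torch.stack(losses).mean()                          # mini-batch loss L_theta
    optimizer.zero_grad(); loss.backward(); optimizer.step()

def soft_update():                          # applied to the target network every 1000 steps
    for p, tp in zip(q_net.parameters(), target_net.parameters()):
        tp.data.mul_(1 - TAU).add_(TAU * p.data)
```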
The implementation of this embodiment in a practical experiment is as follows:
(1) Experimental scenario set-up
The experimental scene is a ramp-merging traffic scene built on the SUMO simulation platform. The main road has three lanes and a total length of 800 metres, divided into three segments of 200 m before merging, 400 m in the merging area, and 200 m after merging; the ramp is a single-lane segment of 200 metres. Vehicles enter the road network randomly from the main road and the ramp. The main-road vehicles are all HDVs, with their number set to 75; the ramp carries a mixed flow of CAVs and HDVs, with 12 CAVs and 13 HDVs, and the flow rate is set to 0.5 vehicles per second in the training phase.
(2) Parameter determination
The deep reinforcement learning model is built with PyTorch and PyTorch Geometric and consists of four sub-modules: the embedding-layer encoder, the spatial feature extraction module (GCN), the temporal feature extraction module (GRU), and the policy network. The detailed parameters of each module are as follows: the embedding-layer encoder consists of two fully connected layers whose hidden layers have 32 and 64 units respectively; the spatial and temporal feature extraction modules each have 64 units; the policy network has 3 layers with 128, 63, and 32 hidden units respectively. The deep reinforcement learning training parameters are as follows: the first 1×10^4 steps are a random exploration phase; the experience replay pool size is 1×10^6; the batch size drawn from the replay pool is 32; the total number of training episodes is 100. Adam is used as the model parameter optimizer with a learning rate of 1×10^-4; the target Q-network is updated every 1000 steps using soft updates with an update rate of 1×10^-1.
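For convenience, the training hyperparameters listed above can be collected into a single configuration, as in the hedged sketch below; the dictionary layout and key names are assumptions, and the values merely restate the embodiment.

```python
# Hedged summary of the training configuration stated in this embodiment.
TRAIN_CONFIG = {
    "random_exploration_steps": 10_000,   # first 1e4 steps explore randomly
    "replay_pool_size": 1_000_000,        # experience replay pool capacity
    "batch_size": 32,                     # mini-batch drawn from the replay pool
    "training_episodes": 100,
    "optimizer": "Adam",
    "learning_rate": 1e-4,
    "target_update_interval": 1000,       # steps between target Q-network updates
    "soft_update_rate": 0.1,              # soft-update coefficient
}
```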
(3) Experimental results
The invention aims to control the CAVs of the mixed traffic flow to complete the ramp-merging task, and experiments are carried out in the SUMO-based ramp-merging simulation environment. Using the model of the invention, the training rewards of the hybrid action space are compared with those of a single action space, as shown in fig. 4(a): after the model converges, the hybrid action space reaches a higher reward value than the single action space. With the hybrid action space used uniformly, the training rewards of the proposed method are compared with those of a common reinforcement learning method and a rule-based method, as shown in fig. 4(b): the method of the invention converges faster and, after convergence, reaches a higher reward than the other two methods. The test-stage rewards are compared in fig. 5: under different traffic flows, the rewards of the proposed method are generally more stable than those of the other methods and are less affected by environmental changes.

Claims (5)

1. The ramp confluence region mixed traffic flow control method based on space-time diagram neural network reinforcement learning is characterized by comprising the following steps:
step 1: setting a scene of ramp-merging mixed traffic flow control, and defining the matrix representation of the environment and the state information perceived by the agent; building a ramp merging area traffic scene based on the SUMO simulation platform, setting behavior models and control strategies for HDVs and CAVs respectively, and, according to the input defined by the model, setting the state space as the time series of the vehicles' state information, adjacency matrices, and mask matrices over a certain time period;
step 2: constructing a vehicle graph network for each time step in a given time period to represent the spatial and temporal correlations of the vehicles in the scene; based on the spatio-temporal graph data of the vehicles, constructing a spatial graph neural network, a temporal graph neural network, and a policy network, and combining them into an overall network architecture;
step 3: defining the action space and reward function of the reinforcement learning agent; based on the scene setup, defining the action space as a hybrid action space of lane changing and acceleration/deceleration, and setting a target reward function to guide the agent to learn the optimal action policy;
step 4: training and testing the model; the model is trained using experience replay and a target network, after which the performance of the reinforcement learning agent model is tested under different traffic conditions.
2. The ramp confluence region mixed traffic flow control method based on space-time diagram neural network reinforcement learning as set forth in claim 1, wherein the process of step 1 is as follows:
step 1.1: setting a task scene;
the traffic scene is designed on the SUMO simulation platform and consists of a three-lane main road and a single-lane ramp; the total length of the main road is L, the merge entrance is located at distance L1 along the main road, the main-road segment before the merge entrance is PreMerge, the ramp segment is OnRamp with length L2, the merging segment Mainline has length L3, and the segment after merging, PostMerge, has length L4; the road speed limit is v_max, the maximum speed of a CAV is v_c, and the maximum speed of an HDV is v_h; traffic flow is injected into the road network from the main road and the ramp, the main road carries only HDVs, the ramp carries a mixed flow of HDVs and CAVs, and the initial speed, initial lane, and initial position of each vehicle are random; the flow rates of the main road and the ramp are denoted f_h and f_m respectively, and the CAV penetration rate is set to a fixed value Rate_cav; the longitudinal and lateral actions of the HDVs are controlled by the IDM car-following model and the LC2013 lane-change model built into SUMO, while the CAVs are controlled by the reinforcement-learning-based agent;
step 1.2: defining the state space of the agent;
at each time step, the driving information of a vehicle i, i.e. an HDV or a CAV, in the traffic scene is expressed as a vehicle state comprising the current longitudinal position, speed, and lane of the vehicle i, and the observed longitudinal position and speed of the vehicle are normalized;
the position and speed of a vehicle on the road segment OnRamp are represented by projecting their actual values onto the corresponding values on the main road;
lane information is represented using one-hot encoding, whose four categories represent the ramp, the right main-road lane, the middle main-road lane, and the left main-road lane;
a time-series state space is constructed: with the time period set to T, the state space at time t is represented as S_t = {s_{t-T+1}, ..., s_t}, and the state at each moment is denoted s_t = (X_t, A_t, M_t), where X_t ∈ R^{N×F}, A_t ∈ R^{N×N}, and M_t respectively represent the vehicle state set, the adjacency matrix, and the mask matrix, N represents the maximum number of vehicles in the scene, and F represents the feature dimension of each vehicle; the adjacency matrix A_t is defined as an N×N matrix modelling the vehicle interconnection network as an undirected graph, in which matrix element a_jk = 1 indicates that vehicle j and vehicle k are connected; CAVs are always connected to each other, while a CAV and an HDV are connected only within a certain sensing range.
3. The ramp confluence region mixed traffic flow control method based on space-time diagram neural network reinforcement learning as set forth in claim 1 or 2, wherein the process of step 2 is as follows:
step 2: constructing the spatio-temporal graph network of the traffic scene and extracting spatio-temporal features, the process being as follows:
step 2.1: constructing a time-series graph network;
with the period set to T, the graph data at time t are denoted G_t = {g_{t-T+1}, ..., g_t}, and the graph of one time slice is denoted g_t = {V_t, E_t}, where V_t represents the set of vehicle nodes at each time and E_t is the edge matrix representing the correlation between vehicles; for vehicles i and j, the connection relationship is defined from their current positions p_i^t and p_j^t as e_ij^t = 1 if |p_i^t − p_j^t| ≤ ρ and e_ij^t = 0 otherwise,
where ρ represents the sensing range of the vehicle sensors; when vehicles i and j are both CAVs, they can communicate with each other to obtain each other's information without relying on sensing, so e_ij^t is always 1 regardless of the sensing range; the adjacency matrix of the vehicle network graph at each time slice is assembled from these edge indicators;
finally obtaining the spatio-temporal graph network data of the traffic scene;
step 2.2: extracting spatio-temporal features and constructing the policy network;
first, a multi-layer perceptron MLP with Z_m layers of dimension d_m is used as the encoder, each fully connected layer of the MLP uses a ReLU activation function, and the vehicle state set X_t of each time slice within one period is encoded to obtain the node embedding X'_t;
a graph convolutional network GCN containing z_G graph convolutional layers of dimension d_G extracts the spatial correlation features of the vehicles from the node embeddings, obtaining node representations X''_t containing deep spatial correlation features:
X''_t = σ( D̃^{-1/2} Ã_t D̃^{-1/2} X'_t W + b )
where Ã_t = A_t + I_N is the adjacency matrix with self-loops added, D̃ is the degree matrix of Ã_t, σ represents the activation function ReLU, W is the weight of the neural network, and b is the bias;
a gated recurrent unit GRU network is constructed to extract the temporal features and obtain the temporal correlations, the GRU network being formulated as:
z_t = σ(W_z X''_t + U_z h_{t-1} + b_z)
r_t = σ(W_r X''_t + U_r h_{t-1} + b_r)
h̃_t = tanh(W_h X''_t + U_h (r_t ⊙ h_{t-1}) + b_h)
h_t = (1 − z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t
where W_z, W_r, W_h, U_z, U_r, U_h are learnable parameters, b_z, b_r, b_h are bias parameters, σ denotes the activation function, X''_t is the GCN output of the current time step, h_{t-1} is the hidden-layer output of the previous time step, and ⊙ denotes the element-wise product;
a policy network P consisting of fully connected layers is constructed to compute the action values Q of the agent, using ReLU as the activation function, with an output dimension equal to the action space of the agent; the overall network comprising the encoder, the spatio-temporal feature extraction networks, and the policy network P can be regarded as a Q-value network Q_θ, where θ represents the set of all learnable weights;
the input of the policy network is the vector formed by concatenating the output of the last GRU step with the node embedding X'_t of the current time step, and the optimal action decision a_{t+1} is obtained by selecting the action with the maximum Q value.
4. The ramp confluence region mixed traffic flow control method based on space-time diagram neural network reinforcement learning as set forth in claim 1 or 2, wherein the process of step 3 is as follows:
step 3: defining the action space and reward function of the reinforcement learning agent:
step 3.1: defining the action space of the agent;
at each time step, the agent computes the optimal action in the current state based on the current observation; according to the requirements of the task, the CAV action is defined over a hybrid action space combining lane changing and acceleration, expressed as a_i = {R_1, R_2, R_3, K_1, K_2, K_3, L_1, L_2, L_3}, where R, K, and L respectively indicate changing to the right lane, keeping the lane, and changing to the left lane, and subscripts 1, 2, 3 respectively indicate accelerating, decelerating, and maintaining speed; acceleration and deceleration are achieved by changing the acceleration or deceleration value of the vehicle by a fixed step k each time; the action output by the agent is the set {a_1, ..., a_n} of all CAV actions, where n is the number of CAVs at the current moment;
step 3.2: defining the reward function of the agent;
in reinforcement learning, the definition of the reward function has an important effect on the quality of the policy learned by the agent, and the design of the reward function considers rewards and penalties of different categories, including efficiency R_e, task objective R_g, driving comfort R_c, and safety R_s; the efficiency reward R_e takes the CAV speed as its measure:
R_e = w_0 R_speed + w_1 P_st    (11)
where R_speed and P_st respectively represent the speed reward and the stopping penalty, N represents the number of CAVs in the scene, v_i represents the speed of the CAV numbered i, v_max is the maximum speed set for the CAVs, N_st represents the number of CAVs with a speed below 5 m/s, N_obs represents the number of currently observed CAVs, and w_0, w_1 are weight parameters;
the task in the ramp merging scene is to control the ramp CAVs to merge smoothly onto the main road, which the task-objective reward R_g rewards;
the index used for the driving comfort evaluation is the CAV acceleration, where acc represents the acceleration value of the CAV at the current moment;
the safety reward R_s takes the CAV lane-change frequency as its evaluation criterion; frequent lane changes are one of the main causes of safety accidents, so CAVs that change lanes frequently are penalized, where N_dra is the number of CAVs executing a lane-change action and N_obs indicates the number of currently observed CAVs; at time t, the total reward R_t of the agent is expressed as:
R_t = w_e R_e + w_g R_g + w_c R_c + w_s R_s
where w_e, w_g, w_c, w_s are weight parameters whose values can be set according to the requirements of the task scene.
5. The ramp confluence region mixed traffic flow control method based on space-time diagram neural network reinforcement learning as set forth in claim 1 or 2, wherein the process of step 4 is as follows:
step 4: model training and testing; the model is trained with Q-Learning based on experience replay and a target network; experiences (s_t, a_t, r_t, s_{t+1}) are generated in the agent-environment simulation interaction, and every u steps a mini-batch of m_b experiences is randomly drawn from the experience replay pool for learning, the training objective being to minimize the loss function L_θ;
after the trained model is obtained, its performance is tested under different traffic conditions and evaluated using the cumulative reward of each task episode.
CN202311690874.XA 2023-12-11 2023-12-11 Ramp confluence region mixed traffic flow control method based on space-time diagram neural network reinforcement learning Pending CN117636651A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311690874.XA CN117636651A (en) 2023-12-11 2023-12-11 Ramp confluence region mixed traffic flow control method based on space-time diagram neural network reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311690874.XA CN117636651A (en) 2023-12-11 2023-12-11 Ramp confluence region mixed traffic flow control method based on space-time diagram neural network reinforcement learning

Publications (1)

Publication Number Publication Date
CN117636651A true CN117636651A (en) 2024-03-01

Family

ID=90030375

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311690874.XA Pending CN117636651A (en) 2023-12-11 2023-12-11 Ramp confluence region mixed traffic flow control method based on space-time diagram neural network reinforcement learning

Country Status (1)

Country Link
CN (1) CN117636651A (en)

Similar Documents

Publication Publication Date Title
CN112215337B (en) Vehicle track prediction method based on environment attention neural network model
CN111931905B (en) Graph convolution neural network model and vehicle track prediction method using same
Bai et al. Hybrid reinforcement learning-based eco-driving strategy for connected and automated vehicles at signalized intersections
CN109726804B (en) Intelligent vehicle driving behavior personification decision-making method based on driving prediction field and BP neural network
CN112216124A (en) Traffic signal control method based on deep reinforcement learning
CN114170789B (en) Intelligent network link lane change decision modeling method based on space-time diagram neural network
CN111931902A (en) Countermeasure network generation model and vehicle track prediction method using the same
CN112071062B (en) Driving time estimation method based on graph convolution network and graph attention network
Wang et al. Autonomous ramp merge maneuver based on reinforcement learning with continuous action space
CN113643528A (en) Signal lamp control method, model training method, system, device and storage medium
CN113561986A (en) Decision-making method and device for automatically driving automobile
CN110281949B (en) Unified hierarchical decision-making method for automatic driving
CN112949933A (en) Traffic organization scheme optimization method based on multi-agent reinforcement learning
CN115019523B (en) Deep reinforcement learning traffic signal coordination optimization control method based on minimized pressure difference
CN112907970A (en) Variable lane steering control method based on vehicle queuing length change rate
CN113724507A (en) Traffic control and vehicle induction cooperation method and system based on deep reinforcement learning
CN115331460B (en) Large-scale traffic signal control method and device based on deep reinforcement learning
CN117636651A (en) Ramp confluence region mixed traffic flow control method based on space-time diagram neural network reinforcement learning
CN115100866B (en) Vehicle-road cooperative automatic driving decision-making method based on layered reinforcement learning
CN115208892B (en) Vehicle-road collaborative online task scheduling method and system based on dynamic resource demand
CN115719547A (en) Traffic participant trajectory prediction method and system based on multiple interactive behaviors
CN114495486B (en) Microcosmic traffic flow prediction system and microcosmic traffic flow prediction method based on hierarchical reinforcement learning
Zhang et al. Spatial attention for autonomous decision-making in highway scene
CN113741464B (en) Automatic driving speed control framework based on space-time data reinforcement learning
Cai et al. Intersection self-organization control for connected autonomous vehicles based on traffic strategy learning algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination