CN113487902B

CN113487902B - Reinforced learning area signal control method based on vehicle planned path

Info

Publication number: CN113487902B
Application number: CN202110534127.1A
Authority: CN
Inventors: 王昊; 卢云雪; 董长印; 杨朝友
Original assignee: Yangzhou Fama Intelligent Equipment Co ltd; Southeast University
Current assignee: Yangzhou Fama Intelligent Equipment Co ltd; Southeast University
Priority date: 2021-05-17
Filing date: 2021-05-17
Publication date: 2022-08-12
Anticipated expiration: 2041-05-17
Also published as: CN113487902A

Abstract

The invention discloses a reinforcement learning area signal control method based on a vehicle planned path, which is specifically characterized in that under the environment of an internet of vehicles, planned path information and position information of all vehicles at intersections in an intelligent agent control range are collected, and distributed signal control is carried out on road intersections in an area by utilizing a reinforcement learning PPO2 algorithm, so that the linkage optimization of area traffic is realized. Particularly, a control framework for controlling a regional traffic signal by multi-agent reinforcement learning is provided; defining a road traffic state based on the vehicle planned path information and the vehicle position information; defining an action variable for controlling an intersection signal; the intelligent agent and traffic environment interactive reward is defined by taking the aims of reducing the queuing length of an intersection, reducing vehicle delay and avoiding downstream traffic jam as targets; meanwhile, a distance factor is provided for measuring the distance between a control scheme generated by the PPO2 algorithm and a queuing priority length priority strategy generation scheme, and abnormal road traffic disturbance caused by a poor control method output by the PPO2 algorithm is avoided.

Description

Reinforced learning area signal control method based on vehicle planned path

Technical Field

The invention belongs to the field of traffic management and control, and particularly relates to a reinforcement learning area signal control method based on a vehicle planned path.

Background

In the internet of vehicles environment, vehicles exchange information including local paths of the vehicles, vehicle position information, vehicle speeds and the like with roadside facilities in real time through vehicle-mounted equipment. Reinforcement learning based signal control methods typically incorporate vehicle position and speed information into the algorithm inputs to develop a more accurate signal control scheme. However, the local route information of the vehicles in the car networking environment is generally easily obtained and is information that can reflect the distribution of the vehicles on the road network and the traffic flow, but cannot be used in the signal control. In addition, when the existing multi-agent reinforcement learning algorithm is used for carrying out distributed control on regional traffic, a single intersection is usually used as an independent agent, and the feedback obtained after the agent generates a signal control scheme usually only considers the vehicle queuing condition or delay of the intersection, but the design mode is not beneficial to the combined control of the regional traffic. In addition, existing reinforcement learning models typically require interaction with a traffic simulator, such as SUMO, through a pre-stage, to accumulate data for model training. However, the traffic simulation environment is different from a real traffic system, and when the reinforcement learning model trained by the simulator is transferred to the actual traffic environment, the control effect of the model is poor.

Disclosure of Invention

The purpose of the invention is as follows: aiming at the problems, the invention provides a reinforcement learning area signal control method based on a vehicle planning path; in the reinforcement learning algorithm, the local planned path of the vehicle is used as the reference information of the signal control scheme as state input, so that the overall state and trend of the regional traffic can be effectively controlled, the predictability of the control scheme is improved, and the optimization of the overall traffic running state of the region is facilitated.

The technical scheme is as follows: in order to realize the purpose of the invention, the technical scheme adopted by the invention is as follows: a reinforcement learning area signal control method based on a vehicle planned path comprises the following steps:

step 1, designing a control framework of an intelligent agent in traffic signal control of a target area, and modeling a road traffic state, wherein the control framework comprises the following steps: taking each intersection in the target area as an independent agent, and constructing a respective corresponding reinforcement learning control model and a database for each independent agent;

step 2: enabling an independent intelligent agent at the intersection to interact with the environment of the intersection, and collecting road traffic state information within a certain range of the intersection in real time; the certain range includes the intersection and an entrance lane of an adjacent intersection;

step 3, taking the road traffic state information of the intersection at the current moment as the input of a reinforcement learning control model corresponding to the intersection to obtain an intersection signal control scheme at the next moment of the current moment and an evaluation result of the control scheme; the signal control scheme comprises a release phase and a green time;

step 4, generating an intersection signal control scheme at the next moment of the current moment by using a queuing priority length strategy according to the road traffic state information of the intersection at the current moment;

step 5, calculating a distance factor by using an intersection signal control scheme obtained by the reinforcement learning control model and an intersection signal control scheme generated by a queuing priority length strategy; if the calculated distance factor is larger than the set distance threshold, implementing an intersection signal control scheme generated by a queuing priority length strategy at the intersection; otherwise, implementing an intersection signal control scheme obtained by a reinforcement learning control model at the intersection;

and 6, storing the road traffic state information collected by the intersection intelligent bodies in the target area, the signal control schemes corresponding to the intersections and the rewards of the intersection intelligent bodies and the environment interaction into databases corresponding to the intersections in real time, updating the reinforcement learning control model parameters corresponding to the intersections when the data information stored in the intersection database is judged to be accumulated to a set size, emptying all data in the database after the updating is finished, and returning to the step 2.

Furthermore, the road traffic state information in the step 2 is a set formed by a vehicle planning path matrix, a vehicle position matrix, a corresponding relation vector of a lane and a road section and a green time vector, and the overall state and trend of regional traffic can be effectively grasped;

distribution for vehicle planned path matrix _m×n×4 Representing, wherein each row corresponds to one lane, the lanes in the monitoring range of the intelligent agent are separated by 1 meter to obtain a plurality of cells, and each column corresponds to one cell; if a vehicle exists in the kth cell of the lane i at the time t, storing four planned road segment numbers which may be passed by the vehicle after the time t in Distribution (i, k,1), Distribution (i, k,2), Distribution (i, k,3) and Distribution (i, k,4) respectively;

pos for the vehicle position matrix _m×n×1 Representing, wherein each row corresponds to a lane within the agent monitoring range; each column corresponds to one cell; if a vehicle exists in the kth cell of the lane i at the time t, Pos (i, k) is 1; if no vehicle exists in the kth cell of the lane i at the time t, Pos (i, k) is 0;

the corresponding relation vector of the lane and the road section is I _m×1 Represents; then I _i A link number indicating the lane i is located;

the green time vector is G _m×1 Represents; then G is _i Representing the remaining green light transit time of the current cycle in which lane i is located at time t.

Further, the calculation formula of the distance factor in step 5 is as follows:

wherein γ is a distance factor;

an intersection signal control scheme obtained for a reinforcement learning control model;

for queue length priority policy generationAnd (3) forming a signal control scheme.

Further, in step 6, the road traffic state information collected by the intersection intelligent agents in the target area, the signal control schemes corresponding to the intersections respectively and rewards for interaction between the intersection intelligent agents and the environment<s _t ,a _t ,r _t+1 ,s _t+1 >Storing the form of the data into a database corresponding to each cross port; wherein s is _t Representing road traffic state information collected by the intelligent agent at the intersection at the moment t; a is _t Representing a signal control scheme implemented at the intersection at the time t; r is _t+1 A reward representing interaction of the intersection agent with the environment at time t + 1; s _t+1 And the road traffic state information collected by the intersection intelligent agent at the moment t +1 is represented.

Further, the reward of interaction between the intelligent agent and the environment at the intersection is obtained by calculating the waiting time of the first vehicle at the intersection entrance lane, the queuing length of the intersection entrance lane and the queuing length of the intersection exit lane, and specifically comprises the following steps:

in the formula, r _t+1 Awarding interaction between the intelligent agent at the intersection and the environment at the time t + 1; l _in 、l _out Respectively an inlet channel and an outlet channel of the intersection; w is a _i 、q _i Respectively the first vehicle waiting time and the queuing length of the lane i; f. of _j Is a Boolean variable used for measuring whether the queuing length of the outlet passage exceeds the length L of the road section _j Three fourths of the total area of the coating, if

Then f is _j 1, otherwise f _j ＝0；L _j Is the road segment length of lane j; q. q.s _j The length of the queue for lane j; 0 is a penalty factor.

Further, step 6, when the data information stored in the intersection database is accumulated to the set size, updating the reinforcement learning control model parameter corresponding to the intersection, and emptying all data in the database after the update is completed, where the method includes:

step 6.1, initializing reinforcement learning control model parameters, including:

initializing values of the hyper-parameters, including a learning rate alpha, a threshold sigma of a distance factor and a penalty factor delta;

is an action model Actor _θ Parameter(s) and evaluation model Critic _w The parameters of (a) are assigned with initial values, wherein theta and w are respectively parameters of the action model to be updated and the evaluation model;

define Actor _ old _θ′ 、Critic_old _w′ Is Actor _θ 、Critic _w Copies of the model, Actor _ old _θ′ The parameters of the model are equal to Actor _θ Parameters when not updated are kept unchanged in the updating process;

is Actor _θ And Critic _w Setting training times n _ operator and n _ critical;

step 6.2, utilize all data x in the database _t ＝<s _t ,a _t ,r _t+1 ,s _t+1 Update an action model in a reinforcement learning control model, comprising:

step 6.21, calculate A(s) _t ,a _t )＝r _t+1 +τV _w (s _t+1 )-V _w (s _t )

In the formula, V _w (s _t+1 ) To evaluate the model Critic _w The evaluation result is output at the moment t + 1; v _w (s _t ) The evaluation result is output by the evaluation model at the time t; τ is a reduction factor; a(s) _t ,a _t ) For information on road traffic conditions s _t Lower implementation Signal control scheme a _t The advantages of (a);

step 6.22, calculating Actor _θ Gradient of the model:

wherein E represents a mathematical expectation; (s) _t ,a _t )～π _θ′ To representThe data used is by Actor _ old _θ′ Obtaining a model; p _θ (a _t |s _t )、P _θ′ (a _t |s _t ) Is an action model Actor _θ 、Actor_old _θ′ Information s on road traffic conditions _t Lower implementation Signal control scheme a _t The probability of (d);

representing the derivation of the parameter theta;

6.23, updating the parameter theta according to an Adam optimization method;

step 6.24, repeat step 6.22-step 6.23n _ actor times;

step 6.3, all data x in the database are utilized _t ＝<s _t ,a _t ,r _t+1 ,s _t+1 The method for updating the evaluation model in the reinforcement learning control model comprises the following steps:

step 6.31, calculate A(s) _t ,a _t )＝r _t+1 +τV _w′ (s _t+1 )-V _w (s _t )；

In the formula, V _w′ (s _t+1 ) Represents the evaluation model Critic _ old _w′ The output result of (2);

step 6.32, calculating an evaluation model Critic _w Gradient (2):

in the formula (I), the compound is shown in the specification,

representing the derivation of the parameter w;

step 6.33, updating the parameter w according to an Adam optimization method;

step 6.34, repeat step 6.31-step 6.33n _ critical times;

and 6.4, emptying all data information in the database.

Has the advantages that: compared with the prior art, the technical scheme of the invention has the following beneficial technical effects:

the intersection and the adjacent intersections in the target area are taken as monitoring objects, and regional vehicle planning path data under the Internet of vehicles environment are considered in the state variables, so that the road traffic state is represented more comprehensively; constructing a regional signal control reinforcement learning model by combining a PPO algorithm and an LSTM algorithm; the distance factor is provided to measure the distance between the reinforcement learning model and the control scheme generated by the traditional queuing length priority strategy, so that the influence of the untrained mature model on road traffic safety and traffic efficiency due to the generation of an improper signal control scheme in the online learning process can be effectively avoided.

Drawings

FIG. 1 is a schematic diagram of modeling a lane condition, under an embodiment;

FIG. 2 is a diagram of a road traffic condition result for a lane under one embodiment;

FIG. 3 is a schematic illustration of an intersection in one embodiment;

FIG. 4 is a schematic structural diagram of a PPO2 model according to an embodiment;

FIG. 5 is a logic flow diagram of a method of the present invention in one embodiment.

Detailed Description

The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.

Referring to fig. 5, the method of the present invention is further illustrated by taking an intersection as an example. The invention relates to a reinforcement learning area signal control method based on a vehicle planned path, which specifically comprises the following steps:

(1) designing a control framework of an intelligent agent in regional traffic signal control, modeling a road traffic state, and comprising the following steps:

under the environment of the Internet of vehicles, the regional path and the vehicle position information of the vehicle are fully utilized, so that the road traffic state is more comprehensively grasped and analyzed.

Specifically, each intersection is used as an independent agent, the intersection and the entrance lane of the adjacent intersection are used as observation ranges, and planning path information and vehicle position information of vehicles in the ranges are collected.

Defining the regional vehicle planning path matrix as Distribution _m×n×4 Each row corresponds to one lane, and the lanes in the intelligent agent monitoring range are divided into 1 meter of cells; if a vehicle exists in the kth cell of the lane i at the time t, storing four planned road segment numbers after the time t of the vehicle in Distribution (i, f,1), Distribution (i, k,2), Distribution (i, k,3) and Distribution (i, k,4) respectively;

defining a vehicle position matrix as Pos _m×n×1 Each row of the matrix corresponds to a lane in the monitoring range, the lane takes 1 meter as a cell, and if a vehicle exists in the kth cell of the lane i at the time t, Pos (i, k) is 1;

definition I _m×1 As a vector of the cross-relations of lanes and road sections, I _i A link number indicating the lane i is located;

definition G _m×1 Is the green time vector, G _i Indicating the remaining green light transit time of lane i in the current cycle at time t.

Traffic environment state s is defined by Distribution _m×n×4 、Pos _m×n×1 、I _m×1 And G _m×1 The formed set can effectively control the overall state and trend of the regional traffic.

Under the present embodiment, referring to fig. 1, the road traffic state result Distribution of lane i _i×n×4 、Pos _i×n×1 、I _i×1 And G _i×1 As shown with reference to FIG. 2;

(2) constructing a reinforcement learning control model, defining the input and the output of the model, and comprising the following steps:

referring to fig. 4, the reinforcement learning control model adopts a distributed control mode, and an agent needs to give a phase scheme and a timing scheme of an intersection at the same time; each intersection is used as an independent agent, and a reinforcement learning control model is independently trained; taking all the entrances of the intersection and the adjacent intersections as monitoring ranges, each intelligent agent collects the path information and the position information of all vehicles at the intersection in real time, and simultaneously obtains the path information and the position information of all vehicles at the entrances of the adjacent intersections and the remaining green time of each lane from the adjacent intersections as the input of a PPO2 algorithm.

The PPO2 model comprises an action model Actor and an evaluation model Critic, and the action model outputs a signal control scheme a; the evaluation model outputs an evaluation v of the signal control scheme a.

To improve the efficiency of training, Actor and Critic will share the underlying input layer. Meanwhile, the PPO2 algorithm is combined with a long and short memory model to enhance the memory of the model to the historical state, so that the intelligent agent can make more reasonable decisions.

(3) The intelligent agent interacts with the road traffic environment, and specifically comprises the following steps: at time t, the agent reads the road traffic status s _t Outputting a signal scheme at the time t according to the PPO2 control model

The release phase and green time at this intersection are given.

The phase set is defined as the combination of all non-conflicted traffic flows at the intersection; for example, for a typical crossroad with independent entry lanes for each flow direction, the action set is defined as { north-south straight, north-south left turn, east-west straight, east-west left turn, south-north straight left, north-east straight left, east-west straight left }, and the duration of each signal phase execution is not fixed.

Signal control scheme generated by agent at time t

Signal control scheme generated by queue length priority strategy instead of direct signal control

The distance factor gamma is calculated. The queuing length priority strategy means that the intersection always gives priority to the phase green light time with the longest queuing length; the distance factor gamma is used for measuring the signal control scheme of the PPO2 control model output

Signaling in conjunction with "queue length priority policy generationScheme for making

The formula is as follows:

when gamma is larger than a certain threshold value, the intersection signal control scheme generated by the queuing priority length strategy is actually implemented at the intersection at the time t, namely

Otherwise, implementing an intersection signal control scheme obtained by the PPO2 control model at the intersection, namely

Action a _t After implementation, the traffic environment enters the next state s _t+1 。

Under the present embodiment, the signal control scheme generated by the DPPO2 control model

Signal control scheme generated by queuing length priority strategy

For example, there are

If the threshold value sigma is 6, implementing an intersection signal control scheme obtained by a PPO2 control model at the intersection, namely

To a _t The conflict-free processing is carried out by _t ＝[15，0，0，0，0，0，0,0]。

(4) Designing an incentive r for interaction of an intelligent agent and the environment, aiming at reducing the queuing length of an intersection, reducing vehicle delay and avoiding downstream traffic jam, wherein the incentive is defined as the weighted sum of the queuing length weighted by the waiting time of the first vehicle in an entrance way and the queuing length index in an exit way, namely: reward is the waiting time w of the first vehicle, the inlet lane queuing length q + delta, the outlet lane queuing length index f:

in the formula I _in 、l _o8t Respectively an inlet channel and an outlet channel of the intersection; w is a _i 、q _i Respectively the first vehicle waiting time and the queuing length of the lane i; f. of _j Is a Boolean variable used for measuring whether the queuing length of the outlet passage exceeds the length L of the road section _j Three fourths of the total area of the coating, if

F is then _j 1, otherwise f _j ＝0；L _j Is the road segment length of lane j; q. q.s _j The length of the queue for lane j; delta is a penalty factor.

In the present embodiment, with reference to figure 3,

the queue length q of each entrance lane of the intersection is [20,14,32,20,15,24,20,15,20,26,18,18,12,30 ];

the waiting time w of the first vehicle at each entrance lane is [25,25,15,15,0,0,36,25,25,15,15,0,0,36 ];

each outlet channel queue length p is [5,22,12,14,118,34,12,18,18,10,5,24,5,13], and a boolean variable f is obtained by conversion [0,0,0,0,1,0,0,0,0,0,0,0 ];

taking δ as 100 and converting w into the unit of hour, the following is obtained:

r≈-16.83-100*1＝-116.83

(5) and storing the data of the interaction between the intelligent agent and the environment into a database for Replay buffer. Data to<s _t ,a _t ,r _t+1 ,s _t+1 >When the database is accumulated to the set size Z, the PPO2 model parameters are updated, the steps are as follows:

(5.1) initializing reinforcement learning control model parameters, including:

initializing values of the hyper-parameters, including a learning rate α of 0.0001; the threshold σ of the distance factor is 6; the penalty factor delta is 100, and Z is 512;

define Actor _ old _θ′ 、Critic_old _w′ Is Actor _θ 、Critic _w Copy of the model, Actor _ old _θ′ The parameters of the model are equal to Actor _θ Parameters when not updated are kept unchanged in the updating process;

is Actor _θ And Critic _w Setting the training times n _ operator to be 10 and n _ critical to be 10;

(5.2) use of all data x in the database _t ＝<s _t ，a _t ,r _t+1 ，s _t+1 >. update the action model in the PPO2 control model, including:

(5.21) calculation of A(s) _t ，a _t )＝r _t+1 +τV _w (s _t+1 )-V _w (s _t )

(5.22) calculating Actor _θ Gradient of the model:

wherein E represents a mathematical expectation; (s) _t ,a _t )～π _θ′ Data indicating use is by Actor _ old _θ′ Obtaining a model; p is _θ (a _t |s _t )、P _θ′ (a _t |s _t ) Is an action model Actor _θ 、Actor_old _θ′ Information s on road traffic conditions _t Lower implementation Signal control scheme a _t The probability of (d);

representing the derivation of the parameter theta;

(5.23) updating the parameter theta according to an Adam optimization method;

(5.24) repeating step (5.22) -step (5.23)10 times;

(5.3) use of all data x in the database _t ＝<s _t ,a _t ,r _t+1 ,s _t+1 >. update the evaluation model in the PPO2 control model, including:

(5.31) calculation of A(s) _t ，a _t )＝r _t+1 +τV _w′ (s _t+1 )-V _w (s _t )；

In the formula, V _w′ (s _t+1 ) Represents the evaluation model Critic _ old _w′ The output result of (1);

(5.32) calculation of evaluation model Critic _w Gradient (2):

in the formula (I), the compound is shown in the specification,

representing the derivation of the parameter w;

(5.33) updating the parameter w according to the Adam optimization method;

(5.34) repeating step (5.31) -step (5.33)10 times;

(5.4) emptying the Replay buffer and repeating the steps (3) - (5).

Claims

1. A reinforcement learning area signal control method based on a vehicle planned path is characterized by comprising the following steps:

step 1, designing a control framework of an intelligent agent in traffic signal control of a target area, and modeling a road traffic state, wherein the control framework comprises the following steps: taking each intersection in the target area as an independent intelligent agent, and constructing a respective corresponding reinforcement learning control model and a database for each independent intelligent agent;

the road traffic state information is a set formed by a vehicle planning path matrix, a vehicle position matrix, a corresponding relation vector of a lane and a road section and a green time vector;

the green time vector is G _m×1 Represents; then G is _i Representing the remaining green light passing time of the current period of the lane i at the moment t;

the calculation formula of the distance factor is as follows:

wherein gamma is a distance factor;

a signal control scheme generated for the queuing length priority strategy;

2. The signal control method for the reinforcement learning area based on the vehicle planned path as claimed in claim 1, wherein the road traffic state information collected by the intersection intelligent agents in the target area, the signal control schemes corresponding to the intersections respectively and the reward of interaction between the intersection intelligent agents and the environment in the step 6 are provided<s _t ，a _t ，r _t+1 ，s _t+1 >Storing the form of the data into a database corresponding to each cross port; wherein s is _t Representing road traffic state information collected by the intelligent body at the intersection at the time t; a is _t Representing a signal control scheme implemented at the intersection at the time t; r is _t+1 A reward representing interaction of the intersection agent with the environment at time t + 1; s _t+1 And the road traffic state information collected by the intersection intelligent agent at the moment t +1 is represented.

3. The reinforcement learning area signal control method based on the vehicle planned path according to claim 1, wherein the reward for interaction between the intersection intelligent agent and the environment is calculated from the first vehicle waiting time of an intersection entrance lane, the intersection entrance lane queuing length and the intersection exit lane queuing length, and specifically comprises the following steps:

F is then _j 1, otherwise f _j ＝0；L _j Is the road segment length of lane j; q. q.s _j Of lane jThe length of the queue; delta is a penalty factor.

4. The reinforcement learning area signal control method based on the vehicle planned path according to claim 2, wherein in step 6, when the data information stored in the intersection database is accumulated to a set size, the reinforcement learning control model parameters corresponding to the intersection are updated, and all data in the database are emptied after the update is completed, and the method comprises:

step 6.2, all data x in the database are utilized _t ＝<s _t ，a _t ，r _t+1 ，s _t+1 >Updating the action model in the reinforcement learning control model, comprising:

step 6.21, calculate A(s) _t ，a _t )＝r _t+1 +τV _w (s _t+1 )-V _w (s _t )

In the formula, V _w (s _t+1 ) To evaluate the model Critic _w The evaluation result is output at the moment t + 1; v _w (s _t ) The evaluation result is output by the evaluation model at the time t; τ is a reduction factor; a(s) _t ，a _t ) For information s on road traffic conditions _t Lower implementation Signal control scheme a _t The advantages of (a);

step 6.22, calculating Actor _θ Gradient of the model:

wherein E represents a mathematical expectation; (s) _t ，a _t )～π _θ′ Indicates that the data used is by Actor _ old _θ′ Obtaining a model; p _θ (a _t |s _t )、P _θ′ (a _t |s _t ) Is an action model Actor _θ 、Actor_old _θ′ Information s on road traffic conditions _t Lower implementation Signal control scheme a _t The probability of (d);

representing the derivation of the parameter theta;

6.23, updating the parameter theta according to an Adam optimization method;

step 6.24, repeat step 6.22-step 6.23n _ actor times;

step 6.3, all data x in the database are utilized _t ＝<s _t ，a _t ，r _t+1 ，s _t+1 >Updating an evaluation model in the reinforcement learning control model, comprising:

step 6.31, calculate A(s) _t ，a _t )＝r _t+1 +τV _w′ (s _t+1 )-V _w (s _t )；

step 6.32, calculating an evaluation model Critic _w Gradient of (a):

in the formula (I), the compound is shown in the specification,

representing the derivation of the parameter w;

step 6.33, updating the parameter w according to an Adam optimization method;

step 6.34, repeat step 6.31-step 6.33n _ critical times;

and 6.4, emptying all data information in the database.