CN113487902B - Reinforcement learning area signal control method based on vehicle planned path - Google Patents

Reinforcement learning area signal control method based on vehicle planned path

Info

Publication number
CN113487902B
CN113487902B (application CN202110534127.1A)
Authority
CN
China
Prior art keywords
intersection
model
signal control
lane
reinforcement learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110534127.1A
Other languages
Chinese (zh)
Other versions
CN113487902A (en)
Inventor
Wang Hao (王昊)
Lu Yunxue (卢云雪)
Dong Changyin (董长印)
Yang Zhaoyou (杨朝友)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yangzhou Fama Intelligent Equipment Co ltd
Southeast University
Original Assignee
Yangzhou Fama Intelligent Equipment Co ltd
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yangzhou Fama Intelligent Equipment Co ltd, Southeast University
Priority to CN202110534127.1A
Publication of CN113487902A
Application granted
Publication of CN113487902B
Current legal status: Active


Classifications

    • G - PHYSICS
    • G08 - SIGNALLING
    • G08G - TRAFFIC CONTROL SYSTEMS
    • G08G 1/00 - Traffic control systems for road vehicles
    • G08G 1/09 - Arrangements for giving variable traffic instructions
    • G08G 1/0962 - Arrangements for giving variable traffic instructions having an indicator mounted inside the vehicle, e.g. giving voice messages
    • G08G 1/0968 - Systems involving transmission of navigation instructions to the vehicle
    • G08G 1/096833 - Systems involving transmission of navigation instructions to the vehicle where different aspects are considered when computing the route
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 30/00 - Computer-aided design [CAD]
    • G06F 30/20 - Design optimisation, verification or simulation
    • G06F 30/27 - Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 - Machine learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06Q - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 10/00 - Administration; Management
    • G06Q 10/04 - Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G06Q 10/047 - Optimisation of routes or paths, e.g. travelling salesman problem
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06Q - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 50/00 - Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q 50/10 - Services
    • G06Q 50/26 - Government or public services
    • G - PHYSICS
    • G08 - SIGNALLING
    • G08G - TRAFFIC CONTROL SYSTEMS
    • G08G 1/00 - Traffic control systems for road vehicles
    • G08G 1/01 - Detecting movement of traffic to be counted or controlled
    • G08G 1/0104 - Measuring and analyzing of parameters relative to traffic conditions
    • G08G 1/0125 - Traffic data processing
    • G - PHYSICS
    • G08 - SIGNALLING
    • G08G - TRAFFIC CONTROL SYSTEMS
    • G08G 1/00 - Traffic control systems for road vehicles
    • G08G 1/09 - Arrangements for giving variable traffic instructions
    • G08G 1/0962 - Arrangements for giving variable traffic instructions having an indicator mounted inside the vehicle, e.g. giving voice messages
    • G08G 1/0967 - Systems involving transmission of highway information, e.g. weather, speed limits
    • G08G 1/096766 - Systems involving transmission of highway information where the system is characterised by the origin of the information transmission
    • G08G 1/096791 - Systems involving transmission of highway information where the origin of the information is another vehicle
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 - Road transport of goods or passengers
    • Y02T 10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 - Engine management systems

Abstract

The invention discloses a reinforcement learning area signal control method based on vehicle planned paths. In an Internet of Vehicles environment, the planned path information and position information of all vehicles at the intersections within each agent's control range are collected, and distributed signal control of the road intersections in the area is performed with the reinforcement learning PPO2 algorithm, realizing coordinated optimization of regional traffic. Specifically, a control framework in which multi-agent reinforcement learning controls the regional traffic signals is provided; the road traffic state is defined based on vehicle planned path information and vehicle position information; the action variable controlling the intersection signal is defined; and the reward for the interaction between each agent and the traffic environment is defined with the goals of reducing intersection queue length, reducing vehicle delay and avoiding downstream traffic congestion. In addition, a distance factor is proposed to measure the distance between the control scheme generated by the PPO2 algorithm and the scheme generated by the queue-length-priority strategy, preventing a poor control scheme output by the PPO2 algorithm from causing abnormal disturbance to road traffic.

Description

Reinforcement learning area signal control method based on vehicle planned path
Technical Field
The invention belongs to the field of traffic management and control, and particularly relates to a reinforcement learning area signal control method based on a vehicle planned path.
Background
In the Internet of Vehicles environment, vehicles exchange information with roadside facilities in real time through on-board equipment, including their local planned paths, positions and speeds. Reinforcement-learning-based signal control methods typically take vehicle position and speed information as algorithm inputs to produce a more accurate signal control scheme. The local planned path information of vehicles, although readily available in the Internet of Vehicles environment and able to reflect the distribution of vehicles and traffic flows over the road network, is however rarely used in signal control. In addition, when existing multi-agent reinforcement learning algorithms perform distributed control of regional traffic, a single intersection is usually treated as an independent agent, and the feedback the agent receives after generating a signal control scheme usually only considers the vehicle queuing or delay at that intersection; this design is not conducive to coordinated control of regional traffic. Moreover, existing reinforcement learning models typically require a preliminary stage of interaction with a traffic simulator such as SUMO to accumulate data for model training. The traffic simulation environment, however, differs from the real traffic system, so when a reinforcement learning model trained in the simulator is transferred to the actual traffic environment, its control effect is poor.
Disclosure of Invention
Purpose of the invention: to address the above problems, the invention provides a reinforcement learning area signal control method based on vehicle planned paths. In the reinforcement learning algorithm, the local planned paths of vehicles are used as state input and as reference information for the signal control scheme, so that the overall state and trend of regional traffic can be effectively captured, the foresight of the control scheme is improved, and the overall traffic operation of the area is optimized.
The technical scheme is as follows: in order to realize the purpose of the invention, the technical scheme adopted by the invention is as follows: a reinforcement learning area signal control method based on a vehicle planned path comprises the following steps:
step 1, designing a control framework of an intelligent agent in traffic signal control of a target area, and modeling a road traffic state, wherein the control framework comprises the following steps: taking each intersection in the target area as an independent agent, and constructing a respective corresponding reinforcement learning control model and a database for each independent agent;
step 2: enabling an independent intelligent agent at the intersection to interact with the environment of the intersection, and collecting road traffic state information within a certain range of the intersection in real time; the certain range includes the intersection and an entrance lane of an adjacent intersection;
step 3, taking the road traffic state information of the intersection at the current moment as the input of a reinforcement learning control model corresponding to the intersection to obtain an intersection signal control scheme at the next moment of the current moment and an evaluation result of the control scheme; the signal control scheme comprises a release phase and a green time;
step 4, generating the intersection signal control scheme for the next moment by using the queue-length-priority strategy, according to the road traffic state information of the intersection at the current moment;
step 5, calculating the distance factor between the intersection signal control scheme obtained by the reinforcement learning control model and the intersection signal control scheme generated by the queue-length-priority strategy; if the calculated distance factor is larger than the set distance threshold, implementing the intersection signal control scheme generated by the queue-length-priority strategy at the intersection; otherwise, implementing the intersection signal control scheme obtained by the reinforcement learning control model at the intersection;
and step 6, storing the road traffic state information collected by the intersection agents in the target area, the signal control scheme of each intersection, and the reward from the interaction of each intersection agent with the environment into the databases of the corresponding intersections in real time; when the data stored in an intersection's database are judged to have accumulated to the set size, updating the parameters of the reinforcement learning control model of that intersection, emptying all data in the database after the update is finished, and returning to step 2.
Furthermore, the road traffic state information in step 2 is a set formed by a vehicle planned path matrix, a vehicle position matrix, a lane-to-road-segment correspondence vector and a green time vector, which together capture the overall state and trend of regional traffic;
the vehicle planned path matrix is denoted Distribution_{m×n×4}, where each row corresponds to one lane; the lanes within the agent monitoring range are divided into cells of 1 meter, and each column corresponds to one cell; if a vehicle is present in the k-th cell of lane i at time t, the numbers of the four planned road segments the vehicle may traverse after time t are stored in Distribution(i,k,1), Distribution(i,k,2), Distribution(i,k,3) and Distribution(i,k,4), respectively;
the vehicle position matrix is denoted Pos_{m×n×1}, where each row corresponds to a lane within the agent monitoring range and each column corresponds to one cell; Pos(i,k) = 1 if a vehicle is present in the k-th cell of lane i at time t, and Pos(i,k) = 0 otherwise;
the lane-to-road-segment correspondence vector is denoted I_{m×1}, where I_i is the number of the road segment on which lane i is located;
the green time vector is denoted G_{m×1}, where G_i is the remaining green time of lane i in the current cycle at time t.
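As a concrete illustration, the following is a minimal sketch of how these four state components might be assembled from connected-vehicle messages; the message fields (`lane`, `position_m`, `planned_segments`) and the function signature are hypothetical illustrations, not part of the patent:

```python
import numpy as np

def build_state(vehicles, lane_segment, remaining_green, n_cells):
    """Assemble Distribution, Pos, I and G as defined above.

    vehicles: iterable of dicts with hypothetical fields
        lane (row index), position_m (distance from the stop line in meters),
        planned_segments (up to four planned road segment numbers).
    lane_segment: per-lane road segment numbers (vector I).
    remaining_green: per-lane remaining green time in the cycle (vector G).
    """
    m = len(lane_segment)
    distribution = np.zeros((m, n_cells, 4), dtype=np.int32)
    pos = np.zeros((m, n_cells, 1), dtype=np.int8)
    for v in vehicles:
        i = v["lane"]
        k = min(int(v["position_m"]), n_cells - 1)  # 1 m cells
        pos[i, k, 0] = 1
        for slot, seg in enumerate(v["planned_segments"][:4]):
            distribution[i, k, slot] = seg
    I = np.asarray(lane_segment, dtype=np.int32)
    G = np.asarray(remaining_green, dtype=np.float32)
    return distribution, pos, I, G
```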
Further, the distance factor in step 5 is calculated as:

γ = ‖a_t^{RL} − a_t^{QL}‖

where γ is the distance factor, a_t^{RL} is the intersection signal control scheme obtained by the reinforcement learning control model, and a_t^{QL} is the signal control scheme generated by the queue-length-priority strategy.
Further, in step 6, the road traffic state information collected by the intersection agents in the target area, the signal control scheme of each intersection and the reward from the interaction of each intersection agent with the environment are stored in the database of the corresponding intersection in the form of tuples ⟨s_t, a_t, r_{t+1}, s_{t+1}⟩, where s_t is the road traffic state information collected by the intersection agent at time t; a_t is the signal control scheme implemented at the intersection at time t; r_{t+1} is the reward from the interaction of the intersection agent with the environment at time t+1; and s_{t+1} is the road traffic state information collected by the intersection agent at time t+1.
Further, the reward from the interaction of the intersection agent with the environment is calculated from the waiting time of the first vehicle in each intersection entrance lane, the entrance-lane queue lengths and the exit-lane queue lengths, specifically:

r_{t+1} = −( Σ_{i∈l_in} w_i·q_i + δ·Σ_{j∈l_out} f_j )

where r_{t+1} is the reward from the interaction of the intersection agent with the environment at time t+1; l_in and l_out are the sets of entrance lanes and exit lanes of the intersection; w_i and q_i are the first-vehicle waiting time and the queue length of lane i; f_j is a Boolean variable indicating whether the queue length of exit lane j exceeds three quarters of the road segment length L_j: f_j = 1 if q_j > (3/4)·L_j, otherwise f_j = 0; L_j is the road segment length of lane j; q_j is the queue length of lane j; and δ is a penalty factor.
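Under the reconstruction of the formula above (the negative sign and the exact aggregation are assumptions inferred from the worked example in the description, which evaluates to r ≈ −16.83 − 100·1), a sketch of the reward computation might look like this:

```python
def reward(first_wait, q_in, q_out, segment_len, delta=100.0):
    """r_{t+1} = -(sum over entrance lanes of w_i * q_i
                   + delta * count of exit lanes with q_j > 0.75 * L_j)."""
    waiting_term = sum(w * q for w, q in zip(first_wait, q_in))
    spillback = sum(1 for q, L in zip(q_out, segment_len) if q > 0.75 * L)
    return -(waiting_term + delta * spillback)
```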
Further, in step 6, when the data stored in the intersection database have accumulated to the set size, the parameters of the reinforcement learning control model of the intersection are updated, and all data in the database are emptied after the update is completed; the method includes:
step 6.1, initializing the reinforcement learning control model parameters, including:
initializing the values of the hyper-parameters, including the learning rate α, the distance factor threshold σ and the penalty factor δ;
assigning initial values to the parameters of the action model Actor_θ and the evaluation model Critic_w, where θ and w are the parameters of the action model and the evaluation model to be updated;
defining Actor_old_{θ′} and Critic_old_{w′} as copies of the Actor_θ and Critic_w models; the parameters of the Actor_old_{θ′} model are equal to the parameters of Actor_θ before the update and remain unchanged during the update process;
setting the training iteration counts n_actor and n_critic for Actor_θ and Critic_w;
step 6.2, using all data x_t = ⟨s_t, a_t, r_{t+1}, s_{t+1}⟩ in the database to update the action model in the reinforcement learning control model, including:

step 6.21, calculating A(s_t, a_t) = r_{t+1} + τ·V_w(s_{t+1}) − V_w(s_t),
where V_w(s_{t+1}) is the evaluation result output by the evaluation model Critic_w at time t+1; V_w(s_t) is the evaluation result output by the evaluation model at time t; τ is the discount factor; and A(s_t, a_t) is the advantage of implementing signal control scheme a_t under road traffic state information s_t;

step 6.22, calculating the gradient of the Actor_θ model:

∇_θ J(θ) = E_{(s_t,a_t)∼π_{θ′}}[ ∇_θ min( (P_θ(a_t|s_t)/P_{θ′}(a_t|s_t))·A(s_t,a_t), clip(P_θ(a_t|s_t)/P_{θ′}(a_t|s_t), 1−ε, 1+ε)·A(s_t,a_t) ) ]

where E denotes the mathematical expectation; (s_t, a_t) ∼ π_{θ′} indicates that the data used were obtained from the Actor_old_{θ′} model; P_θ(a_t|s_t) and P_{θ′}(a_t|s_t) are the probabilities that the action models Actor_θ and Actor_old_{θ′} implement signal control scheme a_t under road traffic state information s_t; ∇_θ denotes differentiation with respect to the parameter θ; and ε is the PPO2 clipping ratio;

step 6.23, updating the parameter θ according to the Adam optimization method;

step 6.24, repeating steps 6.22 to 6.23 n_actor times;
step 6.3, using all data x_t = ⟨s_t, a_t, r_{t+1}, s_{t+1}⟩ in the database to update the evaluation model in the reinforcement learning control model, including:

step 6.31, calculating A(s_t, a_t) = r_{t+1} + τ·V_{w′}(s_{t+1}) − V_w(s_t),
where V_{w′}(s_{t+1}) is the output of the evaluation model Critic_old_{w′};

step 6.32, calculating the gradient of the evaluation model Critic_w:

∇_w L(w) = ∇_w E[ (r_{t+1} + τ·V_{w′}(s_{t+1}) − V_w(s_t))² ]

where ∇_w denotes differentiation with respect to the parameter w;

step 6.33, updating the parameter w according to the Adam optimization method;

step 6.34, repeating steps 6.31 to 6.33 n_critic times;

and step 6.4, emptying all data information in the database.
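For concreteness, a compact sketch of the update in steps 6.2 to 6.3 follows. The clipped surrogate with ε = 0.2 and τ = 0.9 are assumptions based on the standard PPO2 objective; the `actor(s)`-returns-a-distribution interface and the batch layout are illustrative, not specified by the patent:

```python
import torch

def ppo2_update(actor, critic, actor_old, critic_old, batch,
                n_actor=10, n_critic=10, tau=0.9, eps=0.2, lr=1e-4):
    """One update round from a full replay batch, mirroring steps 6.2-6.3.

    batch holds tensors s, a, r, s_next; actor(s) is assumed to return a
    torch.distributions object over schemes (illustrative interface).
    Optimizers are created here only to keep the sketch self-contained;
    in practice they would persist across updates.
    """
    opt_actor = torch.optim.Adam(actor.parameters(), lr=lr)
    opt_critic = torch.optim.Adam(critic.parameters(), lr=lr)
    s, a, r, s_next = batch["s"], batch["a"], batch["r"], batch["s_next"]

    with torch.no_grad():
        # advantage A(s_t, a_t) = r_{t+1} + tau*V_w(s_{t+1}) - V_w(s_t)
        adv = r + tau * critic(s_next).squeeze(-1) - critic(s).squeeze(-1)
        logp_old = actor_old(s).log_prob(a)      # pi_theta' stays fixed
        target = r + tau * critic_old(s_next).squeeze(-1)

    for _ in range(n_actor):                     # steps 6.22-6.24
        ratio = torch.exp(actor(s).log_prob(a) - logp_old)
        clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
        loss = -torch.min(ratio * adv, clipped * adv).mean()
        opt_actor.zero_grad(); loss.backward(); opt_actor.step()

    for _ in range(n_critic):                    # steps 6.31-6.34
        loss = ((target - critic(s).squeeze(-1)) ** 2).mean()
        opt_critic.zero_grad(); loss.backward(); opt_critic.step()
```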
Beneficial effects: compared with the prior art, the technical scheme of the invention has the following beneficial technical effects:
The intersection and its adjacent intersections in the target area are taken as the monitoring object, and the regional vehicle planned path data available in the Internet of Vehicles environment are incorporated into the state variables, so the road traffic state is represented more comprehensively. A regional signal control reinforcement learning model is constructed by combining the PPO algorithm with an LSTM network. A distance factor is proposed to measure the distance between the scheme generated by the reinforcement learning model and the scheme generated by the traditional queue-length-priority strategy, which effectively prevents a not-yet-fully-trained model from generating an improper signal control scheme during online learning and harming road traffic safety and efficiency.
Drawings
FIG. 1 is a schematic diagram of modeling a lane condition in one embodiment;
FIG. 2 is a diagram of the road traffic state result for a lane in one embodiment;
FIG. 3 is a schematic illustration of an intersection in one embodiment;
FIG. 4 is a schematic structural diagram of a PPO2 model according to an embodiment;
FIG. 5 is a logic flow diagram of a method of the present invention in one embodiment.
Detailed Description
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Referring to fig. 5, the method of the present invention is further illustrated by taking an intersection as an example. The invention relates to a reinforcement learning area signal control method based on a vehicle planned path, which specifically comprises the following steps:
(1) designing a control framework of an intelligent agent in regional traffic signal control, modeling a road traffic state, and comprising the following steps:
under the environment of the Internet of vehicles, the regional path and the vehicle position information of the vehicle are fully utilized, so that the road traffic state is more comprehensively grasped and analyzed.
Specifically, each intersection is used as an independent agent, the intersection and the entrance lane of the adjacent intersection are used as observation ranges, and planning path information and vehicle position information of vehicles in the ranges are collected.
The regional vehicle planned path matrix is defined as Distribution_{m×n×4}; each row corresponds to one lane, and the lanes within the agent monitoring range are divided into cells of 1 meter; if a vehicle is present in the k-th cell of lane i at time t, the numbers of the four road segments planned after time t for that vehicle are stored in Distribution(i,k,1), Distribution(i,k,2), Distribution(i,k,3) and Distribution(i,k,4), respectively;
The vehicle position matrix is defined as Pos_{m×n×1}; each row of the matrix corresponds to a lane within the monitoring range, with the lane divided into 1 m cells, and Pos(i,k) = 1 if a vehicle is present in the k-th cell of lane i at time t;
I_{m×1} is defined as the lane-to-road-segment correspondence vector, where I_i is the number of the road segment on which lane i is located;
G_{m×1} is defined as the green time vector, where G_i is the remaining green time of lane i in the current cycle at time t.
The traffic environment state s is defined as the set formed by Distribution_{m×n×4}, Pos_{m×n×1}, I_{m×1} and G_{m×1}, which effectively captures the overall state and trend of regional traffic.
In this embodiment, referring to fig. 1, the road traffic state result of lane i, i.e. Distribution(i,·,·), Pos(i,·), I_i and G_i, is shown in fig. 2;
(2) constructing a reinforcement learning control model, defining the input and the output of the model, and comprising the following steps:
referring to fig. 4, the reinforcement learning control model adopts a distributed control mode, and an agent needs to give a phase scheme and a timing scheme of an intersection at the same time; each intersection is used as an independent agent, and a reinforcement learning control model is independently trained; taking all the entrances of the intersection and the adjacent intersections as monitoring ranges, each intelligent agent collects the path information and the position information of all vehicles at the intersection in real time, and simultaneously obtains the path information and the position information of all vehicles at the entrances of the adjacent intersections and the remaining green time of each lane from the adjacent intersections as the input of a PPO2 algorithm.
The PPO2 model comprises an action model Actor and an evaluation model Critic, and the action model outputs a signal control scheme a; the evaluation model outputs an evaluation v of the signal control scheme a.
To improve the efficiency of training, Actor and Critic will share the underlying input layer. Meanwhile, the PPO2 algorithm is combined with a long and short memory model to enhance the memory of the model to the historical state, so that the intelligent agent can make more reasonable decisions.
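A minimal sketch of such a shared-encoder Actor-Critic with an LSTM layer is shown below; the layer sizes, the flattened-state input and the categorical phase head are illustrative assumptions, as the patent does not specify the network architecture:

```python
import torch
import torch.nn as nn

class ActorCriticLSTM(nn.Module):
    """Shared encoder + LSTM trunk with separate actor/critic heads."""
    def __init__(self, state_dim, n_phases, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.actor_head = nn.Linear(hidden, n_phases)   # phase logits
        self.critic_head = nn.Linear(hidden, 1)         # state value v

    def forward(self, s_seq, hc=None):
        # s_seq: (batch, time, state_dim) sequence of flattened states
        z, hc = self.lstm(self.encoder(s_seq), hc)
        z_last = z[:, -1]                  # decide from the latest step
        dist = torch.distributions.Categorical(logits=self.actor_head(z_last))
        return dist, self.critic_head(z_last), hc
```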
(3) The agent interacts with the road traffic environment. Specifically: at time t, the agent reads the road traffic state s_t and, according to the PPO2 control model, outputs the signal scheme a_t^{RL} for time t, which gives the release phase and the green time at this intersection.
The phase set is defined as the set of combinations of all non-conflicting traffic flows at the intersection; for example, for a typical four-leg intersection with an independent entrance lane for each flow direction, the action set is defined as {north-south through, north-south left turn, east-west through, east-west left turn, south approach through-and-left, north approach through-and-left, east approach through-and-left, west approach through-and-left}, and the duration for which each signal phase is executed is not fixed.
The signal control scheme a_t^{RL} generated by the agent at time t is not used for signal control directly; instead, the distance factor γ between it and the signal control scheme a_t^{QL} generated by the queue-length-priority strategy is calculated first. The queue-length-priority strategy means that the intersection always gives the green time to the phase with the longest queue. The distance factor γ measures the distance between the signal control scheme a_t^{RL} output by the PPO2 control model and the signal control scheme a_t^{QL} generated by the queue-length-priority strategy; the formula is:

γ = ‖a_t^{RL} − a_t^{QL}‖

When γ is greater than a given threshold, the intersection signal control scheme generated by the queue-length-priority strategy is actually implemented at the intersection at time t, i.e. a_t = a_t^{QL}; otherwise, the intersection signal control scheme obtained by the PPO2 control model is implemented at the intersection, i.e. a_t = a_t^{RL}. After action a_t is implemented, the traffic environment enters the next state s_{t+1}.
In this embodiment, taking the signal control scheme a_t^{RL} generated by the PPO2 control model and the signal control scheme a_t^{QL} generated by the queue-length-priority strategy as an example, the computed distance factor is smaller than the threshold σ = 6, so the intersection signal control scheme obtained by the PPO2 control model is implemented at the intersection, i.e. a_t = a_t^{RL}; after conflict-free processing of a_t, a_t = [15, 0, 0, 0, 0, 0, 0, 0].
(4) Design the reward r for the interaction between the agent and the environment, with the goals of reducing the intersection queue length, reducing vehicle delay and avoiding downstream traffic congestion. The reward is defined as the negative sum of the entrance-lane queue lengths weighted by the first-vehicle waiting times, plus the penalized exit-lane queue-length index, i.e. reward = first-vehicle waiting time w × entrance-lane queue length q + δ × exit-lane queue-length index f:

r_{t+1} = −( Σ_{i∈l_in} w_i·q_i + δ·Σ_{j∈l_out} f_j )

where l_in and l_out are the sets of entrance and exit lanes of the intersection; w_i and q_i are the first-vehicle waiting time and the queue length of lane i; f_j is a Boolean variable indicating whether the queue length of exit lane j exceeds three quarters of the road segment length L_j: f_j = 1 if q_j > (3/4)·L_j, otherwise f_j = 0; L_j is the road segment length of lane j; q_j is the queue length of lane j; and δ is a penalty factor.
In this embodiment, referring to fig. 3,
the queue length q of each entrance lane of the intersection is [20, 14, 32, 20, 15, 24, 20, 15, 20, 26, 18, 18, 12, 30];
the waiting time w of the first vehicle at each entrance lane is [25, 25, 15, 15, 0, 0, 36, 25, 25, 15, 15, 0, 0, 36];
the queue length p of each exit lane is [5, 22, 12, 14, 118, 34, 12, 18, 18, 10, 5, 24, 5, 13], which converts to the Boolean variable f = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0];
taking δ = 100 and converting w into hours, we obtain:
r ≈ -16.83 - 100 × 1 = -116.83
(5) Store the data from the interaction between the agent and the environment in a database serving as the replay buffer. When the data, in the form ⟨s_t, a_t, r_{t+1}, s_{t+1}⟩, have accumulated in the database to the set size Z, the PPO2 model parameters are updated; the steps are as follows (a minimal replay-buffer sketch is given after step (5.4)):
(5.1) Initialize the reinforcement learning control model parameters, including:
initializing the values of the hyper-parameters: the learning rate α = 0.0001, the distance factor threshold σ = 6, the penalty factor δ = 100 and Z = 512;
assigning initial values to the parameters of the action model Actor_θ and the evaluation model Critic_w, where θ and w are the parameters of the action model and the evaluation model to be updated;
defining Actor_old_{θ′} and Critic_old_{w′} as copies of the Actor_θ and Critic_w models; the parameters of the Actor_old_{θ′} model are equal to the parameters of Actor_θ before the update and remain unchanged during the update;
setting the training iteration counts n_actor = 10 and n_critic = 10 for Actor_θ and Critic_w;
(5.2) Use all data x_t = ⟨s_t, a_t, r_{t+1}, s_{t+1}⟩ in the database to update the action model in the PPO2 control model, including:

(5.21) Calculate A(s_t, a_t) = r_{t+1} + τ·V_w(s_{t+1}) − V_w(s_t),
where V_w(s_{t+1}) is the evaluation result output by the evaluation model Critic_w at time t+1; V_w(s_t) is the evaluation result output by the evaluation model at time t; τ is the discount factor; and A(s_t, a_t) is the advantage of implementing signal control scheme a_t under road traffic state information s_t;

(5.22) Calculate the gradient of the Actor_θ model:

∇_θ J(θ) = E_{(s_t,a_t)∼π_{θ′}}[ ∇_θ min( (P_θ(a_t|s_t)/P_{θ′}(a_t|s_t))·A(s_t,a_t), clip(P_θ(a_t|s_t)/P_{θ′}(a_t|s_t), 1−ε, 1+ε)·A(s_t,a_t) ) ]

where E denotes the mathematical expectation; (s_t, a_t) ∼ π_{θ′} indicates that the data used were obtained from the Actor_old_{θ′} model; P_θ(a_t|s_t) and P_{θ′}(a_t|s_t) are the probabilities that the action models Actor_θ and Actor_old_{θ′} implement signal control scheme a_t under road traffic state information s_t; ∇_θ denotes differentiation with respect to the parameter θ; and ε is the PPO2 clipping ratio;

(5.23) Update the parameter θ according to the Adam optimization method;

(5.24) Repeat steps (5.22) to (5.23) 10 times;
(5.3) Use all data x_t = ⟨s_t, a_t, r_{t+1}, s_{t+1}⟩ in the database to update the evaluation model in the PPO2 control model, including:

(5.31) Calculate A(s_t, a_t) = r_{t+1} + τ·V_{w′}(s_{t+1}) − V_w(s_t),
where V_{w′}(s_{t+1}) is the output of the evaluation model Critic_old_{w′};

(5.32) Calculate the gradient of the evaluation model Critic_w:

∇_w L(w) = ∇_w E[ (r_{t+1} + τ·V_{w′}(s_{t+1}) − V_w(s_t))² ]

where ∇_w denotes differentiation with respect to the parameter w;

(5.33) Update the parameter w according to the Adam optimization method;

(5.34) Repeat steps (5.31) to (5.33) 10 times;
(5.4) Empty the replay buffer and repeat steps (3) to (5).
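The sketch referenced in step (5) above: minimal replay-buffer bookkeeping with the Z = 512 trigger. The helper `as_batch` (collating stored tuples into tensors) and the `ppo2_update` call from the earlier sketch are illustrative, not named in the patent:

```python
class ReplayBuffer:
    """Accumulates <s_t, a_t, r_{t+1}, s_{t+1}> tuples until the set
    size Z is reached, then signals that an update should run."""
    def __init__(self, z=512):
        self.z = z
        self.data = []

    def store(self, s, a, r_next, s_next):
        self.data.append((s, a, r_next, s_next))
        return len(self.data) >= self.z   # True -> trigger the update

# Usage inside the control loop:
# if buffer.store(s_t, a_t, r_next, s_next):
#     ppo2_update(actor, critic, actor_old, critic_old, as_batch(buffer.data))
#     buffer.data.clear()                # step (5.4): empty the buffer
```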

Claims (4)

1. A reinforcement learning area signal control method based on a vehicle planned path is characterized by comprising the following steps:
step 1, designing a control framework of an intelligent agent in traffic signal control of a target area, and modeling a road traffic state, wherein the control framework comprises the following steps: taking each intersection in the target area as an independent intelligent agent, and constructing a respective corresponding reinforcement learning control model and a database for each independent intelligent agent;
step 2: enabling an independent intelligent agent at the intersection to interact with the environment of the intersection, and collecting road traffic state information within a certain range of the intersection in real time; the certain range includes the intersection and an entrance lane of an adjacent intersection;
the road traffic state information is a set formed by a vehicle planned path matrix, a vehicle position matrix, a lane-to-road-segment correspondence vector and a green time vector;
the vehicle planned path matrix is denoted Distribution_{m×n×4}, where each row corresponds to one lane; the lanes within the intelligent agent monitoring range are divided into cells of 1 meter, and each column corresponds to one cell; if a vehicle is present in the k-th cell of lane i at time t, the numbers of the four planned road segments the vehicle may traverse after time t are stored in Distribution(i,k,1), Distribution(i,k,2), Distribution(i,k,3) and Distribution(i,k,4), respectively;
the vehicle position matrix is denoted Pos_{m×n×1}, where each row corresponds to a lane within the intelligent agent monitoring range and each column corresponds to one cell; Pos(i,k) = 1 if a vehicle is present in the k-th cell of lane i at time t, and Pos(i,k) = 0 otherwise;
the lane-to-road-segment correspondence vector is denoted I_{m×1}, where I_i is the number of the road segment on which lane i is located;
the green time vector is denoted G_{m×1}, where G_i is the remaining green time of lane i in the current cycle at time t;
step 3, taking the road traffic state information of the intersection at the current moment as the input of a reinforcement learning control model corresponding to the intersection to obtain an intersection signal control scheme at the next moment of the current moment and an evaluation result of the control scheme; the signal control scheme comprises a release phase and a green time;
step 4, generating the intersection signal control scheme for the next moment by using the queue-length-priority strategy, according to the road traffic state information of the intersection at the current moment;
step 5, calculating the distance factor between the intersection signal control scheme obtained by the reinforcement learning control model and the intersection signal control scheme generated by the queue-length-priority strategy; if the calculated distance factor is larger than the set distance threshold, implementing the intersection signal control scheme generated by the queue-length-priority strategy at the intersection; otherwise, implementing the intersection signal control scheme obtained by the reinforcement learning control model at the intersection;
the calculation formula of the distance factor is:

γ = ‖a_t^{RL} − a_t^{QL}‖

wherein γ is the distance factor; a_t^{RL} is the intersection signal control scheme obtained by the reinforcement learning control model; and a_t^{QL} is the signal control scheme generated by the queue-length-priority strategy;
and step 6, storing the road traffic state information collected by the intersection agents in the target area, the signal control scheme of each intersection, and the reward from the interaction of each intersection agent with the environment into the databases of the corresponding intersections in real time; when the data stored in an intersection's database are judged to have accumulated to the set size, updating the parameters of the reinforcement learning control model of that intersection, emptying all data in the database after the update is finished, and returning to step 2.
2. The reinforcement learning area signal control method based on a vehicle planned path according to claim 1, wherein in step 6 the road traffic state information collected by the intersection agents in the target area, the signal control scheme of each intersection and the reward from the interaction of each intersection agent with the environment are stored in the database of the corresponding intersection in the form of tuples ⟨s_t, a_t, r_{t+1}, s_{t+1}⟩, wherein s_t is the road traffic state information collected by the intersection agent at time t; a_t is the signal control scheme implemented at the intersection at time t; r_{t+1} is the reward from the interaction of the intersection agent with the environment at time t+1; and s_{t+1} is the road traffic state information collected by the intersection agent at time t+1.
3. The reinforcement learning area signal control method based on a vehicle planned path according to claim 1, wherein the reward from the interaction of the intersection agent with the environment is calculated from the first-vehicle waiting time of each intersection entrance lane, the entrance-lane queue lengths and the exit-lane queue lengths, specifically:

r_{t+1} = −( Σ_{i∈l_in} w_i·q_i + δ·Σ_{j∈l_out} f_j )

wherein r_{t+1} is the reward from the interaction of the intersection agent with the environment at time t+1; l_in and l_out are the sets of entrance lanes and exit lanes of the intersection; w_i and q_i are the first-vehicle waiting time and the queue length of lane i; f_j is a Boolean variable indicating whether the queue length of exit lane j exceeds three quarters of the road segment length L_j: f_j = 1 if q_j > (3/4)·L_j, otherwise f_j = 0; L_j is the road segment length of lane j; q_j is the queue length of lane j; and δ is a penalty factor.
4. The reinforcement learning area signal control method based on a vehicle planned path according to claim 2, wherein in step 6, when the data stored in the intersection database have accumulated to the set size, the parameters of the reinforcement learning control model of the intersection are updated, and all data in the database are emptied after the update is finished, comprising:

step 6.1, initializing the reinforcement learning control model parameters, including:
initializing the values of the hyper-parameters, including the learning rate α, the distance factor threshold σ and the penalty factor δ;
assigning initial values to the parameters of the action model Actor_θ and the evaluation model Critic_w, wherein θ and w are the parameters of the action model and the evaluation model to be updated;
defining Actor_old_{θ′} and Critic_old_{w′} as copies of the Actor_θ and Critic_w models; the parameters of the Actor_old_{θ′} model are equal to the parameters of Actor_θ before the update and remain unchanged during the update process;
setting the training iteration counts n_actor and n_critic for Actor_θ and Critic_w;
step 6.2, using all data x_t = ⟨s_t, a_t, r_{t+1}, s_{t+1}⟩ in the database to update the action model in the reinforcement learning control model, comprising:

step 6.21, calculating A(s_t, a_t) = r_{t+1} + τ·V_w(s_{t+1}) − V_w(s_t),
wherein V_w(s_{t+1}) is the evaluation result output by the evaluation model Critic_w at time t+1; V_w(s_t) is the evaluation result output by the evaluation model at time t; τ is the discount factor; and A(s_t, a_t) is the advantage of implementing signal control scheme a_t under road traffic state information s_t;

step 6.22, calculating the gradient of the Actor_θ model:

∇_θ J(θ) = E_{(s_t,a_t)∼π_{θ′}}[ ∇_θ min( (P_θ(a_t|s_t)/P_{θ′}(a_t|s_t))·A(s_t,a_t), clip(P_θ(a_t|s_t)/P_{θ′}(a_t|s_t), 1−ε, 1+ε)·A(s_t,a_t) ) ]

wherein E denotes the mathematical expectation; (s_t, a_t) ∼ π_{θ′} indicates that the data used are obtained from the Actor_old_{θ′} model; P_θ(a_t|s_t) and P_{θ′}(a_t|s_t) are the probabilities that the action models Actor_θ and Actor_old_{θ′} implement signal control scheme a_t under road traffic state information s_t; ∇_θ denotes differentiation with respect to the parameter θ; and ε is the PPO2 clipping ratio;

step 6.23, updating the parameter θ according to the Adam optimization method;

step 6.24, repeating steps 6.22 to 6.23 n_actor times;
step 6.3, using all data x_t = ⟨s_t, a_t, r_{t+1}, s_{t+1}⟩ in the database to update the evaluation model in the reinforcement learning control model, comprising:

step 6.31, calculating A(s_t, a_t) = r_{t+1} + τ·V_{w′}(s_{t+1}) − V_w(s_t),
wherein V_{w′}(s_{t+1}) is the output of the evaluation model Critic_old_{w′};

step 6.32, calculating the gradient of the evaluation model Critic_w:

∇_w L(w) = ∇_w E[ (r_{t+1} + τ·V_{w′}(s_{t+1}) − V_w(s_t))² ]

wherein ∇_w denotes differentiation with respect to the parameter w;

step 6.33, updating the parameter w according to the Adam optimization method;

step 6.34, repeating steps 6.31 to 6.33 n_critic times;

and step 6.4, emptying all data information in the database.
CN202110534127.1A 2021-05-17 2021-05-17 Reinforcement learning area signal control method based on vehicle planned path Active CN113487902B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110534127.1A CN113487902B (en) Reinforcement learning area signal control method based on vehicle planned path

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110534127.1A CN113487902B (en) Reinforcement learning area signal control method based on vehicle planned path

Publications (2)

Publication Number Publication Date
CN113487902A CN113487902A (en) 2021-10-08
CN113487902B true CN113487902B (en) 2022-08-12

Family

ID=77933576

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110534127.1A Active CN113487902B (en) Reinforcement learning area signal control method based on vehicle planned path

Country Status (1)

Country Link
CN (1) CN113487902B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114550470B (en) * 2022-03-03 2023-08-22 Shenyang University of Chemical Technology Wireless network interconnection intelligent traffic signal lamp
CN114667852B (en) * 2022-03-14 2023-04-14 Guangxi University Hedge trimming robot intelligent cooperative control method based on deep reinforcement learning
CN116092297B (en) * 2023-04-07 2023-06-27 Nanjing University of Aeronautics and Astronautics Edge calculation method and system for low-permeability distributed differential signal control


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
RU2379761C1 * 2008-07-01 2010-01-20 State Educational Institution of Higher Professional Education "Ural State Technical University UPI named after the first President of Russia B.N. Yeltsin" Method of controlling road traffic at intersection
CN105046987A (en) * 2015-06-17 2015-11-11 Soochow University Pavement traffic signal lamp coordination control method based on reinforcement learning
CN112365724A (en) * 2020-04-13 2021-02-12 North China University of Technology Continuous intersection signal cooperative control method based on deep reinforcement learning
CN111915894A (en) * 2020-08-06 2020-11-10 Beihang University Variable lane and traffic signal cooperative control method based on deep reinforcement learning
CN112632858A (en) * 2020-12-23 2021-04-09 Zhejiang University of Technology Traffic light signal control method based on Actor-Critic framework deep reinforcement learning algorithm

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Compatibility-Based Approach for Routing and Scheduling the Demand Responsive Connector; Yunxue Lu, Hao Wang; IEEE Access; 2020-05-26; vol. 8; pp. 101770-101783 *
Multi-agent-based urban traffic regional coordination control method (in Chinese); Huang Yanguo et al.; Journal of Wuhan University of Technology (Transportation Science & Engineering); 2010-04-15; vol. 34, no. 2; pp. 197-200 *

Also Published As

Publication number Publication date
CN113487902A (en) 2021-10-08

Similar Documents

Publication Publication Date Title
CN113487902B (en) Reinforcement learning area signal control method based on vehicle planned path
CN111696370A (en) Traffic light control method based on heuristic deep Q network
CN103593535A (en) Urban traffic complex self-adaptive network parallel simulation system and method based on multi-scale integration
CN113436443B (en) Distributed traffic signal control method based on generation of countermeasure network and reinforcement learning
Lin et al. Traffic signal optimization based on fuzzy control and differential evolution algorithm
CN113299078B (en) Multi-mode traffic trunk line signal coordination control method and device based on multi-agent cooperation
CN106781465A (en) A kind of road traffic Forecasting Methodology
Han et al. Leveraging reinforcement learning for dynamic traffic control: A survey and challenges for field implementation
CN113112823A (en) Urban road network traffic signal control method based on MPC
CN116513273A (en) Train operation scheduling optimization method based on deep reinforcement learning
CN113362618B (en) Multi-mode traffic adaptive signal control method and device based on strategy gradient
Tamimi et al. Intelligent traffic light based on genetic algorithm
Shao et al. Machine learning enabled traffic prediction for speed optimization of connected and autonomous electric vehicles
CN115762128B (en) Deep reinforcement learning traffic signal control method based on self-attention mechanism
CN116758765A (en) Multi-target signal control optimization method suitable for multi-mode traffic
CN114627658B (en) Traffic control method for major special motorcade to pass through expressway
CN111311905A (en) Particle swarm optimization wavelet neural network-based expressway travel time prediction method
CN114701517A (en) Multi-target complex traffic scene automatic driving solution based on reinforcement learning
Cenedese et al. A novel control-oriented cell transmission model including service stations on highways
Zhong et al. Deep Q-Learning Network Model for Optimizing Transit Bus Priority at Multiphase Traffic Signal Controlled Intersection
Wei et al. Intersection signal control approach based on pso and simulation
Guo et al. Network Multi-scale Urban Traffic Control with Mixed Traffic Flow
Tan et al. Optimization of signalized traffic network using swarm intelligence
CN116189464B (en) Cross entropy reinforcement learning variable speed limit control method based on refined return mechanism
Miletić et al. Impact of Connected Vehicles on Learning based Adaptive Traffic Control Systems

Legal Events

Code: Description
PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant