CN113487902B - Reinforcement learning area signal control method based on vehicle planned path - Google Patents
- Publication number: CN113487902B
- Application number: CN202110534127A
- Authority
- CN
- China
- Prior art keywords
- intersection
- model
- signal control
- lane
- reinforcement learning
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G08—SIGNALLING
- G08G—TRAFFIC CONTROL SYSTEMS
- G08G1/00—Traffic control systems for road vehicles
- G08G1/09—Arrangements for giving variable traffic instructions
- G08G1/0962—Arrangements for giving variable traffic instructions having an indicator mounted inside the vehicle, e.g. giving voice messages
- G08G1/0968—Systems involving transmission of navigation instructions to the vehicle
- G08G1/096833—Systems involving transmission of navigation instructions to the vehicle where different aspects are considered when computing the route
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F30/00—Computer-aided design [CAD]
- G06F30/20—Design optimisation, verification or simulation
- G06F30/27—Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/04—Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
- G06Q10/047—Optimisation of routes or paths, e.g. travelling salesman problem
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
- G06Q50/10—Services
- G06Q50/26—Government or public services
-
- G—PHYSICS
- G08—SIGNALLING
- G08G—TRAFFIC CONTROL SYSTEMS
- G08G1/00—Traffic control systems for road vehicles
- G08G1/01—Detecting movement of traffic to be counted or controlled
- G08G1/0104—Measuring and analyzing of parameters relative to traffic conditions
- G08G1/0125—Traffic data processing
-
- G—PHYSICS
- G08—SIGNALLING
- G08G—TRAFFIC CONTROL SYSTEMS
- G08G1/00—Traffic control systems for road vehicles
- G08G1/09—Arrangements for giving variable traffic instructions
- G08G1/0962—Arrangements for giving variable traffic instructions having an indicator mounted inside the vehicle, e.g. giving voice messages
- G08G1/0967—Systems involving transmission of highway information, e.g. weather, speed limits
- G08G1/096766—Systems involving transmission of highway information, e.g. weather, speed limits where the system is characterised by the origin of the information transmission
- G08G1/096791—Systems involving transmission of highway information, e.g. weather, speed limits where the system is characterised by the origin of the information transmission where the origin of the information is another vehicle
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention discloses a reinforcement learning area signal control method based on vehicle planned paths. In an internet-of-vehicles environment, the planned-path information and position information of all vehicles at the intersections within an agent's control range are collected, and distributed signal control of the road intersections in the area is performed with the reinforcement learning PPO2 algorithm, realizing coordinated optimization of area traffic. In particular, a control framework in which multi-agent reinforcement learning controls the area traffic signals is provided; the road traffic state is defined from the vehicle planned-path information and vehicle position information; an action variable controlling the intersection signal is defined; the reward for interaction between agent and traffic environment is defined with the goals of shortening intersection queues, reducing vehicle delay and avoiding downstream congestion; meanwhile, a distance factor is proposed to measure the gap between the control scheme generated by the PPO2 algorithm and the scheme generated by a queue-length-priority strategy, so that abnormal road traffic disturbance caused by a poor control scheme output by the PPO2 algorithm is avoided.
Description
Technical Field
The invention belongs to the field of traffic management and control, and particularly relates to a reinforcement learning area signal control method based on a vehicle planned path.
Background
In the internet-of-vehicles environment, vehicles exchange information, including their local paths, positions and speeds, with roadside facilities in real time through on-board equipment. Reinforcement-learning-based signal control methods typically take vehicle position and speed as algorithm inputs to produce a more accurate signal control scheme. The local route information of vehicles, although easily obtained in this environment and able to reflect how vehicles and traffic flow are distributed over the road network, is rarely used in signal control. In addition, when existing multi-agent reinforcement learning algorithms perform distributed control of area traffic, a single intersection is usually treated as an independent agent, and the feedback the agent receives after generating a signal control scheme usually considers only the vehicle queuing or delay at that intersection; this design is unfavourable to joint control of area traffic. Moreover, existing reinforcement learning models typically require a pre-training stage of interaction with a traffic simulator, such as SUMO, to accumulate data for model training. However, a simulated traffic environment differs from a real traffic system, and when a model trained in the simulator is transferred to the actual traffic environment, its control effect is poor.
Disclosure of Invention
The purpose of the invention is as follows: aiming at the above problems, the invention provides a reinforcement learning area signal control method based on vehicle planned paths. In the reinforcement learning algorithm, the local planned paths of vehicles serve as reference information of the signal control scheme and enter the state input, so that the overall state and trend of area traffic can be effectively grasped, the predictability of the control scheme is improved, and optimization of the overall traffic running state of the area is facilitated.
The technical scheme is as follows: to realize the purpose of the invention, the technical scheme adopted by the invention is a reinforcement learning area signal control method based on vehicle planned paths, comprising the following steps:

Step 1: designing a control framework of the agents in the traffic signal control of a target area and modeling the road traffic state, including: taking each intersection in the target area as an independent agent, and constructing for each agent a corresponding reinforcement learning control model and database;
Step 2: making the independent agent of an intersection interact with the environment of that intersection, collecting road traffic state information within a certain range of the intersection in real time; the certain range includes the intersection itself and the entrance lanes of the adjacent intersections;
Step 3: taking the road traffic state information of the intersection at the current moment as input of the reinforcement learning control model corresponding to the intersection, to obtain the intersection signal control scheme for the next moment and an evaluation result of that scheme; the signal control scheme comprises a release phase and a green time;
Step 4: generating, from the road traffic state information of the intersection at the current moment, an intersection signal control scheme for the next moment with a queue-length-priority strategy;

Step 5: calculating a distance factor from the intersection signal control scheme obtained by the reinforcement learning control model and the one generated by the queue-length-priority strategy; if the calculated distance factor is larger than the set distance threshold, implementing at the intersection the scheme generated by the queue-length-priority strategy; otherwise, implementing at the intersection the scheme obtained by the reinforcement learning control model;
Step 6: storing in real time, into the database corresponding to each intersection, the road traffic state information collected by the intersection agents in the target area, the signal control schemes corresponding to the intersections, and the rewards of the interaction between the intersection agents and the environment; when the data accumulated in an intersection's database is judged to have reached a set size, updating the parameters of the reinforcement learning control model corresponding to that intersection, emptying all data in the database after the update is finished, and returning to Step 2.
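The per-intersection loop of the steps above can be sketched as follows. This is a minimal structural sketch, not the patent's implementation: every callable passed in (observe, policies, environment step, model update) is a hypothetical stand-in, and the buffer size Z is reduced from the embodiment's 512 so the sketch is easy to exercise.

```python
SIGMA = 6   # distance-factor threshold (sigma = 6 in the embodiment)
Z = 4       # replay size that triggers a model update (512 in the embodiment)

def step(observe, rl_policy, queue_policy, distance, apply_action, buffer, update):
    """One pass of Steps 2-6 for a single intersection agent (hypothetical API)."""
    s_t = observe()                        # Step 2: collect road traffic state
    a_rl = rl_policy(s_t)                  # Step 3: scheme from the RL model
    a_queue = queue_policy(s_t)            # Step 4: queue-length-priority scheme
    gamma = distance(a_rl, a_queue)        # Step 5: distance factor
    a_t = a_queue if gamma > SIGMA else a_rl
    r_next, s_next = apply_action(a_t)
    buffer.append((s_t, a_t, r_next, s_next))
    if len(buffer) >= Z:                   # Step 6: update model, empty database
        update(list(buffer))
        buffer.clear()
    return a_t
```

The gate in the middle is what keeps an immature RL model from disturbing traffic: a scheme too far from the queue-length-priority baseline is simply not implemented.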
Furthermore, the road traffic state information in Step 2 is the set formed by a vehicle planned-path matrix, a vehicle position matrix, a lane-to-link correspondence vector and a green time vector, from which the overall state and trend of area traffic can be effectively grasped;

the vehicle planned-path matrix is denoted Distribution_{m×n×4}, in which each row corresponds to one lane, the lanes within the agent's monitoring range are divided into cells of 1 metre, and each column corresponds to one cell; if a vehicle exists in the k-th cell of lane i at time t, the four planned link numbers the vehicle may pass after time t are stored in Distribution(i,k,1), Distribution(i,k,2), Distribution(i,k,3) and Distribution(i,k,4) respectively;

the vehicle position matrix is denoted Pos_{m×n×1}, in which each row corresponds to a lane within the agent's monitoring range and each column corresponds to one cell; Pos(i,k) is 1 if a vehicle exists in the k-th cell of lane i at time t, and 0 otherwise;

the lane-to-link correspondence vector is denoted I_{m×1}; I_i is the link number of the link on which lane i is located;

the green time vector is denoted G_{m×1}; G_i is the remaining green time of lane i in the current cycle at time t.
Further, the distance factor in Step 5 is calculated from the scheme pair, taking the form γ = d(a_t^{RL}, a_t^{queue}), wherein γ is the distance factor; a_t^{RL} is the intersection signal control scheme obtained by the reinforcement learning control model; a_t^{queue} is the signal control scheme generated by the queue-length-priority strategy.
Further, in Step 6 the road traffic state information collected by the intersection agents in the target area, the signal control schemes corresponding to the intersections and the rewards of the interaction between the intersection agents and the environment are stored into the database of each intersection in the form <s_t, a_t, r_{t+1}, s_{t+1}>; wherein s_t is the road traffic state information collected by the intersection agent at time t; a_t is the signal control scheme implemented at the intersection at time t; r_{t+1} is the reward of the interaction between the intersection agent and the environment at time t+1; and s_{t+1} is the road traffic state information collected by the intersection agent at time t+1.
Further, the reward of the interaction between the intersection agent and the environment is calculated from the waiting time of the first vehicle in each intersection entrance lane, the entrance-lane queue lengths and the exit-lane queue lengths, specifically:

in the formula, r_{t+1} is the reward of the interaction between the intersection agent and the environment at time t+1; l_in and l_out are the sets of entrance and exit lanes of the intersection respectively; w_i and q_i are the first-vehicle waiting time and queue length of lane i respectively; f_j is a Boolean variable measuring whether the queue length of exit lane j exceeds three quarters of the link length L_j: if q_j > 3L_j/4 then f_j = 1, otherwise f_j = 0; L_j is the link length of lane j; q_j is the queue length of lane j; δ is a penalty factor.
Further, in Step 6, when the data stored in an intersection's database has accumulated to the set size, the parameters of the reinforcement learning control model corresponding to that intersection are updated, and all data in the database are emptied after the update, as follows:
Step 6.1: initializing the reinforcement learning control model parameters, including:

initializing the values of the hyper-parameters, including the learning rate α, the threshold σ of the distance factor and the penalty factor δ;

assigning initial values to the parameters of the action model Actor_θ and the evaluation model Critic_w, wherein θ and w are the parameters to be updated of the action model and the evaluation model respectively;

defining Actor_old_{θ′} and Critic_old_{w′} as copies of the Actor_θ and Critic_w models; the parameters of the Actor_old_{θ′} model equal the parameters of Actor_θ before the update and remain unchanged during the updating process;

setting the training counts n_actor and n_critic for Actor_θ and Critic_w;
Step 6.2: using all data x_t = <s_t, a_t, r_{t+1}, s_{t+1}> in the database to update the action model of the reinforcement learning control model, including:

Step 6.21: calculating A(s_t, a_t) = r_{t+1} + τ·V_w(s_{t+1}) − V_w(s_t),

where V_w(s_{t+1}) is the evaluation result output by the evaluation model Critic_w at time t+1; V_w(s_t) is the evaluation result output by the evaluation model at time t; τ is a discount factor; A(s_t, a_t) is the advantage of implementing signal control scheme a_t under road traffic state information s_t;

Step 6.22: calculating the gradient of the Actor_θ model, in the PPO2 clipped form (with clipping constant ε):

∇_θ J(θ) = E_{(s_t,a_t)~π_{θ′}} [ ∇_θ min( (P_θ(a_t|s_t)/P_{θ′}(a_t|s_t))·A(s_t,a_t), clip(P_θ(a_t|s_t)/P_{θ′}(a_t|s_t), 1−ε, 1+ε)·A(s_t,a_t) ) ],

where E denotes mathematical expectation; (s_t, a_t) ~ π_{θ′} indicates that the data used are obtained by the Actor_old_{θ′} model; P_θ(a_t|s_t) and P_{θ′}(a_t|s_t) are the probabilities that the action models Actor_θ and Actor_old_{θ′} implement signal control scheme a_t under road traffic state information s_t; ∇_θ denotes derivation with respect to the parameter θ;

Step 6.23: updating the parameter θ according to the Adam optimization method;

Step 6.24: repeating Steps 6.22-6.23 n_actor times;
Step 6.3: using all data x_t = <s_t, a_t, r_{t+1}, s_{t+1}> in the database to update the evaluation model of the reinforcement learning control model, including:

Step 6.31: calculating A(s_t, a_t) = r_{t+1} + τ·V_{w′}(s_{t+1}) − V_w(s_t),

where V_{w′}(s_{t+1}) is the output of the evaluation model Critic_old_{w′};

Step 6.32: calculating the gradient of the evaluation model Critic_w,

∇_w L(w) = ∇_w E[ A(s_t, a_t)² ],

where ∇_w denotes derivation with respect to the parameter w;

Step 6.33: updating the parameter w according to the Adam optimization method;

Step 6.34: repeating Steps 6.31-6.33 n_critic times;

Step 6.4: emptying all data in the database.
Beneficial effects: compared with the prior art, the technical scheme of the invention has the following beneficial technical effects:
the intersection and the adjacent intersections in the target area are taken as monitoring objects, and regional vehicle planning path data under the Internet of vehicles environment are considered in the state variables, so that the road traffic state is represented more comprehensively; constructing a regional signal control reinforcement learning model by combining a PPO algorithm and an LSTM algorithm; the distance factor is provided to measure the distance between the reinforcement learning model and the control scheme generated by the traditional queuing length priority strategy, so that the influence of the untrained mature model on road traffic safety and traffic efficiency due to the generation of an improper signal control scheme in the online learning process can be effectively avoided.
Drawings
FIG. 1 is a schematic diagram of modeling a lane condition, under an embodiment;
FIG. 2 is a diagram of a road traffic condition result for a lane under one embodiment;
FIG. 3 is a schematic illustration of an intersection in one embodiment;
FIG. 4 is a schematic structural diagram of a PPO2 model according to an embodiment;
FIG. 5 is a logic flow diagram of a method of the present invention in one embodiment.
Detailed Description
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Referring to fig. 5, the method of the present invention is further illustrated by taking an intersection as an example. The invention relates to a reinforcement learning area signal control method based on a vehicle planned path, which specifically comprises the following steps:
(1) designing a control framework of an intelligent agent in regional traffic signal control, modeling a road traffic state, and comprising the following steps:
under the environment of the Internet of vehicles, the regional path and the vehicle position information of the vehicle are fully utilized, so that the road traffic state is more comprehensively grasped and analyzed.
Specifically, each intersection is used as an independent agent, the intersection and the entrance lane of the adjacent intersection are used as observation ranges, and planning path information and vehicle position information of vehicles in the ranges are collected.
The area vehicle planned-path matrix is defined as Distribution_{m×n×4}; each row corresponds to one lane, and the lanes within the agent's monitoring range are divided into cells of 1 metre; if a vehicle exists in the k-th cell of lane i at time t, the four planned link numbers of the vehicle after time t are stored in Distribution(i,k,1), Distribution(i,k,2), Distribution(i,k,3) and Distribution(i,k,4) respectively;

the vehicle position matrix is defined as Pos_{m×n×1}; each row of the matrix corresponds to a lane within the monitoring range, with 1-metre cells, and Pos(i,k) is 1 if a vehicle exists in the k-th cell of lane i at time t;
I_{m×1} is defined as the lane-to-link correspondence vector; I_i is the link number of the link on which lane i is located;

G_{m×1} is defined as the green time vector; G_i is the remaining green time of lane i in the current cycle at time t.

The traffic environment state s is the set formed by Distribution_{m×n×4}, Pos_{m×n×1}, I_{m×1} and G_{m×1}, from which the overall state and trend of area traffic can be effectively grasped.
Under the present embodiment, referring to FIG. 1, the road traffic state result of lane i — its rows of Distribution_{m×n×4}, Pos_{m×n×1}, I_{m×1} and G_{m×1} — is shown in FIG. 2.
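The state encoding described above can be sketched as follows. The matrix layout follows the text directly; the input format (a dictionary of occupied cells with their planned link numbers) is an illustrative assumption.

```python
import numpy as np

def encode_state(m, n, vehicles, lane_links, green_remaining):
    """Build the four state components for m lanes discretised into n 1-metre cells.

    vehicles: dict {(lane_i, cell_k): [up to 4 planned link numbers after time t]}
    lane_links: link number each lane lies on (vector I)
    green_remaining: remaining green time per lane (vector G)
    """
    distribution = np.zeros((m, n, 4), dtype=int)   # planned-path matrix
    pos = np.zeros((m, n), dtype=int)               # occupancy matrix
    for (i, k), links in vehicles.items():
        pos[i, k] = 1
        distribution[i, k, :len(links)] = links[:4]
    I = np.asarray(lane_links)
    G = np.asarray(green_remaining)
    return distribution, pos, I, G
```

Empty cells stay all-zero, so the planned-path matrix is sparse: only occupied cells carry the vehicle's next four link numbers.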
(2) constructing a reinforcement learning control model, defining the input and the output of the model, and comprising the following steps:
referring to fig. 4, the reinforcement learning control model adopts a distributed control mode, and an agent needs to give a phase scheme and a timing scheme of an intersection at the same time; each intersection is used as an independent agent, and a reinforcement learning control model is independently trained; taking all the entrances of the intersection and the adjacent intersections as monitoring ranges, each intelligent agent collects the path information and the position information of all vehicles at the intersection in real time, and simultaneously obtains the path information and the position information of all vehicles at the entrances of the adjacent intersections and the remaining green time of each lane from the adjacent intersections as the input of a PPO2 algorithm.
The PPO2 model comprises an action model Actor and an evaluation model Critic, and the action model outputs a signal control scheme a; the evaluation model outputs an evaluation v of the signal control scheme a.
To improve training efficiency, Actor and Critic share the underlying input layers. Meanwhile, the PPO2 algorithm is combined with a long short-term memory (LSTM) model to strengthen the model's memory of historical states, so that the agent can make more reasonable decisions.
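The shared-bottom-layer Actor-Critic design can be sketched as below. All sizes and weights are illustrative assumptions, and the LSTM is reduced to a plain one-step recurrent trunk for brevity; the point is the structure: one recurrent trunk feeding both an actor head (phase distribution a) and a critic head (evaluation v).

```python
import numpy as np

rng = np.random.default_rng(0)

class SharedActorCritic:
    """Actor and Critic sharing bottom layers; a one-step recurrent trunk
    stands in for the LSTM described in the text (illustrative sizes)."""
    def __init__(self, state_dim, hidden_dim, n_phases):
        self.W_in = rng.normal(0, 0.1, (hidden_dim, state_dim))
        self.W_rec = rng.normal(0, 0.1, (hidden_dim, hidden_dim))
        self.W_actor = rng.normal(0, 0.1, (n_phases, hidden_dim))
        self.W_critic = rng.normal(0, 0.1, (1, hidden_dim))
        self.h = np.zeros(hidden_dim)       # recurrent state: memory of history

    def forward(self, s):
        self.h = np.tanh(self.W_in @ s + self.W_rec @ self.h)   # shared trunk
        logits = self.W_actor @ self.h                          # actor head
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                # softmax over release phases
        v = float(self.W_critic @ self.h)   # critic head: evaluation of the state
        return probs, v
```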
(3) The agent interacts with the road traffic environment; specifically, at time t the agent reads the road traffic state s_t and, according to the PPO2 control model, outputs the signal scheme a_t^{RL} for time t, giving the release phase and green time at the intersection.
The phase set is defined as the combinations of all non-conflicting traffic flows at the intersection; for example, for a typical crossroad with an independent entrance lane for each flow direction, the action set is defined as {north-south through, north-south left turn, east-west through, east-west left turn, south approach through-plus-left, north approach through-plus-left, east approach through-plus-left, west approach through-plus-left}, and the duration of each signal phase execution is not fixed.
The signal control scheme a_t^{RL} generated by the agent at time t is not implemented directly; instead, the signal control scheme a_t^{queue} generated by the queue-length-priority strategy is used with it to calculate the distance factor γ. The queue-length-priority strategy means that the intersection always gives green time priority to the phase with the longest queue; the distance factor γ measures the gap between the scheme a_t^{RL} output by the PPO2 control model and the scheme a_t^{queue} generated by the queue-length-priority strategy.

When γ is larger than the set threshold, the intersection signal control scheme generated by the queue-length-priority strategy is actually implemented at the intersection at time t, i.e. a_t = a_t^{queue}; otherwise, the scheme obtained by the PPO2 control model is implemented, i.e. a_t = a_t^{RL}. After action a_t is implemented, the traffic environment enters the next state s_{t+1}.
Under the present embodiment, take as an example a signal control scheme a_t^{RL} generated by the PPO2 control model and a signal control scheme a_t^{queue} generated by the queue-length-priority strategy.

With the threshold σ = 6, the intersection signal control scheme obtained by the PPO2 control model is implemented at the intersection, i.e. a_t = a_t^{RL}; after conflict-free processing of a_t, a_t = [15,0,0,0,0,0,0,0].
(4) Designing the reward r of the interaction between the agent and the environment, with the goals of shortening intersection queues, reducing vehicle delay and avoiding downstream congestion; the reward is defined from the entrance-lane queue lengths weighted by the first-vehicle waiting times and the exit-lane queue-spillover indicators, i.e. reward = −(first-vehicle waiting time w · entrance-lane queue length q + δ · exit-lane queue indicator f):

in the formula, l_in and l_out are the sets of entrance and exit lanes of the intersection respectively; w_i and q_i are the first-vehicle waiting time and queue length of lane i respectively; f_j is a Boolean variable measuring whether the queue length of exit lane j exceeds three quarters of the link length L_j: if q_j > 3L_j/4 then f_j = 1, otherwise f_j = 0; L_j is the link length of lane j; q_j is the queue length of lane j; δ is a penalty factor.
In the present embodiment, with reference to figure 3,
the queue length q of each entrance lane of the intersection is [20,14,32,20,15,24,20,15,20,26,18,18,12,30 ];
the waiting time w of the first vehicle at each entrance lane is [25,25,15,15,0,0,36,25,25,15,15,0,0,36 ];
the queue length p of each exit lane is [5,22,12,14,118,34,12,18,18,10,5,24,5,13], which converts to the Boolean variable f = [0,0,0,0,1,0,0,0,0,0,0,0,0,0];
taking δ as 100 and converting w into the unit of hour, the following is obtained:
r≈-16.83-100*1=-116.83
(5) Storing the data of the interaction between the agent and the environment into a replay-buffer database in the form <s_t, a_t, r_{t+1}, s_{t+1}>; when the data accumulated in the database reach the set size Z, the PPO2 model parameters are updated as follows:
(5.1) Initializing the reinforcement learning control model parameters, including:

initializing the values of the hyper-parameters: the learning rate α = 0.0001, the threshold of the distance factor σ = 6, the penalty factor δ = 100, and Z = 512;

assigning initial values to the parameters of the action model Actor_θ and the evaluation model Critic_w, wherein θ and w are the parameters to be updated of the action model and the evaluation model respectively;

defining Actor_old_{θ′} and Critic_old_{w′} as copies of the Actor_θ and Critic_w models; the parameters of the Actor_old_{θ′} model equal the parameters of Actor_θ before the update and remain unchanged during the updating process;

setting the training counts n_actor = 10 and n_critic = 10 for Actor_θ and Critic_w;
(5.2) Using all data x_t = <s_t, a_t, r_{t+1}, s_{t+1}> in the database to update the action model of the PPO2 control model, including:

(5.21) calculating A(s_t, a_t) = r_{t+1} + τ·V_w(s_{t+1}) − V_w(s_t),

where V_w(s_{t+1}) is the evaluation result output by the evaluation model Critic_w at time t+1; V_w(s_t) is the evaluation result output by the evaluation model at time t; τ is a discount factor; A(s_t, a_t) is the advantage of implementing signal control scheme a_t under road traffic state information s_t;

(5.22) calculating the gradient of the Actor_θ model, in the PPO2 clipped form (with clipping constant ε):

∇_θ J(θ) = E_{(s_t,a_t)~π_{θ′}} [ ∇_θ min( (P_θ(a_t|s_t)/P_{θ′}(a_t|s_t))·A(s_t,a_t), clip(P_θ(a_t|s_t)/P_{θ′}(a_t|s_t), 1−ε, 1+ε)·A(s_t,a_t) ) ],

where E denotes mathematical expectation; (s_t, a_t) ~ π_{θ′} indicates that the data used are obtained by the Actor_old_{θ′} model; P_θ(a_t|s_t) and P_{θ′}(a_t|s_t) are the probabilities that the action models Actor_θ and Actor_old_{θ′} implement signal control scheme a_t under road traffic state information s_t; ∇_θ denotes derivation with respect to the parameter θ;

(5.23) updating the parameter θ according to the Adam optimization method;

(5.24) repeating steps (5.22)-(5.23) 10 times;
(5.3) Using all data x_t = <s_t, a_t, r_{t+1}, s_{t+1}> in the database to update the evaluation model of the PPO2 control model, including:

(5.31) calculating A(s_t, a_t) = r_{t+1} + τ·V_{w′}(s_{t+1}) − V_w(s_t),

where V_{w′}(s_{t+1}) is the output of the evaluation model Critic_old_{w′};

(5.32) calculating the gradient of the evaluation model Critic_w,

∇_w L(w) = ∇_w E[ A(s_t, a_t)² ],

where ∇_w denotes derivation with respect to the parameter w;

(5.33) updating the parameter w according to the Adam optimization method;

(5.34) repeating steps (5.31)-(5.33) 10 times;

(5.4) Emptying the replay buffer and repeating steps (3)-(5).
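The numerical core of the update in step (5) can be sketched as follows: the advantage estimate of (5.21)/(5.31), the PPO2 clipped surrogate whose gradient drives the actor in (5.22), and the squared-advantage critic loss of (5.32). The clipping constant ε = 0.2 is an assumed value (the embodiment does not state it); gradients and the Adam steps themselves are left to an autodiff framework.

```python
import numpy as np

def advantage(r_next, v_next, v_now, tau=0.99):
    """A(s_t, a_t) = r_{t+1} + tau * V(s_{t+1}) - V(s_t)."""
    return r_next + tau * v_next - v_now

def ppo2_surrogate(p_new, p_old, adv, eps=0.2):
    """PPO2 clipped objective (standard form; eps is an assumed value):
    mean of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    ratio = p_new / p_old
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return float(np.mean(np.minimum(ratio * adv, clipped * adv)))

def critic_loss(r_next, v_next_old, v_now, tau=0.99):
    """Squared advantage computed against the frozen critic's V(s_{t+1})."""
    return float(np.mean(advantage(r_next, v_next_old, v_now, tau) ** 2))
```

The clip keeps an advantageous action from pushing the new policy more than a factor of 1+ε away from the old one in a single batch of the 10 inner updates.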
Claims (4)
1. A reinforcement learning area signal control method based on a vehicle planned path is characterized by comprising the following steps:
step 1, designing a control framework of an intelligent agent in traffic signal control of a target area, and modeling a road traffic state, wherein the control framework comprises the following steps: taking each intersection in the target area as an independent intelligent agent, and constructing a respective corresponding reinforcement learning control model and a database for each independent intelligent agent;
step 2: enabling an independent intelligent agent at the intersection to interact with the environment of the intersection, and collecting road traffic state information within a certain range of the intersection in real time; the certain range includes the intersection and an entrance lane of an adjacent intersection;
the road traffic state information is the set formed by a vehicle planned-path matrix, a vehicle position matrix, a lane-to-link correspondence vector and a green time vector;

the vehicle planned-path matrix is denoted Distribution_{m×n×4}, in which each row corresponds to one lane, the lanes within the agent's monitoring range are divided into cells of 1 metre, and each column corresponds to one cell; if a vehicle exists in the k-th cell of lane i at time t, the four planned link numbers the vehicle may pass after time t are stored in Distribution(i,k,1), Distribution(i,k,2), Distribution(i,k,3) and Distribution(i,k,4) respectively;

the vehicle position matrix is denoted Pos_{m×n×1}, in which each row corresponds to a lane within the agent's monitoring range and each column corresponds to one cell; Pos(i,k) is 1 if a vehicle exists in the k-th cell of lane i at time t, and 0 otherwise;

the lane-to-link correspondence vector is denoted I_{m×1}; I_i is the link number of the link on which lane i is located;

the green time vector is denoted G_{m×1}; G_i is the remaining green time of lane i in the current cycle at time t;
step 3: taking the road traffic state information of the intersection at the current moment as input of the reinforcement learning control model corresponding to the intersection, to obtain the intersection signal control scheme for the next moment and an evaluation result of that scheme; the signal control scheme comprises a release phase and a green time;

step 4: generating, from the road traffic state information of the intersection at the current moment, an intersection signal control scheme for the next moment with a queue-length-priority strategy;

step 5: calculating a distance factor from the intersection signal control scheme obtained by the reinforcement learning control model and the one generated by the queue-length-priority strategy; if the calculated distance factor is larger than the set distance threshold, implementing at the intersection the scheme generated by the queue-length-priority strategy; otherwise, implementing at the intersection the scheme obtained by the reinforcement learning control model;
the calculation formula of the distance factor is as follows:
where γ is the distance factor, and its two arguments are the intersection signal control scheme obtained by the reinforcement learning control model and the signal control scheme generated by the queue-length priority strategy;
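The claim's distance formula itself is not reproduced in this text, so the sketch below assumes a Euclidean distance between the two schemes encoded as (phase, green time) tuples; both that encoding and the function name are hypothetical:

```python
import math

def select_scheme(a_rl, a_ql, sigma):
    """Step-5 fallback: pick the queue-length-priority scheme when the
    RL scheme strays too far from it.

    a_rl, a_ql: schemes encoded as (phase_id, green_time) tuples.
    sigma: the set distance threshold.
    The Euclidean distance used here is an assumption, not the
    patent's formula.
    """
    gamma = math.dist(a_rl, a_ql)   # assumed distance factor
    return a_ql if gamma > sigma else a_rl
```

With this encoding, a small green-time disagreement keeps the RL scheme, while a phase change (a large jump in the tuple) triggers the fallback.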
and step 6, storing, in real time, the road traffic state information collected by the intersection agents in the target area, the signal control scheme implemented at each intersection, and the reward from each intersection agent's interaction with the environment into the database corresponding to each intersection; when the data accumulated in an intersection's database reaches a set size, updating the parameters of the reinforcement learning control model corresponding to that intersection, emptying all data in the database after the update, and returning to step 2.
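The accumulate-then-update-then-empty cycle of step 6 can be sketched as a per-intersection buffer; the class and method names below are illustrative, not from the patent:

```python
class IntersectionBuffer:
    """Per-intersection experience store for step 6 (names illustrative)."""

    def __init__(self, capacity):
        self.capacity = capacity    # the claim's "set size"
        self.data = []

    def add(self, s_t, a_t, r_next, s_next):
        # one <s_t, a_t, r_{t+1}, s_{t+1}> transition per control step
        self.data.append((s_t, a_t, r_next, s_next))

    def ready(self):
        # True once the stored data reaches the set size
        return len(self.data) >= self.capacity

    def drain(self):
        # hand the batch to the model update, then empty the database
        batch, self.data = self.data, []
        return batch
```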
2. The reinforcement learning area signal control method based on the vehicle planned path as claimed in claim 1, wherein, in step 6, the road traffic state information collected by the intersection agents in the target area, the signal control scheme of each intersection, and the reward from each intersection agent's interaction with the environment are stored in the database of each intersection in the form ⟨s_t, a_t, r_{t+1}, s_{t+1}⟩, where s_t is the road traffic state information collected by the intersection agent at time t; a_t is the signal control scheme implemented at the intersection at time t; r_{t+1} is the reward from the intersection agent's interaction with the environment at time t+1; and s_{t+1} is the road traffic state information collected by the intersection agent at time t+1.
3. The reinforcement learning area signal control method based on the vehicle planned path according to claim 1, wherein the reward for the interaction between the intersection agent and the environment is calculated from the first-vehicle waiting time of the intersection entry lanes, the queue lengths of the entry lanes, and the queue lengths of the exit lanes, specifically:
where r_{t+1} is the reward for the interaction between the intersection agent and the environment at time t+1; l_in and l_out are the sets of entry lanes and exit lanes of the intersection, respectively; w_i and q_i are the first-vehicle waiting time and queue length of lane i, respectively; f_j is a Boolean variable indicating whether the queue on exit lane j exceeds three quarters of the link length L_j: f_j = 1 if q_j > (3/4)·L_j, otherwise f_j = 0; L_j is the link length of lane j; q_j is the queue length of lane j; and δ is a penalty factor.
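The claim defines the reward's ingredients but its combining formula is not reproduced in this text; the sketch below assumes a negative weighted sum of entry-lane delay and queue terms, penalised by the spill-back flags f_j, purely for illustration:

```python
def reward(entry_lanes, exit_lanes, delta):
    """Hedged reward sketch for claim 3.

    entry_lanes: list of (w_i, q_i) pairs, i.e. first-vehicle waiting
      time and queue length per entry lane.
    exit_lanes: list of (q_j, L_j) pairs, i.e. queue length and link
      length per exit lane.
    delta: penalty factor for exit-lane spill-back.
    The negative weighted sum used here is an assumption, not the
    patent's formula.
    """
    # f_j = 1 when the exit-lane queue exceeds 3/4 of the link length
    f = [1 if q_j > 0.75 * L_j else 0 for q_j, L_j in exit_lanes]
    base = sum(w + q for w, q in entry_lanes)
    return -(base + delta * sum(f))
```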
4. The reinforcement learning area signal control method based on the vehicle planned path according to claim 2, wherein in step 6, when the data information stored in the intersection database is accumulated to a set size, the reinforcement learning control model parameters corresponding to the intersection are updated, and all data in the database are emptied after the update is completed, and the method comprises:
step 6.1, initializing reinforcement learning control model parameters, including:
initializing the values of the hyper-parameters, including the learning rate α, the distance-factor threshold σ, and the penalty factor δ;
assigning initial values to the parameters of the action model Actor_θ and the evaluation model Critic_w, where θ and w are the parameters of the action model and the evaluation model to be updated, respectively;
defining Actor_old_{θ′} and Critic_old_{w′} as copies of the Actor_θ and Critic_w models; the parameters of Actor_old_{θ′} equal the parameters of Actor_θ before the update and remain unchanged during the update process;
setting the training counts n_actor and n_critic for Actor_θ and Critic_w;
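Step 6.1 can be sketched as below; linear parameter tables stand in for whatever networks the patent's models actually use, and all names and default values are hypothetical:

```python
import copy
import random

def init_models(state_dim, action_dim, alpha=3e-4, sigma=5.0, delta=0.5):
    """Step 6.1 sketch: hyper-parameters, actor/critic parameters,
    and frozen '_old' copies kept fixed during the update."""
    actor = {'theta': [[random.gauss(0, 0.1) for _ in range(action_dim)]
                       for _ in range(state_dim)]}
    critic = {'w': [random.gauss(0, 0.1) for _ in range(state_dim)]}
    actor_old = copy.deepcopy(actor)     # Actor_old_{theta'}: not updated
    critic_old = copy.deepcopy(critic)   # Critic_old_{w'}: not updated
    hyper = {'alpha': alpha, 'sigma': sigma, 'delta': delta,
             'n_actor': 10, 'n_critic': 10}   # training counts assumed
    return actor, actor_old, critic, critic_old, hyper
```

The deep copies matter: the old models must hold the pre-update parameters while θ and w change over the n_actor and n_critic inner iterations.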
step 6.2, updating the action model in the reinforcement learning control model with all data x_t = ⟨s_t, a_t, r_{t+1}, s_{t+1}⟩ in the database, comprising:
step 6.21, calculating A(s_t, a_t) = r_{t+1} + τ·V_w(s_{t+1}) − V_w(s_t),
where V_w(s_{t+1}) is the evaluation result output by the evaluation model Critic_w at time t+1; V_w(s_t) is the evaluation result output by the evaluation model at time t; τ is a discount factor; and A(s_t, a_t) is the advantage of implementing signal control scheme a_t under road traffic state s_t;
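The one-step advantage of step 6.21 translates directly to code; the function name is illustrative:

```python
def advantage(r_next, v_next, v_now, tau=0.99):
    """Step 6.21: A(s_t, a_t) = r_{t+1} + tau * V_w(s_{t+1}) - V_w(s_t).

    r_next: reward r_{t+1};  v_next: V_w(s_{t+1});  v_now: V_w(s_t);
    tau: discount factor (value assumed here).
    """
    return r_next + tau * v_next - v_now
```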
step 6.22, calculating the gradient of the Actor_θ model:
where E denotes the mathematical expectation; (s_t, a_t) ~ π_{θ′} indicates that the data were obtained by the Actor_old_{θ′} model; P_θ(a_t|s_t) and P_{θ′}(a_t|s_t) are the probabilities that the action models Actor_θ and Actor_old_{θ′} implement signal control scheme a_t under road traffic state s_t; and ∇_θ denotes the gradient with respect to the parameter θ;
step 6.23, updating the parameter θ with the Adam optimization method;
step 6.24, repeating steps 6.22–6.23 n_actor times;
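The actor-gradient formula of step 6.22 is not reproduced in this text; its definitions (an expectation under Actor_old, the probability ratio P_θ/P_{θ′}, and the advantage) suggest an importance-weighted surrogate objective, which the sketch below assumes. Names and the plain (unclipped) form are hypothetical:

```python
def actor_objective(batch, p_new, p_old):
    """Assumed step-6.22 surrogate whose gradient drives the theta update:
    E_{(s,a)~pi_old}[ (P_theta(a|s) / P_theta'(a|s)) * A(s, a) ].

    batch: list of (s, a, A) samples collected under Actor_old.
    p_new(s, a), p_old(s, a): action probabilities of Actor_theta and
    Actor_old_{theta'}.
    """
    return sum(p_new(s, a) / p_old(s, a) * adv
               for s, a, adv in batch) / len(batch)
```

In practice this objective would be differentiated with respect to θ (autograd) and fed to an Adam step, matching steps 6.22–6.23.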
step 6.3, updating the evaluation model in the reinforcement learning control model with all data x_t = ⟨s_t, a_t, r_{t+1}, s_{t+1}⟩ in the database, comprising:
step 6.31, calculating A(s_t, a_t) = r_{t+1} + τ·V_{w′}(s_{t+1}) − V_w(s_t),
where V_{w′}(s_{t+1}) is the output of the evaluation model Critic_old_{w′};
step 6.32, calculating the gradient of the evaluation model Critic_w:
where ∇_w denotes the gradient with respect to the parameter w;
step 6.33, updating the parameter w with the Adam optimization method;
step 6.34, repeating steps 6.31–6.33 n_critic times;
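The critic-gradient formula of step 6.32 is likewise not reproduced here; since step 6.31 bootstraps the target from Critic_old (V_{w′}), the sketch below assumes a mean-squared-advantage loss, which is a common choice but not confirmed by the claim:

```python
def critic_loss(batch, v, v_old, tau=0.99):
    """Assumed step-6.3 critic loss: mean squared one-step advantage,
    with the target bootstrapped from Critic_old as in step 6.31.

    batch: list of (s_t, r_next, s_next) samples.
    v(s): value estimate of Critic_w;  v_old(s): value estimate of
    the frozen Critic_old_{w'}.
    """
    total = 0.0
    for s_t, r_next, s_next in batch:
        a = r_next + tau * v_old(s_next) - v(s_t)   # A(s_t, a_t)
        total += a * a
    return total / len(batch)
```

Differentiating this loss with respect to w and applying Adam reproduces steps 6.32–6.33.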
and 6.4, emptying all data information in the database.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110534127.1A CN113487902B (en) | 2021-05-17 | 2021-05-17 | Reinforced learning area signal control method based on vehicle planned path |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113487902A CN113487902A (en) | 2021-10-08 |
CN113487902B true CN113487902B (en) | 2022-08-12 |
Family
ID=77933576
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114550470B (en) * | 2022-03-03 | 2023-08-22 | 沈阳化工大学 | Wireless network interconnection intelligent traffic signal lamp |
CN114667852B (en) * | 2022-03-14 | 2023-04-14 | 广西大学 | Hedge trimming robot intelligent cooperative control method based on deep reinforcement learning |
CN116092297B (en) * | 2023-04-07 | 2023-06-27 | 南京航空航天大学 | Edge calculation method and system for low-permeability distributed differential signal control |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
RU2379761C1 (en) * | 2008-07-01 | 2010-01-20 | Государственное образовательное учреждение высшего профессионального образования "Уральский государственный технический университет УПИ имени первого Президента России Б.Н.Ельцина" | Method of controlling road traffic at intersection |
CN105046987A (en) * | 2015-06-17 | 2015-11-11 | 苏州大学 | Pavement traffic signal lamp coordination control method based on reinforcement learning |
CN111915894A (en) * | 2020-08-06 | 2020-11-10 | 北京航空航天大学 | Variable lane and traffic signal cooperative control method based on deep reinforcement learning |
CN112365724A (en) * | 2020-04-13 | 2021-02-12 | 北方工业大学 | Continuous intersection signal cooperative control method based on deep reinforcement learning |
CN112632858A (en) * | 2020-12-23 | 2021-04-09 | 浙江工业大学 | Traffic light signal control method based on Actor-critical frame deep reinforcement learning algorithm |
Non-Patent Citations (2)
Title |
---|
Compatibility-Based Approach for Routing and Scheduling the Demand Responsive Connector; YUNXUE LU, HAO WANG; IEEE Access; 2020-05-26; Vol. 8; 101770-101783 *
Multi-agent-based coordinated control method for urban traffic areas; HUANG Yanguo et al.; Journal of Wuhan University of Technology (Transportation Science & Engineering); 2010-04-15; Vol. 34, No. 02; 197-200 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113487902B (en) | Reinforced learning area signal control method based on vehicle planned path | |
CN111696370A (en) | Traffic light control method based on heuristic deep Q network | |
CN103593535A (en) | Urban traffic complex self-adaptive network parallel simulation system and method based on multi-scale integration | |
CN113436443B (en) | Distributed traffic signal control method based on generation of countermeasure network and reinforcement learning | |
Lin et al. | Traffic signal optimization based on fuzzy control and differential evolution algorithm | |
CN113299078B (en) | Multi-mode traffic trunk line signal coordination control method and device based on multi-agent cooperation | |
CN106781465A (en) | A kind of road traffic Forecasting Methodology | |
Han et al. | Leveraging reinforcement learning for dynamic traffic control: A survey and challenges for field implementation | |
CN113112823A (en) | Urban road network traffic signal control method based on MPC | |
CN116513273A (en) | Train operation scheduling optimization method based on deep reinforcement learning | |
CN113362618B (en) | Multi-mode traffic adaptive signal control method and device based on strategy gradient | |
Tamimi et al. | Intelligent traffic light based on genetic algorithm | |
Shao et al. | Machine learning enabled traffic prediction for speed optimization of connected and autonomous electric vehicles | |
CN115762128B (en) | Deep reinforcement learning traffic signal control method based on self-attention mechanism | |
CN116758765A (en) | Multi-target signal control optimization method suitable for multi-mode traffic | |
CN114627658B (en) | Traffic control method for major special motorcade to pass through expressway | |
CN111311905A (en) | Particle swarm optimization wavelet neural network-based expressway travel time prediction method | |
CN114701517A (en) | Multi-target complex traffic scene automatic driving solution based on reinforcement learning | |
Cenedese et al. | A novel control-oriented cell transmission model including service stations on highways | |
Zhong et al. | Deep Q-Learning Network Model for Optimizing Transit Bus Priority at Multiphase Traffic Signal Controlled Intersection | |
Wei et al. | Intersection signal control approach based on pso and simulation | |
Guo et al. | Network Multi-scale Urban Traffic Control with Mixed Traffic Flow | |
Tan et al. | Optimization of signalized traffic network using swarm intelligence | |
CN116189464B (en) | Cross entropy reinforcement learning variable speed limit control method based on refined return mechanism | |
Miletić et al. | Impact of Connected Vehicles on Learning based Adaptive Traffic Control Systems |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||