CN112614343B - Traffic signal control method and system based on stochastic policy gradient, and electronic device - Google Patents


Info

Publication number
CN112614343B
CN112614343B (application CN202011459044.2A)
Authority
CN
China
Prior art keywords: network, value, time, traffic, phase
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011459044.2A
Other languages
Chinese (zh)
Other versions
CN112614343A (en)
Inventor
郑培余
陶刚
陈波
李志斌
陈冰
杨光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Duolun Technology Corp ltd
Original Assignee
Duolun Technology Corp ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Duolun Technology Corp ltd
Priority to CN202011459044.2A
Publication of CN112614343A
Priority to PCT/CN2021/124593
Application granted
Publication of CN112614343B
Legal status: Active


Classifications

    • G: PHYSICS
    • G08: SIGNALLING
    • G08G: TRAFFIC CONTROL SYSTEMS
    • G08G1/00: Traffic control systems for road vehicles
    • G08G1/01: Detecting movement of traffic to be counted or controlled
    • G08G1/0104: Measuring and analyzing of parameters relative to traffic conditions
    • G08G1/0125: Traffic data processing
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00: Computer-aided design [CAD]
    • G06F30/10: Geometric CAD
    • G06F30/18: Network design, e.g. design based on topological or interconnect aspects of utility systems, piping, heating ventilation air conditioning [HVAC] or cabling
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00: Computer-aided design [CAD]
    • G06F30/20: Design optimisation, verification or simulation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06N3/088: Non-supervised learning, e.g. competitive learning
    • G: PHYSICS
    • G08: SIGNALLING
    • G08G: TRAFFIC CONTROL SYSTEMS
    • G08G1/00: Traffic control systems for road vehicles
    • G08G1/01: Detecting movement of traffic to be counted or controlled
    • G08G1/0104: Measuring and analyzing of parameters relative to traffic conditions
    • G08G1/0137: Measuring and analyzing of parameters relative to traffic conditions for specific applications
    • G: PHYSICS
    • G08: SIGNALLING
    • G08G: TRAFFIC CONTROL SYSTEMS
    • G08G1/00: Traffic control systems for road vehicles
    • G08G1/07: Controlling traffic signals
    • G08G1/08: Controlling traffic signals according to detected number or speed of vehicles
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00: Road transport of goods or passengers
    • Y02T10/10: Internal combustion engine [ICE] based vehicles
    • Y02T10/40: Engine management systems


Abstract

The invention discloses a traffic signal control method, system and electronic device based on a stochastic policy gradient. The method comprises the following steps: obtaining static road network data of at least one controlled signalized intersection; visually drawing a traffic simulation road network from the static road network data; acquiring real-time traffic operation state data of the at least one controlled signalized intersection; calibrating the simulation parameters of the traffic simulation road network against the traffic operation state data to obtain an optimized traffic simulation road network; inputting the traffic state observed from the optimized traffic simulation road network into a value network to obtain an evaluation value for each signal control scheme under that traffic state; inputting the traffic state into a policy network to obtain a probability value for each signal control scheme; and updating the parameters of the policy network through the stochastic policy gradient based on the evaluation values of the signal control schemes under the traffic state and one sampled signal control scheme. The method provided by the invention can alleviate the dimension-explosion problem of signal control.

Description

Traffic signal control method and system based on stochastic policy gradient, and electronic device
Technical Field
The invention relates to the technical field of intelligent transportation, and in particular to a traffic signal control method and system based on a stochastic policy gradient, and an electronic device.
Background
To meet the rapid growth of urban traffic demand, cities must not only build new road infrastructure to raise overall traffic capacity but also improve existing traffic infrastructure, increasing the throughput of existing roads through intelligent traffic management and control technology. As key nodes of the urban road network, intersections are among the research hotspots of urban intelligent traffic control. An urban intelligent traffic control system treats the intersection as a real-time controllable system that is continuously monitored, diagnosed, modeled and controlled in real time. However, a traditional intersection signal control system based on a fixed timing scheme cannot adapt to the nonlinearity, randomness, fuzziness and uncertainty of a traffic system.
An adaptive traffic signal control system can respond in time to dynamic changes in traffic flow and optimize the signal timing scheme for real-time control. However, existing adaptive traffic signal control systems have the following limitations: (1) controlling multiple intersections simultaneously raises the dimension-explosion problem; (2) there is no accurate traffic model framework that represents the dynamics and randomness of traffic flow in response to changes in signal control; (3) detector failures and communication failures greatly affect system stability.
Reinforcement learning is a machine learning method that requires no labeled supervision and can learn a control strategy by interacting directly with a traffic simulation road network. The agent obtains a state by observing the simulation road network and selects an action from the action set based on the policy function; after the action is executed, the simulation road network feeds back a reward signal that evaluates the quality of the chosen action and transitions to the next state, and the agent repeats this process until one episode ends, seeking the maximum cumulative reward. Adaptive traffic signal control based on reinforcement learning can therefore adapt to the dynamics and randomness of a traffic system, with clear advantages over traditional fixed-timing signal control and actuated signal control. However, in traditional reinforcement learning such as Q-learning, actions are selected from the Q values in a Q table, which can only handle a limited number of state-action pairs and cannot cope with a huge state space; an oversized state space causes dimension explosion, resulting in low policy-learning efficiency and low accuracy.
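The agent-environment loop described above can be sketched as follows. `ToyTrafficEnv` is a hypothetical stand-in for the traffic simulation road network, not the simulator used in the patent; the reward is the per-step reduction of the queue, matching the spirit of the evaluation values defined later.

```python
import random

class ToyTrafficEnv:
    """Hypothetical stand-in for the traffic simulation road network."""
    def __init__(self, episode_len=5):
        self.episode_len = episode_len
        self.t = 0
        self.queue = 10

    def reset(self):
        self.t, self.queue = 0, 10
        return self.queue

    def step(self, action):
        # action 0 = keep the current phase, 1 = switch phase;
        # reward = reduction of the queue length at this step
        served = 3 if action == 1 else 1
        prev = self.queue
        self.queue = max(0, self.queue - served + random.randint(0, 2))
        self.t += 1
        reward = prev - self.queue
        done = self.t >= self.episode_len
        return self.queue, reward, done

def run_episode(env, policy, seed=0):
    """One episode: observe state, pick action, collect cumulative reward."""
    random.seed(seed)
    state, total, done = env.reset(), 0, False
    while not done:
        action = policy(state)
        state, reward, done = env.step(action)
        total += reward
    return total
```

Because the reward telescopes, the cumulative reward of an episode equals the initial queue minus the final queue, which is exactly the "maximize cumulative reward = minimize residual queue" intuition of the text.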
Disclosure of Invention
The invention provides a traffic signal control method, system and electronic device based on a stochastic policy gradient, and aims to solve the problems of the prior art in which traditional reinforcement learning is used for traffic signal control: only a limited number of state-action pairs can be handled, a huge state space cannot be processed, and an oversized state space causes dimension explosion, resulting in low policy-learning efficiency and low accuracy.
According to a first aspect, the invention provides a traffic signal control method based on a stochastic policy gradient, comprising the following steps:
obtaining static road network data of at least one controlled signalized intersection;
visually drawing a traffic simulation road network from the static road network data;
acquiring real-time traffic operation state data of the at least one controlled signalized intersection;
calibrating the simulation parameters of the traffic simulation road network against the traffic operation state data to obtain an optimized traffic simulation road network;
inputting the traffic state observed from the optimized traffic simulation road network into a value network to obtain the evaluation value of each signal control scheme under that traffic state, and updating the parameters of the value network with a temporal difference algorithm, the value network being a pre-constructed convolutional neural network used to approximate the action value function;
inputting the traffic state into a policy network to obtain the probability value of each signal control scheme, and randomly sampling a signal control scheme according to those probability values, the policy network being a pre-constructed convolutional neural network used to approximate the policy function;
and updating the parameters of the policy network through the stochastic policy gradient based on the evaluation values of the signal control schemes under the traffic state and the sampled signal control scheme.
Optionally, when the traffic operation state data includes the headway and the vehicle acceleration/deceleration, the step of calibrating the simulation parameters of the traffic simulation road network against the traffic operation state data to obtain the optimized traffic simulation road network comprises:
acquiring the value ranges of the measured headway and vehicle acceleration/deceleration parameters, and preliminarily calibrating the headway and vehicle acceleration/deceleration parameters of the traffic simulation road network within those ranges;
observing the traffic simulation road network to obtain the simulated headway and vehicle acceleration/deceleration, and comparing them with the measured headway and vehicle acceleration/deceleration;
if the differences between the simulated and measured headways and between the simulated and measured accelerations/decelerations are within a preset range, ending the calibration to obtain the optimized traffic simulation road network; otherwise, repeating the above steps until the differences fall within the preset range.
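A minimal sketch of the calibration loop above. For brevity the simulated statistics are nudged directly toward the field values, whereas a real calibration would re-run the micro-simulation after each adjustment; parameter names (`headway`, `accel`) are illustrative.

```python
def calibrate(sim_params, field_stats, bounds, tol, max_iter=50):
    """Iteratively adjust simulation parameters (e.g. headway,
    acceleration/deceleration) toward observed field statistics
    until every difference is within tolerance."""
    params = dict(sim_params)
    # preliminary check: clip to the plausible field value ranges
    for k, (lo, hi) in bounds.items():
        params[k] = min(max(params[k], lo), hi)
    for _ in range(max_iter):
        diffs = {k: field_stats[k] - params[k] for k in params}
        if all(abs(d) <= tol for d in diffs.values()):
            return params, True          # calibration converged
        for k, d in diffs.items():
            params[k] += 0.5 * d         # move halfway toward the field value
    return params, False                 # gave up after max_iter rounds
```

Because each round halves the remaining gap, the loop converges quickly whenever the field values lie inside the clipped ranges.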
Optionally, the static road network data includes some or all of: road grade, number of lanes, lane width, lane function division, road length, road markings, intersection type, adjacent intersection information, signal equipment number, phase information, and phase sequence information;
the traffic operation state data may further include some or all of: device ID, detection time, traffic volume, vehicle type distribution, vehicle time occupancy, vehicle space occupancy, vehicle speed, vehicle length, headway, vehicle spacing, queue length, and number of stops.
Optionally, the traffic state is expressed as the maximum number of queued vehicles in each phase j at each signalized intersection, specifically:
o_t^i = [ max_{l ∈ L_j} q_{t,l} ]_{j = 1, …, n}
where o_t^i denotes the observed traffic state of signalized intersection i at decision time t; i is the index of the signalized intersection, i ∈ {1, 2, …, N}; j is the phase index, j ∈ {1, 2, …, n}; t is the decision time; l is the lane index; L_j denotes the set of lanes that may move during phase j; and q_{t,l} denotes the number of queued vehicles on lane l at decision time t.
The number of queued vehicles on lane l at decision time t equals the number at decision time t−1 plus the vehicles that joined the queue and minus those that left it by decision time t, specifically:
q_{t,l} = q_{t−1,l} + Σ_{v ∈ V_{t,l}} δ_{t,v}
where q_{t−1,l} is the number of queued vehicles on lane l at decision time t−1, V_{t,l} is the set of vehicles that entered lane l by decision time t, and δ_{t,v} indicates whether vehicle v joined or left the queue at decision time t:
δ_{t,v} = +1, if sp_{t,v} ≤ Sp_Thr < sp_{t−1,v} (vehicle v joined the queue);
δ_{t,v} = −1, if sp_{t−1,v} ≤ Sp_Thr < sp_{t,v} (vehicle v left the queue);
δ_{t,v} = 0, otherwise;
where sp_{t−1,v} and sp_{t,v} are the speeds of vehicle v at decision times t−1 and t, and Sp_Thr is the speed threshold used to decide whether a vehicle is queued.
The joint state of several signalized intersections is expressed as the vector of the per-intersection observations, specifically:
s_t = (o_t^1, …, o_t^i, …, o_t^N)
where o_t^i is the observation of the i-th signalized intersection at decision time t.
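The queue update and the per-phase observation above can be sketched as follows; function and variable names are illustrative, not from the patent.

```python
def update_queue(q_prev, speeds_prev, speeds_now, sp_thr):
    """q_{t,l} from q_{t-1,l}: +1 when a vehicle's speed drops to or
    below the threshold (it joins the queue), -1 when it rises above
    the threshold (it leaves the queue)."""
    q = q_prev
    for v, after in speeds_now.items():
        # vehicles unseen before are treated as previously moving
        before = speeds_prev.get(v, sp_thr + 1.0)
        if after <= sp_thr < before:       # joined the queue
            q += 1
        elif before <= sp_thr < after:     # left the queue
            q -= 1
    return q

def observe_intersection(queues_by_lane, lanes_by_phase):
    """o_t^i: for each phase j, the maximum queue over its lane set L_j."""
    return [max(queues_by_lane[l] for l in lanes_by_phase[j])
            for j in sorted(lanes_by_phase)]
```

The joint state s_t is then simply the tuple of these per-intersection observation vectors.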
Optionally, the signal control schemes are divided into fixed-phase-order action selection and variable-phase-order action selection according to whether the order of the phases may change.
For a fixed phase order, the policy network at signalized intersection i has two candidate actions at the decision time t of each iteration: a_t^i = 0 indicates continuing the current phase, and a_t^i = 1 indicates ending the current phase and switching to the next one, specifically:
a_t^i ∈ {0, 1}
For a variable phase order, the policy network at signalized intersection i has n candidate actions at decision time t, one per phase, specifically:
a_t^i ∈ {1, 2, …, n}
If at decision time t the policy network at signalized intersection i decides to continue the current phase, i.e. the action at time t is the same as the action at the previous decision time p(t), the duration of the current phase is extended by m seconds, with m between 1 and 5 seconds, and the next decision time for judging whether to switch the phase is t + m. If at time t it decides to end the current phase and switch to the next one, i.e. the action at time t differs from the action at the previous decision time p(t), a minimum green time G_min of the phase is enforced, followed by the inter-phase yellow time Y, after which the switch to the next phase begins; the next decision time for judging whether to switch the phase is then t + G_min + Y + m.
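Under the decision-timing rule above, the next decision time can be computed as in this small sketch; names are illustrative.

```python
def next_decision_time(t, action, prev_action, m, g_min, y):
    """When the sampled action keeps the current phase, the phase is
    extended by m seconds and the next decision falls at t + m; when
    it switches, a minimum green g_min and a yellow interval y must
    elapse first, so the next decision falls at t + g_min + y + m."""
    if action == prev_action:          # continue the current phase
        return t + m
    return t + g_min + y + m           # switch: min green + yellow first
```

For example, with m = 5 s, G_min = 10 s and Y = 3 s, continuing a phase at t = 100 s schedules the next decision at 105 s, while switching schedules it at 118 s.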
Optionally, the evaluation value function of the value network at signalized intersection i is defined as the reduction of the maximum number of queued vehicles, specifically:
r_t^i = max_l L_{p(t),l} − max_l L_{t,l}
where L_{p(t),l} and L_{t,l} are the numbers of vehicles queued on lane l at decision times p(t) and t, respectively.
Alternatively, the evaluation value function of the value network at signalized intersection i is defined as the reduction of the maximum total delay, specifically:
r_t^i = Σ_v Cd_{p(t),v} − Σ_v Cd_{t,v}
where Cd_{p(t),v} and Cd_{t,v} are the accumulated total delays of the queued vehicles at decision times p(t) and t, respectively.
The joint evaluation value function of the value networks at several signalized intersections is expressed as the coupling of the per-intersection evaluation value functions, specifically:
R_t^i = n · r_t^i + Σ_{j ∈ J(i)} r_t^j
where J(i) is the set of value networks other than the value network of signalized intersection i, and n is a non-negative weighting constant of the joint reward function: when n = 0, the value network of signalized intersection i considers only the evaluation values of the other value networks J(i), and the larger n is, the more the local evaluation value of signalized intersection i dominates.
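A minimal sketch of the local and joint evaluation values above, writing the non-negative weighting constant as `weight` to avoid confusion with the phase count:

```python
def local_reward_queue(max_queue_prev, max_queue_now):
    """r_t^i: reduction of the maximum number of queued vehicles
    between the previous decision time p(t) and time t."""
    return max_queue_prev - max_queue_now

def joint_reward(local, others, weight):
    """R_t^i = weight * r_t^i + sum of the other intersections'
    rewards; weight = 0 ignores the local term entirely, and a
    large weight makes the local term dominate."""
    return weight * local + sum(others)
```

For instance, a local reward of 4 with neighbouring rewards [1, 2] gives a purely cooperative joint reward of 3 at weight 0, and a locally dominated 43 at weight 10.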
Optionally, the value network is a twin delayed deep Q network comprising an action value network Q(s, a; ω) for selecting actions and a target value network Q(s, a; ω′) for computing the target Q value, where the parameters ω = [ω_1, …, ω_i, …, ω_N] are the N parameters of the action value network and the parameters ω′ = [ω′_1, …, ω′_i, …, ω′_N] are the N parameters of the target value network.
Training the value network and the policy network comprises the following steps:
1) input the reinforcement learning hyperparameters: experience pool capacity max_size, minibatch size batch_size, discount rate γ, action value network learning rate α, target value network learning rate β, policy network learning rate η, and termination iteration number N;
2) initialize the contents of the experience pool E, the parameters ω of the action value network Q(s, a; ω), the parameters ω′ of the target value network Q(s, a; ω′), and the parameters θ of the policy network π(a | s; θ);
3) obtain the joint traffic state s_t formed by the observations o_t^i of each signalized intersection i at time t, together with the current phase current_phase;
4) while the iteration count is less than the termination iteration number N, execute the following steps:
41) compute the probability distribution with the policy network π(a | s_t; θ), and randomly sample a signal control scheme a_t^i according to that distribution;
42) if the signal control scheme a_t^i keeps the current phase current_phase, extend the current phase by m seconds; if it does not keep the current phase, release the minimum green time G_min of the current phase, let the inter-phase yellow time Y elapse, and then switch to the j-th phase;
43) compute the evaluation value r_t^i of the value network at each signalized intersection i, build the joint evaluation value R_t^i, and compute the joint traffic state s_{t+1} formed by the observations o_{t+1}^i of each signalized intersection i at time t + 1;
44) if the experience pool E has reached its maximum capacity max_size, remove the oldest experiences from it; then put the experience (s_t, a_t^i, R_t^i, s_{t+1}) into the experience pool E;
45) when the experience pool holds more than batch_size experiences, execute the following steps:
451) randomly sample a minibatch from the experience pool E according to the experience priority values;
452) for each minibatch experience sample (s_t, a_t^i, R_t^i, s_{t+1}), compute the value of the action value network Q(s_t, a_t^i; ω_i) and the value of the target value network, y_t = R_t^i + γ max_a Q(s_{t+1}, a; ω′_i), at signalized intersection i, and obtain the baseline value b;
453) compute the loss L(ω) = (1 / batch_size) Σ (y_t − Q(s_t, a_t^i; ω_i))² and minimize it with the Adam optimizer's gradient descent to update the parameters ω;
454) update the parameters ω′ of the target value network by the soft update ω′ = βω′ + (1 − β)ω;
455) for each minibatch experience sample, compute the stochastic policy gradient of the policy network π(a | s; θ) with a Monte Carlo approximation, g(θ) ≈ Σ (Q(s_t, a_t^i; ω_i) − b) ∇_θ log π(a_t^i | s_t; θ), and update the parameters θ by gradient ascent, θ ← θ + η g(θ);
46) assign the traffic state s_{t+1} at time t + 1 to s_t and repeat from step 41).
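Steps 41) to 455) can be sketched with a tabular actor-critic. This toy version replaces the convolutional networks with lookup tables and leaves the traffic environment and the experience pool abstract, so it only illustrates the update rules (a TD target from a target network, a soft target update, and a policy gradient weighted by Q minus a baseline), not the patented system itself.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                     # numerical stability
    e = np.exp(z)
    return e / e.sum()

class ActorCritic:
    """Tabular sketch of the training loop: sample an action from the
    policy, update the critic toward a TD target, soft-update a target
    critic, and ascend (Q - baseline) * grad log pi for the actor.
    Hyperparameter names (gamma, alpha, beta, eta) mirror the text."""
    def __init__(self, n_states, n_actions, gamma=0.9,
                 alpha=0.1, beta=0.99, eta=0.05, seed=0):
        self.q = np.zeros((n_states, n_actions))          # action value net
        self.q_target = np.zeros((n_states, n_actions))   # target value net
        self.theta = np.zeros((n_states, n_actions))      # policy parameters
        self.gamma, self.alpha, self.beta, self.eta = gamma, alpha, beta, eta
        self.rng = np.random.default_rng(seed)

    def act(self, s):
        # step 41): sample a scheme from the policy's distribution
        p = softmax(self.theta[s])
        return int(self.rng.choice(len(p), p=p))

    def learn(self, s, a, r, s2):
        # steps 452)-453): TD target from the target network, then
        # move Q(s, a) toward it
        y = r + self.gamma * self.q_target[s2].max()
        self.q[s, a] += self.alpha * (y - self.q[s, a])
        # step 454): soft update of the target network
        self.q_target = self.beta * self.q_target + (1 - self.beta) * self.q
        # step 455): policy gradient with a value baseline;
        # grad log softmax = one_hot(a) - p
        p = softmax(self.theta[s])
        baseline = float(p @ self.q[s])
        advantage = self.q[s, a] - baseline
        grad = -p * advantage
        grad[a] += advantage
        self.theta[s] += self.eta * grad
```

Training this sketch on a state where one action is consistently rewarded drives both the critic values and the policy probabilities toward that action.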
According to a second aspect, the invention provides a traffic signal control system based on a stochastic policy gradient, comprising:
a first data acquisition module, for obtaining static road network data of at least one controlled signalized intersection;
a simulation drawing module, for visually drawing a traffic simulation road network from the static road network data;
a second data acquisition module, for acquiring real-time traffic operation state data of the at least one controlled signalized intersection;
a simulation calibration module, for calibrating the simulation parameters of the traffic simulation road network against the traffic operation state data to obtain an optimized traffic simulation road network;
an action evaluation module, for inputting the traffic state observed from the optimized traffic simulation road network into the value network to obtain the evaluation value of each signal control scheme under that traffic state, and updating the parameters of the value network with a temporal difference algorithm, the value network being a pre-constructed neural network used to approximate the action value function;
an action sampling module, for inputting the traffic state into the policy network to obtain the probability value of each signal control scheme and randomly sampling a signal control scheme according to those probability values, the policy network being a pre-constructed neural network used to approximate the policy function;
and a signal control module, for updating the parameters of the policy network through the stochastic policy gradient based on the evaluation values of the signal control schemes under the traffic state and the sampled signal control scheme.
According to a third aspect, the invention provides an electronic device comprising a memory and a processor communicatively connected to each other, wherein the memory stores computer instructions and the processor executes those instructions so as to perform the traffic signal control method based on a stochastic policy gradient of the first aspect.
According to a fourth aspect, the invention provides a computer-readable storage medium storing computer instructions for causing a computer to execute the traffic signal control method based on a stochastic policy gradient of the first aspect.
The beneficial effects of the invention are as follows:
1. Compared with the prior art, which parameterizes the Q value function, parameterizing the policy function is simpler, converges better, learns more efficiently and accurately, and generally avoids the dimension-explosion problem.
2. The traffic signal control method based on the stochastic policy gradient can adapt to the nonlinearity, randomness, fuzziness and uncertainty of a traffic system by continuously monitoring, diagnosing, modeling and controlling the intersection in real time.
3. By adopting a convolutional neural network from deep learning, the method copes with very large raw traffic data and traffic states: the convolutional neural network takes raw high-dimensional data as input, combines low-level features into more abstract high-level features, captures hidden features in the high-dimensional traffic state, and can control directly from the high-dimensional input, improving the feature representation of the state input matrix and strengthening the method's generalization across different traffic states.
4. Compared with traditional fixed-time control and actuated control, the traffic signal control method based on the stochastic policy gradient can respond in time to dynamic changes in traffic flow and optimize the signal timing scheme for real-time control, ultimately reducing travel delay in the road network and improving its traffic efficiency.
Drawings
FIG. 1 is a flow chart of the traffic signal control method based on a stochastic policy gradient provided by the present invention;
FIG. 2 is a schematic illustration of an exemplary traffic network in an embodiment of the present invention;
FIG. 3 is a flowchart illustrating the detailed sub-steps of step S400 in FIG. 1;
FIG. 4 is a functional block diagram of the traffic signal control system based on a stochastic policy gradient provided by the present invention;
FIG. 5 is a schematic diagram of the hardware structure of an electronic device according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the description of the present invention, it should be noted that the terms "first", "second", and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
Example 1
Fig. 1 shows a flow chart of a traffic signal control method based on a stochastic policy gradient according to an embodiment of the present invention. As shown in fig. 1, the method may include the following steps:
step S100: and acquiring static road network data of at least one control signalized intersection.
In an embodiment of the present invention, the static road network data includes part or all of road grade, number of lanes, lane width, lane function division, road length, road marking, intersection type, adjacent intersection information, signal device number, phase information, and phase sequence information.
Step S200: and visually drawing a traffic simulation road network according to the static road network data.
In the embodiment of the invention, microscopic traffic simulation software, such as SUMO, can be used for drawing a traffic simulation road network.
In an embodiment of the invention, a traffic simulation road network comprises at least one control signalized intersection. Specifically, as shown in fig. 2, reference numerals 1 to 9 in the drawing denote 9 signal intersections, and the whole shown in fig. 2 is a traffic network.
Step S300: and acquiring real-time traffic running state data of at least one control signalized intersection.
In the embodiment of the present invention, the traffic operation state data may further include some or all of: device ID, detection time, traffic flow, vehicle type distribution, vehicle time occupancy, headway, vehicle acceleration and deceleration, vehicle space occupancy, vehicle speed, vehicle length, vehicle spacing, queue length, and number of stops.
Step S400: and performing parameter checking on simulation parameters in the traffic simulation road network according to the traffic operation state data to obtain the optimized traffic simulation road network.
In the embodiment of the present invention, one or more items of the traffic operation state data may be used to calibrate the simulation parameters of the traffic simulation road network; the example here uses the headway and the vehicle acceleration/deceleration. As shown in fig. 3, step S400 may include the following steps:
step S401: and acquiring the value ranges of the actual headway time distance parameter and the vehicle acceleration and deceleration parameter, and preliminarily checking the headway time distance parameter and the vehicle acceleration and deceleration parameter in the traffic simulation road network according to the value ranges.
Step S402: and observing the traffic simulation road network, acquiring the simulation headway and the vehicle acceleration and deceleration, and comparing and analyzing the simulation headway and the vehicle acceleration and deceleration with the actual headway and the actual acceleration and deceleration.
Step S403: and if the difference between the simulated headway and the vehicle acceleration and deceleration and the difference between the actual headway and the vehicle acceleration and deceleration are within a preset range, finishing parameter checking to obtain the optimized traffic simulation road network.
And if the difference between the simulated headway and the vehicle acceleration and deceleration and the difference between the actual headway and the vehicle acceleration and deceleration are not within the preset range, repeating the steps S401 to S403 until the difference is within the preset range.
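The calibration loop of steps S401 to S403 can be sketched as follows. The proportional correction rule, the statistic names, and the tolerance values are illustrative assumptions and are not prescribed by the embodiment:

```python
def within_tolerance(simulated, observed, tol):
    """True when the simulated statistic differs from the field-observed
    one by no more than the preset range."""
    return abs(simulated - observed) <= tol

def calibrate(run_simulation, params, observed, tols, max_rounds=50):
    """Repeat steps S401-S403: run the simulation, compare the simulated
    headway/acceleration statistics with field data, and nudge the
    parameters until every difference falls inside its preset range."""
    for _ in range(max_rounds):
        simulated = run_simulation(params)
        if all(within_tolerance(simulated[k], observed[k], tols[k])
               for k in observed):
            break
        for k in observed:  # simple proportional correction (assumed)
            params[k] -= 0.5 * (simulated[k] - observed[k])
    return params
```

With an identity simulator this converges geometrically toward the observed statistics; a real simulator would re-run the traffic model each round.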
Step S500: and inputting the traffic state obtained by observing the optimized traffic simulation road network into the value network to obtain the evaluation value of each signal control scheme under the traffic state, and updating the parameters of the value network by adopting a time difference algorithm. In the embodiment of the invention, the value network is a pre-constructed convolutional neural network and is used for approximating the action value function.
In embodiments of the invention, the value network may be a convolutional neural network used only for approximating the action value function, comprising an input layer, convolutional layers, fully connected layers, and an output layer. It may also be a double-delay deep Q network, including an action value network Q_ω(s, a) for selecting actions and a target value network Q_ω′(s, a) for calculating the Q value, where the parameter ω = [ω_1, …, ω_i, …, ω_N] denotes the N parameters of the action value network, and the parameter ω′ = [ω′_1, …, ω′_i, …, ω′_N] denotes the N parameters of the target value network.
Step S600: and inputting the traffic state into a policy network to obtain the probability value of each signal control scheme, and randomly sampling according to the probability value of each signal control scheme to obtain a signal control scheme. In the embodiment of the invention, the strategy network is a pre-constructed convolutional neural network and is used for approximating the strategy function. In an embodiment of the present invention, a policy network includes an input layer, a convolutional layer, a fully-connected layer, and an output layer.
Step S700: and updating parameters of the strategy network through random strategy gradients based on the evaluation value of each signal control scheme in the traffic state and one signal control scheme.
In the embodiment of the present invention, the traffic state is represented as the maximum number of queued vehicles for each phase j at each signalized intersection, specifically:

o_t^i = [ max_{l ∈ L_j} q_{t,l} ], j = 1, …, n

where o_t^i represents the observed value of the traffic state of signalized intersection i at decision time t; i is the number of each signalized intersection, i ∈ {1, 2, …, N}; j is the phase number, j ∈ {1, 2, …, n}; t is the decision time; l is the lane number; L_j is the set of lanes that may pass during phase j; and q_{t,l} is the number of queued vehicles on lane l at decision time t.
The number of queued vehicles on lane l at decision time t equals the number of queued vehicles on lane l at decision time t−1, plus or minus the vehicles that join or leave the queue at decision time t, specifically:

q_{t,l} = q_{t−1,l} + Σ_{v ∈ V_{t,l}} δ_{t,v}

where q_{t−1,l} is the number of queued vehicles on lane l at decision time t−1; V_{t,l} is the set of vehicles on lane l at decision time t; and δ_{t,v} indicates whether vehicle v joins or leaves the queue at decision time t, specifically:

δ_{t,v} = +1 if sp_{t−1,v} > Sp_Thr and sp_{t,v} ≤ Sp_Thr (vehicle v joins the queue); −1 if sp_{t−1,v} ≤ Sp_Thr and sp_{t,v} > Sp_Thr (vehicle v leaves the queue); 0 otherwise

where sp_{t−1,v} and sp_{t,v} are the speeds of vehicle v at decision time t−1 and decision time t, and Sp_Thr is the speed threshold for judging whether a vehicle has joined the queue.
The joint state of the signalized intersections is expressed as the vector of every intersection's observation, specifically:

s_t = [o_t^1, o_t^2, …, o_t^N]

where o_t^i is the observed value of the i-th signalized intersection at decision time t.
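The state definition above can be illustrated with a minimal sketch; the lane and phase data structures are assumptions for illustration, not prescribed by the embodiment:

```python
def phase_observation(queue_counts, phase_lanes):
    """o_t^i: for each phase j, the maximum number of queued vehicles
    over the set of lanes L_j that may pass during that phase."""
    return [max(queue_counts[l] for l in lanes) for lanes in phase_lanes]

def joint_state(observations):
    """s_t: the vector formed from every intersection's observation."""
    return [x for obs in observations for x in obs]
```

For example, with lane queues {0: 3, 1: 5, 2: 1} and phase lane sets [[0, 1], [2]], the observation is [5, 1].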
In the embodiment of the present invention, the signal control scheme is divided into action selection with a fixed phase order and action selection with a variable phase order, according to whether the order of the phases may change.

For a fixed phase order, the policy network at signalized intersection i has two optional actions at the decision time t of each iteration: a_t^i = 0 indicates continuing the current phase, and a_t^i = 1 indicates ending the current phase and switching to the next phase, specifically:

a_t^i ∈ {0, 1}
For a variable phase order, the policy network at signalized intersection i has n optional actions at the decision time t of each iteration, one per phase, specifically:

a_t^i ∈ {1, 2, …, n}

If at decision time t the policy network at signalized intersection i decides to continue the current phase, i.e. a_t^i = a_{p(t)}^i (the action at time t is the same as the action at the previous decision time p(t)), the duration of the current phase is extended by m seconds, where m is 1 to 5 seconds, and the next decision on whether to switch phases is made at time t + m. If at time t it decides to end the current phase and switch to the next phase, i.e. a_t^i ≠ a_{p(t)}^i (the action at time t differs from the action at the previous decision time p(t)), a minimum green time G_min of the phase is set, the intermediate yellow time Y is released, the phase is switched to the next phase, and the next decision on whether to switch phases is made at time t + G_min + Y + m.
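The decision-time bookkeeping just described can be sketched as follows; the default values of m, G_min, and Y are placeholders within the ranges the text allows:

```python
def next_decision_time(t, action, prev_action, m=3, g_min=10, yellow=3):
    """Return the next decision time: continuing the current phase
    (same action as at p(t)) re-decides after m seconds; ending it
    waits out the minimum green G_min and the yellow time Y before
    the next m-second decision window."""
    if action == prev_action:        # continue current phase
        return t + m
    return t + g_min + yellow + m    # end phase and switch to the next
```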
In the embodiment of the invention, an evaluation value calculation function of the value network at signalized intersection i is defined as the reduction in the maximum number of queued vehicles, specifically:

r_t^i = max_l L_{p(t),l} − max_l L_{t,l}

where L_{p(t),l} and L_{t,l} are the numbers of vehicles queued on lane l at decision time p(t) and decision time t, respectively.
Alternatively, the evaluation value calculation function of the value network at signalized intersection i is defined as the reduction in the accumulated total delay, specifically:

r_t^i = Σ_v Cd_{p(t),v} − Σ_v Cd_{t,v}

where Cd_{p(t),v} and Cd_{t,v} are the accumulated delays of queued vehicle v at decision time p(t) and decision time t, respectively.
The joint evaluation value calculation function of the value networks at multiple signalized intersections is expressed as a coupling of the evaluation value calculation function of the value network at each signalized intersection, specifically:

R_t^i = n · r_t^i + Σ_{j ∈ J(i)} r_t^j

where J(i) is the set of value networks other than the value network of signalized intersection i, and n is a non-negative constant of the joint reward function: when n = 0, the value network of signalized intersection i considers only the evaluation values of the other value networks in J(i); as n grows, it increasingly weights the local evaluation value of signalized intersection i.
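A sketch of the queue-based evaluation value and its coupling across intersections; the exact weighted-sum form of the joint function is an assumption consistent with the described behaviour of the constant n:

```python
def local_reward(max_queue_prev, max_queue_now):
    """r_t^i: reduction of the maximum number of queued vehicles
    between decision time p(t) and decision time t."""
    return max_queue_prev - max_queue_now

def joint_reward(i, local_rewards, n=1.0):
    """R_t^i: couple intersection i's own reward, weighted by the
    non-negative constant n, with the rewards of the other value
    networks J(i).  n = 0 ignores the local term entirely."""
    others = sum(r for j, r in enumerate(local_rewards) if j != i)
    return n * local_rewards[i] + others
```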
In the embodiment of the invention, when the value network is a double-delay deep Q network, the specific steps of training the value network and the strategy network are as follows:
1) inputting the reinforcement-learning-related parameters: the experience pool capacity max_size, the mini-batch size batch_size, the discount rate γ, the action value network learning rate α, the target value network learning rate β, the policy network learning rate η, and the termination iteration number N.
In the embodiment of the present invention, the specific values of the parameters may be set according to the needs of the actual application scenario and the experience of the user. A set of specific parameter values is provided here to help those skilled in the art understand the technical solution: the experience pool capacity max_size is set to 100,000; the mini-batch size batch_size is set to 32; the discount rate γ is set to 0.75; the action value network learning rate α is set to 0.0002; the target value network learning rate β is set to 0.001; the policy network learning rate η is set to 0.0002; and the termination iteration number N is set to 450,000.
2) initializing the elements of the experience pool E, the parameter ω of the action value network Q_ω, the parameter ω′ of the target value network Q_ω′, and the parameter θ of the policy network π_θ;
3) obtaining the joint traffic state s_t formed by the observed values o_t^i of each signalized intersection i at time t, together with the current phase current_phase;
4) while the iteration count is less than the termination iteration number N, the following steps are executed:
41) calculating a probability distribution over actions according to the policy network π_θ, and randomly sampling a signal control scheme a_t^i from that distribution;
42) when the signal control scheme a_t^i keeps the current phase state current_phase, extending the current phase duration by m seconds; when a_t^i does not keep the current phase state current_phase, releasing the minimum green time G_min of the current phase and, after the intermediate phase yellow time Y has elapsed, switching to the j-th phase;
43) calculating the evaluation value r_t^i of the value network at each signalized intersection i, constructing the joint evaluation value calculation function R_t^i, and calculating the joint traffic state s_{t+1} formed by the observed values o_{t+1}^i of each signalized intersection i at time t+1;
44) when the experience pool E has reached its maximum capacity max_size, removing the oldest experiences from E; otherwise, putting the experience (s_t, a_t^i, R_t^i, s_{t+1}) into the experience pool E;
45) when the number of experiences in the pool is larger than the mini-batch size batch_size, the following steps are executed:
451) randomly sampling a mini-batch of experiences from the experience pool E according to the experience priority values;
452) for each mini-batch experience sample (s_t, a_t^i, R_t^i, s_{t+1}), respectively calculating the value of the action value network Q_ω and the value of the target value network Q_ω′ at signalized intersection i, and obtaining the value of the baseline b;
453) calculating the value of the loss function L(ω) = (1/batch_size) Σ (R_t^i + γ Q_ω′(s_{t+1}, a′) − Q_ω(s_t, a_t^i))², and minimizing it with the Adam optimizer's gradient descent to update the parameter ω;
454) updating the parameter ω′ of the target value network Q_ω′ according to ω′ = βω′ + (1 − β)ω;
455) for each mini-batch experience sample (s_t, a_t^i, R_t^i, s_{t+1}), calculating the random strategy gradient of the policy network π_θ based on the Monte Carlo approximation, ∇_θ J(θ) ≈ (1/batch_size) Σ (Q_ω(s_t, a_t^i) − b) ∇_θ log π_θ(a_t^i | s_t), and updating the parameter θ with a gradient ascent algorithm;
46) assigning the traffic state s_{t+1} at time t+1 to s_t, and repeating steps 451) to 455).
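The numeric core of steps 453) to 455) can be sketched with NumPy. The TD-target form and the mini-batch averaging are standard constructions assumed to match the loss and gradient the text references, not a verbatim transcription of the embodiment:

```python
import numpy as np

def td_targets(rewards, next_q, gamma=0.75):
    """Step 453): TD targets y_t = R_t + gamma * Q_omega'(s_{t+1}, a'),
    with the bootstrap term supplied by the target value network."""
    return rewards + gamma * next_q

def soft_update(target_w, online_w, beta=0.001):
    """Step 454): omega' <- beta * omega' + (1 - beta) * omega."""
    return beta * target_w + (1.0 - beta) * online_w

def policy_gradient(grad_log_pi, q_values, baseline):
    """Step 455), Monte Carlo form: mini-batch average of
    (Q(s, a) - b) * grad_theta log pi(a | s)."""
    advantage = q_values - baseline
    return (grad_log_pi * advantage[:, None]).mean(axis=0)
```

The parameter θ would then be updated by gradient ascent, θ ← θ + η · ∇_θ J(θ), with the learning rate η from step 1).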
Example 2
Fig. 4 is a schematic block diagram of a traffic signal control system based on a random policy gradient according to an embodiment of the present invention, which may be used to implement the traffic signal control method based on a random policy gradient of embodiment 1 or any optional implementation thereof. As shown in fig. 4, the system includes: a first data acquisition module 10, a simulation drawing module 20, a second data acquisition module 30, a simulation checking module 40, an action evaluation module 50, an action sampling module 60, and a signal control module 70.
the first data acquisition module 10 is configured to acquire static road network data of at least one control signalized intersection.
The simulation drawing module 20 is used for visually drawing the traffic simulation road network according to the static road network data.
The second data acquiring module 30 is configured to acquire real-time traffic operation state data of at least one control signalized intersection.
The simulation checking module 40 is configured to perform parameter checking on simulation parameters in the traffic simulation road network according to the traffic operation state data, so as to obtain an optimized traffic simulation road network.
The action evaluation module 50 is configured to input the traffic state obtained by observing the optimized traffic simulation road network into the value network, obtain an evaluation value of each signal control scheme in the traffic state, and update parameters of the value network by using a time difference algorithm. In the embodiment of the invention, the value network is a pre-constructed neural network used for approximating the action value function.
The action sampling module 60 is configured to input the traffic status into the policy network, obtain a probability value of each signal control scheme, and perform random sampling according to the probability value of each signal control scheme to obtain one signal control scheme. In the embodiment of the invention, the strategy network is a pre-constructed neural network and is used for approximating the strategy function.
The signal control module 70 is configured to update parameters of the policy network through a random policy gradient based on the evaluation value of each signal control scheme in the traffic state and one signal control scheme.
An embodiment of the present invention further provides an electronic device, as shown in fig. 5, the electronic device may include a processor 51 and a memory 52, where the processor 51 and the memory 52 may be connected by a bus or in another manner, and fig. 5 takes the connection by the bus as an example.
The processor 51 may be a Central Processing Unit (CPU). The Processor 51 may also be other general purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, or combinations thereof.
The memory 52, as a non-transitory computer-readable storage medium, may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as program instructions/modules corresponding to the traffic signal control method based on the stochastic strategy gradient in the embodiment of the present invention (for example, the first data acquisition module 10, the simulation plotting module 20, the second data acquisition module 30, the simulation checking module 40, the action evaluation module 50, the action sampling module 60, and the signal control module 70 shown in fig. 4). The processor 51 executes various functional applications and data processing of the processor by running non-transitory software programs, instructions and modules stored in the memory 52, namely, implements the traffic signal control method based on the random strategy gradient in the above method embodiment.
The memory 52 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created by the processor 51, and the like. Further, the memory 52 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 52 may optionally include memory located remotely from the processor 51, and these remote memories may be connected to the processor 51 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The one or more modules are stored in the memory 52 and, when executed by the processor 51, perform a traffic signal control method based on random policy gradients as in the embodiments shown in fig. 1-3.
The specific details of the electronic device may be understood by referring to the corresponding descriptions and effects in the embodiments shown in fig. 1 to fig. 3, which are not described herein again.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic Disk, an optical Disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a Flash Memory (Flash Memory), a Hard Disk (Hard Disk Drive, abbreviated as HDD), a Solid State Drive (SSD), or the like; the storage medium may also comprise a combination of memories of the kind described above.
It should be understood that the above examples are only for clarity of illustration and are not intended to limit the embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to exhaust all embodiments here, and obvious variations or modifications derived therefrom remain within the scope of the invention.

Claims (6)

1. A traffic signal control method based on a random strategy gradient is characterized by comprising the following steps:
obtaining static road network data of at least one control signalized intersection;
visually drawing a traffic simulation road network according to the static road network data;
acquiring real-time traffic running state data of the at least one control signalized intersection;
performing parameter checking on simulation parameters in the traffic simulation road network according to the traffic operation state data to obtain an optimized traffic simulation road network;
inputting the traffic state obtained by observing the optimized traffic simulation road network into a value network to obtain an evaluation value of each signal control scheme under the traffic state, and updating parameters of the value network by adopting a time difference algorithm; the value network is a pre-constructed convolutional neural network and is used for approximating an action value function;
inputting the traffic state into a policy network to obtain a probability value of each signal control scheme, and randomly sampling according to the probability value of each signal control scheme to obtain a signal control scheme; the strategy network is a pre-constructed convolutional neural network and is used for approximating a strategy function;
updating parameters of the policy network by a random policy gradient based on the evaluation value of each signal control scheme in the traffic state and the one signal control scheme;
the traffic running state data comprises a time headway and vehicle acceleration and deceleration, and the step of performing parameter checking on simulation parameters in the traffic simulation road network according to the traffic running state data to obtain an optimized traffic simulation road network comprises the following steps:
acquiring the value ranges of the actual headway time distance parameter and the vehicle acceleration and deceleration parameter, and preliminarily checking the headway time distance parameter and the vehicle acceleration and deceleration parameter in the traffic simulation road network according to the value ranges;
observing the traffic simulation road network, acquiring a simulation headway time and a vehicle acceleration and deceleration, and comparing and analyzing the simulation headway time and the vehicle acceleration and deceleration with an actual headway time and a vehicle acceleration and deceleration;
if the difference between the simulated headway and the vehicle acceleration and deceleration and the difference between the actual headway and the vehicle acceleration and deceleration are within a preset range, finishing parameter checking to obtain the optimized traffic simulation road network; otherwise, repeating the steps until the difference is within a preset range;
the traffic state is represented as the maximum number of queued vehicles for each phase j at each signalized intersection, specifically:

o_t^i = [ max_{l ∈ L_j} q_{t,l} ], j = 1, …, n

wherein o_t^i represents the observed value of the traffic state of signalized intersection i at decision time t; i is the number of each signalized intersection, i ∈ {1, 2, …, N}; j is the phase number, j ∈ {1, 2, …, n}; t is the decision time; l is the lane number; L_j is the set of lanes that may pass during phase j; and q_{t,l} is the number of queued vehicles on lane l at decision time t;
the number of queued vehicles on lane l at decision time t equals the number of queued vehicles on lane l at decision time t−1, plus or minus the vehicles that join or leave the queue at decision time t, specifically:

q_{t,l} = q_{t−1,l} + Σ_{v ∈ V_{t,l}} δ_{t,v}

wherein q_{t−1,l} is the number of queued vehicles on lane l at decision time t−1; V_{t,l} is the set of vehicles on lane l at decision time t; and δ_{t,v} indicates whether vehicle v joins or leaves the queue at decision time t, specifically:

δ_{t,v} = +1 if sp_{t−1,v} > Sp_Thr and sp_{t,v} ≤ Sp_Thr (vehicle v joins the queue); −1 if sp_{t−1,v} ≤ Sp_Thr and sp_{t,v} > Sp_Thr (vehicle v leaves the queue); 0 otherwise

wherein sp_{t−1,v} and sp_{t,v} are the speeds of vehicle v at decision time t−1 and decision time t, and Sp_Thr is the speed threshold for judging whether a vehicle has joined the queue;
the joint state of the signalized intersections is expressed as the vector of every intersection's observation, specifically:

s_t = [o_t^1, o_t^2, …, o_t^N]

wherein o_t^i is the observed value of the i-th signalized intersection at decision time t;
the signal control scheme is divided into action selection with a fixed phase order and action selection with a variable phase order, according to whether the phase order may change;

for a fixed phase order, the policy network at signalized intersection i has two optional actions at the decision time t of each iteration: a_t^i = 0 indicates continuing the current phase, and a_t^i = 1 indicates ending the current phase and switching to the next phase, specifically:

a_t^i ∈ {0, 1}
for a variable phase order, the policy network at signalized intersection i has n optional actions at the decision time t of each iteration, specifically:

a_t^i ∈ {1, 2, …, n}

if at decision time t the policy network at signalized intersection i decides to continue the current phase, i.e. a_t^i = a_{p(t)}^i (the action at time t is the same as the action at the previous decision time p(t)), the duration of the current phase is extended by m seconds, where m is 1 to 5 seconds, and the next decision on whether to switch phases is made at time t + m; if at time t it decides to end the current phase and switch to the next phase, i.e. a_t^i ≠ a_{p(t)}^i (the action at time t differs from the action at the previous decision time p(t)), a minimum green time G_min of the phase is set, the intermediate yellow time Y is released, the phase is switched to the next phase, and the next decision on whether to switch phases is made at time t + G_min + Y + m;
an evaluation value calculation function of the value network at signalized intersection i is defined as the reduction in the maximum number of queued vehicles, specifically:

r_t^i = max_l L_{p(t),l} − max_l L_{t,l}

wherein L_{p(t),l} and L_{t,l} are the numbers of vehicles queued on lane l at decision time p(t) and decision time t, respectively;
or, the evaluation value calculation function of the value network at signalized intersection i is defined as the reduction in the accumulated total delay, specifically:

r_t^i = Σ_v Cd_{p(t),v} − Σ_v Cd_{t,v}

wherein Cd_{p(t),v} and Cd_{t,v} are the accumulated delays of queued vehicle v at decision time p(t) and decision time t, respectively;
the joint evaluation value calculation function of the value networks at multiple signalized intersections is expressed as a coupling of the evaluation value calculation function of the value network at each signalized intersection, specifically:

R_t^i = n · r_t^i + Σ_{j ∈ J(i)} r_t^j

wherein J(i) is the set of value networks other than the value network of signalized intersection i, and n is a non-negative constant of the joint reward function: when n = 0, the value network of signalized intersection i considers only the evaluation values of the other value networks in J(i); as n grows, it increasingly weights the local evaluation value of signalized intersection i.
2. The random policy gradient-based traffic signal control method according to claim 1, wherein the static road network data comprises part or all of road classes, number of lanes, lane widths, lane functional divisions, road segment lengths, road markings, intersection types, adjacent intersection information, signal device numbers, phase information, and phase sequence information;
the traffic running state data further comprises part or all of the device ID, the detection time, the traffic flow, the vehicle type distribution, the vehicle time occupancy, the vehicle space occupancy, the vehicle speed, the vehicle length, the distance headway, the queuing length, and the number of stops.
3. The method of claim 1, wherein the value network is a double-delay deep Q network comprising an action value network Q_ω(s, a) for selecting actions and a target value network Q_ω′(s, a) for calculating the Q value; the parameter ω = [ω_1, …, ω_i, …, ω_N] denotes the N parameters of the action value network, and the parameter ω′ = [ω′_1, …, ω′_i, …, ω′_N] denotes the N parameters of the target value network;
training the value network and the strategy network, comprising the steps of:
1) inputting reinforcement learning related parameters: experience pool capacity max _ size, small batch size batch _ size, discount rate gamma, action value network learning rate alpha, target value network learning rate beta, strategy network learning rate eta and termination iteration number N;
2) initializing the elements of the experience pool E, the parameter ω of the action value network Q_ω, the parameter ω′ of the target value network Q_ω′, and the parameter θ of the policy network π_θ;

3) obtaining the joint traffic state s_t formed by the observed values o_t^i of each signalized intersection i at time t, together with the current phase current_phase;
4) while the iteration count is less than the termination iteration number N, executing the following steps:
41) calculating a probability distribution over actions according to the policy network π_θ, and randomly sampling a signal control scheme a_t^i based on the probability distribution;
42) when the signal control scheme a_t^i keeps the current phase state current_phase, extending the current phase duration by m seconds; when a_t^i does not keep the current phase state current_phase, releasing the minimum green time G_min of the current phase and, after the intermediate phase yellow time Y has elapsed, switching to the j-th phase;
43) calculating the evaluation value r_t^i of the value network at each signalized intersection i, constructing the joint evaluation value calculation function R_t^i, and calculating the joint traffic state s_{t+1} formed by the observed values o_{t+1}^i of each signalized intersection i at time t+1;
44) when the experience pool E has reached its maximum capacity max_size, removing the oldest experiences from E; otherwise, putting the experience (s_t, a_t^i, R_t^i, s_{t+1}) into the experience pool E;
45) when the number of experiences in the pool is larger than the mini-batch size batch_size, executing the following steps:
451) randomly sampling a mini-batch of experiences from the experience pool E according to the experience priority values;
452) for each mini-batch experience sample (s_t, a_t^i, R_t^i, s_{t+1}), respectively calculating the value of the action value network Q_ω and the value of the target value network Q_ω′ at signalized intersection i, and obtaining the value of the baseline b;
453) calculating the value of the loss function L(ω) = (1/batch_size) Σ (R_t^i + γ Q_ω′(s_{t+1}, a′) − Q_ω(s_t, a_t^i))², and minimizing it with the Adam optimizer's gradient descent to update the parameter ω;
454) updating the parameter ω′ of the target value network Q_ω′ according to ω′ = βω′ + (1 − β)ω;
455) for each mini-batch experience sample (s_t, a_t^i, R_t^i, s_{t+1}), calculating the random strategy gradient of the policy network π_θ based on the Monte Carlo approximation, ∇_θ J(θ) ≈ (1/batch_size) Σ (Q_ω(s_t, a_t^i) − b) ∇_θ log π_θ(a_t^i | s_t), and updating the parameter θ with a gradient ascent algorithm;
46) assigning the traffic state s_{t+1} at time t+1 to s_t, and repeating steps 451) to 455).
4. A traffic signal control system based on a stochastic strategy gradient, comprising:
the first data acquisition module is used for acquiring static road network data of at least one control signalized intersection;
the simulation drawing module is used for drawing a traffic simulation road network according to the static road network data in a visualized manner;
the second data acquisition module is used for acquiring real-time traffic running state data of the at least one control signalized intersection;
the simulation checking module is used for performing parameter checking on the simulation parameters in the traffic simulation road network according to the traffic running state data to obtain an optimized traffic simulation road network;
the action evaluation module is used for inputting the traffic state observed from the optimized traffic simulation road network into a value network to obtain the evaluation value of each signal control scheme under the traffic state, and updating the parameters of the value network by a temporal-difference algorithm; the value network is a pre-constructed neural network used for approximating the action value function;
the action sampling module is used for inputting the traffic state into a policy network to obtain the probability value of each signal control scheme, and randomly sampling according to these probability values to obtain a signal control scheme; the policy network is a pre-constructed neural network used for approximating the policy function;
the signal control module is used for updating the parameters of the policy network through the random strategy gradient, based on the evaluation value of each signal control scheme under the traffic state and the selected signal control scheme;
the traffic running state data comprises the headway and the vehicle acceleration/deceleration, and the simulation checking module is used for:
acquiring the value ranges of the actual headway parameter and the vehicle acceleration/deceleration parameter, and preliminarily checking the headway parameter and the vehicle acceleration/deceleration parameter in the traffic simulation road network according to these value ranges;
observing the traffic simulation road network, acquiring the simulated headway and vehicle acceleration/deceleration, and comparing them with the actual headway and vehicle acceleration/deceleration;
if the differences between the simulated and actual headway and between the simulated and actual vehicle acceleration/deceleration are within a preset range, ending the parameter checking to obtain the optimized traffic simulation road network; otherwise, repeating the above steps until the differences are within the preset range;
the traffic state is expressed as the maximum number of queued vehicles in each phase j at each signalized intersection, specifically:
o_t^i = [ max_{l∈L_j} q_{t,l} ], j ∈ {1, 2, ..., n}
where o_t^i represents the observed value of the traffic state of signalized intersection i at decision time t; i is the number of each signalized intersection, i ∈ {1, 2, ..., N}; j is the phase number, j ∈ {1, 2, ..., n}; t is the decision time; l is the lane number; L_j is the group of lanes passable in phase j; q_{t,l} is the number of queued vehicles on lane l at decision time t;
the number of queued vehicles on lane l at decision time t equals the number of queued vehicles on lane l at decision time t−1, plus or minus the number of vehicles joining or leaving the queue at decision time t, specifically:
q_{t,l} = q_{t-1,l} + Σ_{v∈V_{t,l}} J_{t,v}
where q_{t-1,l} is the number of queued vehicles on lane l at decision time t−1, and V_{t,l} is the set of vehicles entering lane l at decision time t; J_{t,v} indicates whether vehicle v joins or leaves the queue at decision time t, specifically:
J_{t,v} = 1 if sp_{t-1,v} > Sp_Thr and sp_{t,v} ≤ Sp_Thr (the vehicle joins the queue); J_{t,v} = −1 if sp_{t-1,v} ≤ Sp_Thr and sp_{t,v} > Sp_Thr (the vehicle leaves the queue); J_{t,v} = 0 otherwise
where sp_{t-1,v} and sp_{t,v} are the speeds of vehicle v at decision time t−1 and decision time t, and Sp_Thr is the speed threshold for judging whether a vehicle has joined the queue;
the joint state of a plurality of signalized intersections is expressed as the vector of the observed values of the individual signalized intersections, specifically:
s_t = (o_t^1, o_t^2, ..., o_t^N)
where o_t^i represents the observed value of the i-th signalized intersection at decision time t;
the signal control scheme is divided into fixed-phase-order action selection and variable-phase-order action selection according to whether the phase order is changed;
for a fixed phase order, the policy network at signalized intersection i has two optional actions at the decision time t of each iteration: when a_t^i = 0, the current phase is continued; when a_t^i = 1, the current phase is ended and switched to the next phase, specifically:
a_t^i ∈ {0, 1}
for a variable phase order, the policy network at signalized intersection i has n optional actions at the decision time t of each iteration, one per phase, specifically:
a_t^i ∈ {1, 2, ..., n}
if at decision time t the policy network at signalized intersection i decides to continue the current phase, i.e. a_t^i = a_{p(t)}^i (the action at time t is the same as the action at the previous decision time p(t)), the duration of the current phase is extended by m seconds, where m is 1 to 5 seconds, and the next decision time for judging whether to switch the phase is t + m; if at time t it decides to end the current phase and switch to the next phase, i.e. a_t^i ≠ a_{p(t)}^i (the action at time t differs from the action at the previous decision time p(t)), a minimum green time G_min of the phase is first enforced, the yellow time Y of the intermediate phase then elapses before the phase is switched, and the next decision time for judging whether to switch the phase is t + G_min + Y + m;
an evaluation value calculation function of the value network at signalized intersection i is defined as the reduction of the maximum number of queued vehicles, specifically:
r_t^i = max_l L_{p(t),l} − max_l L_{t,l}
where L_{p(t),l} and L_{t,l} are the numbers of queued vehicles on lane l at decision time p(t) and decision time t, respectively;
or, the evaluation value calculation function of the value network at signalized intersection i is defined as the reduction of the maximum total delay, specifically:
r_t^i = Σ_v Cd_{p(t),v} − Σ_v Cd_{t,v}
where Cd_{p(t),v} and Cd_{t,v} are the accumulated total delays of the queued vehicles at decision time p(t) and decision time t, respectively;
the joint evaluation value calculation function of the value networks at a plurality of signalized intersections is expressed as the coupling of the evaluation value calculation functions of the value networks at the individual signalized intersections, specifically:
Figure FDA0003629056780000084
where J(i) is the set of value networks other than the value network of signalized intersection i; in the joint reward function, n is a non-negative constant: when n = 0, the value network of signalized intersection i considers only the evaluation values of the other value networks J(i), and as n becomes larger, the value network of signalized intersection i considers mainly its own local evaluation value.
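The state and reward definitions above — the maximum queue per phase as the observation, queue join/leave detection by a speed threshold, and the reward as the reduction of the maximum queue — can be sketched as plain functions. This is an illustrative reading of the claim, not the patent's implementation; the threshold value and all names are assumptions.

```python
SP_THR = 5.0  # speed threshold Sp_Thr for queue membership (illustrative value)

def join_leave(speed_prev, speed_now, thr=SP_THR):
    """+1 if the vehicle drops below the threshold at time t (joins the queue),
    -1 if it rises above it (leaves the queue), 0 otherwise."""
    if speed_prev > thr >= speed_now:
        return 1
    if speed_prev <= thr < speed_now:
        return -1
    return 0

def update_lane_queue(q_prev, speed_pairs, thr=SP_THR):
    """q_{t,l} = q_{t-1,l} plus the net number of joins among the vehicles on
    lane l, each given as a (speed at t-1, speed at t) pair."""
    return q_prev + sum(join_leave(s0, s1, thr) for s0, s1 in speed_pairs)

def observed_state(queue_by_lane, phase_lanes):
    """Observed value o_t^i: the maximum queued-vehicle count over the
    passable lanes L_j of each phase j."""
    return [max(queue_by_lane[l] for l in lanes) for lanes in phase_lanes]

def queue_reduction_reward(queue_prev_by_lane, queue_now_by_lane):
    """Evaluation value r_t^i: reduction of the maximum number of queued
    vehicles between decision times p(t) and t."""
    return max(queue_prev_by_lane.values()) - max(queue_now_by_lane.values())
```

A positive reward then means the chosen signal scheme shortened the worst queue at the intersection, which is exactly what the value network is trained to predict.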
5. An electronic device, comprising: a memory and a processor, the memory and the processor being communicatively connected to each other, the memory storing therein computer instructions, and the processor executing the computer instructions to perform the traffic signal control method based on random strategy gradients of any one of claims 1-3.
6. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer instructions for causing the computer to execute the method for stochastic policy gradient based traffic signal control according to any of claims 1 to 3.
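The decision-timing rule in the claims — extend the phase by m seconds when the sampled action repeats, otherwise enforce a minimum green G_min plus the yellow time Y before the next decision — reduces to a small scheduling function. The sketch below is illustrative; the constant values are assumptions, not taken from the patent.

```python
def next_decision_time(t, action_now, action_prev, m=5, g_min=10, y=3):
    """Next time at which the policy network is queried again.
    - same action: the current phase is extended by m seconds;
    - different action: the new phase runs at least g_min seconds of green,
      after y seconds of intermediate yellow, before the next m-second step."""
    if action_now == action_prev:
        return t + m          # keep the phase, decide again at t + m
    return t + g_min + y + m  # switch: decide again at t + G_min + Y + m
```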
CN202011459044.2A 2020-12-11 2020-12-11 Traffic signal control method and system based on random strategy gradient and electronic equipment Active CN112614343B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011459044.2A CN112614343B (en) 2020-12-11 2020-12-11 Traffic signal control method and system based on random strategy gradient and electronic equipment
PCT/CN2021/124593 WO2022121510A1 (en) 2020-12-11 2021-10-19 Stochastic policy gradient-based traffic signal control method and system, and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011459044.2A CN112614343B (en) 2020-12-11 2020-12-11 Traffic signal control method and system based on random strategy gradient and electronic equipment

Publications (2)

Publication Number Publication Date
CN112614343A CN112614343A (en) 2021-04-06
CN112614343B true CN112614343B (en) 2022-08-19

Family

ID=75234428

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011459044.2A Active CN112614343B (en) 2020-12-11 2020-12-11 Traffic signal control method and system based on random strategy gradient and electronic equipment

Country Status (2)

Country Link
CN (1) CN112614343B (en)
WO (1) WO2022121510A1 (en)

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112614343B (en) * 2020-12-11 2022-08-19 多伦科技股份有限公司 Traffic signal control method and system based on random strategy gradient and electronic equipment
CN113362618B (en) * 2021-06-03 2022-08-09 东南大学 Multi-mode traffic adaptive signal control method and device based on strategy gradient
CN114038217B (en) * 2021-10-28 2023-11-17 李迎 Traffic signal configuration and control method
CN114446066B (en) * 2021-12-30 2023-05-16 银江技术股份有限公司 Road signal control method and device
CN114613159B (en) * 2022-02-10 2023-07-28 北京箩筐时空数据技术有限公司 Traffic signal lamp control method, device and equipment based on deep reinforcement learning
CN114743388B (en) * 2022-03-22 2023-06-20 中山大学·深圳 Multi-intersection signal self-adaptive control method based on reinforcement learning
CN115100850A (en) * 2022-04-21 2022-09-23 浙江省交通投资集团有限公司智慧交通研究分公司 Hybrid traffic flow control method, medium, and apparatus based on deep reinforcement learning
CN114898576B (en) * 2022-05-10 2023-12-19 阿波罗智联(北京)科技有限公司 Traffic control signal generation method and target network model training method
CN115331428B (en) * 2022-07-05 2023-10-17 成利吉(厦门)智能股份有限公司 Traffic signal optimization method based on rule base
CN115171408B (en) * 2022-07-08 2023-05-30 华侨大学 Traffic signal optimization control method
CN115310278A (en) * 2022-07-28 2022-11-08 东南大学 Simulation method and verification method for large-scale road network online micro traffic
CN115440042B (en) * 2022-09-02 2024-02-02 吉林大学 Multi-agent constraint strategy optimization-based signalless intersection cooperative control method
CN115762128B (en) * 2022-09-28 2024-03-29 南京航空航天大学 Deep reinforcement learning traffic signal control method based on self-attention mechanism
CN116153065A (en) * 2022-12-29 2023-05-23 山东大学 Intersection traffic signal refined optimization method and device under vehicle-road cooperative environment
CN116597672B (en) * 2023-06-14 2024-02-13 南京云创大数据科技股份有限公司 Regional signal lamp control method based on multi-agent near-end strategy optimization algorithm
CN117151441B (en) * 2023-10-31 2024-01-30 长春工业大学 Replacement flow workshop scheduling method based on actor-critique algorithm
CN117173914B (en) * 2023-11-03 2024-01-26 中泰信合智能科技有限公司 Road network signal control unit decoupling method, device and medium for simplifying complex model
CN117275259B (en) * 2023-11-20 2024-02-06 北京航空航天大学 Multi-intersection cooperative signal control method based on field information backtracking
CN117671977A (en) * 2024-02-01 2024-03-08 银江技术股份有限公司 Signal lamp control method, system, device and medium for traffic trunk line

Family Cites Families (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE102008049568A1 (en) * 2008-09-30 2010-04-08 Siemens Aktiengesellschaft A method of optimizing traffic control at a traffic signal controlled node in a road traffic network
US8571743B1 (en) * 2012-04-09 2013-10-29 Google Inc. Control of vehicles based on auditory signals
CN105955930A (en) * 2016-05-06 2016-09-21 天津科技大学 Guidance-type policy search reinforcement learning algorithm
CN109215355A (en) * 2018-08-09 2019-01-15 北京航空航天大学 A kind of single-point intersection signal timing optimization method based on deeply study
CN109472984A (en) * 2018-12-27 2019-03-15 苏州科技大学 Signalized control method, system and storage medium based on deeply study
CN109559530B (en) * 2019-01-07 2020-07-14 大连理工大学 Multi-intersection signal lamp cooperative control method based on Q value migration depth reinforcement learning
CN110047278B (en) * 2019-03-30 2021-06-08 北京交通大学 Adaptive traffic signal control system and method based on deep reinforcement learning
CN111833590B (en) * 2019-04-15 2021-12-07 北京京东尚科信息技术有限公司 Traffic signal lamp control method and device and computer readable storage medium
CN110428615B (en) * 2019-07-12 2021-06-22 中国科学院自动化研究所 Single intersection traffic signal control method, system and device based on deep reinforcement learning
CN110673602B (en) * 2019-10-24 2022-11-25 驭势科技(北京)有限公司 Reinforced learning model, vehicle automatic driving decision method and vehicle-mounted equipment
CN110930734A (en) * 2019-11-30 2020-03-27 天津大学 Intelligent idle traffic indicator lamp control method based on reinforcement learning
CN111311945B (en) * 2020-02-20 2021-07-09 南京航空航天大学 Driving decision system and method fusing vision and sensor information
CN111696370B (en) * 2020-06-16 2021-09-03 西安电子科技大学 Traffic light control method based on heuristic deep Q network
CN111737826B (en) * 2020-07-17 2020-11-24 北京全路通信信号研究设计院集团有限公司 Rail transit automatic simulation modeling method and device based on reinforcement learning
CN112052456A (en) * 2020-08-31 2020-12-08 浙江工业大学 Deep reinforcement learning strategy optimization defense method based on multiple intelligent agents
CN112614343B (en) * 2020-12-11 2022-08-19 多伦科技股份有限公司 Traffic signal control method and system based on random strategy gradient and electronic equipment

Also Published As

Publication number Publication date
WO2022121510A1 (en) 2022-06-16
CN112614343A (en) 2021-04-06

Similar Documents

Publication Publication Date Title
CN112614343B (en) Traffic signal control method and system based on random strategy gradient and electronic equipment
CN111696370B (en) Traffic light control method based on heuristic deep Q network
CN112669629B (en) Real-time traffic signal control method and device based on deep reinforcement learning
CN114758497B (en) Adaptive parking lot variable entrance and exit control method, device and storage medium
WO2021051930A1 (en) Signal adjustment method and apparatus based on action prediction model, and computer device
CN114463997A (en) Lantern-free intersection vehicle cooperative control method and system
CN110942627A (en) Road network coordination signal control method and device for dynamic traffic
CN115019523B (en) Deep reinforcement learning traffic signal coordination optimization control method based on minimized pressure difference
CN114419884B (en) Self-adaptive signal control method and system based on reinforcement learning and phase competition
CN113515892B (en) Multi-agent traffic simulation parallel computing method and device
CN113724507A (en) Traffic control and vehicle induction cooperation method and system based on deep reinforcement learning
GB2607880A (en) Traffic control system
Chentoufi et al. A hybrid particle swarm optimization and tabu search algorithm for adaptive traffic signal timing optimization
US20230162539A1 (en) Driving decision-making method and apparatus and chip
CN115472023B (en) Intelligent traffic light control method and device based on deep reinforcement learning
CN115083149B (en) Reinforced learning variable duration signal lamp control method for real-time monitoring
CN116189454A (en) Traffic signal control method, device, electronic equipment and storage medium
Huang et al. A modified cell transmission model considering queuing characteristics for channelized zone at signalized intersections
CN115981302A (en) Vehicle following lane change behavior decision-making method and device and electronic equipment
Chen et al. Traffic signal optimization control method based on adaptive weighted averaged double deep Q network
CN113268857A (en) Urban expressway intersection area micro traffic simulation method and device based on multiple intelligent agents
CN114299714B (en) Multi-turn-channel coordination control method based on different strategy reinforcement learning
CN114639255B (en) Traffic signal control method, device, equipment and medium
CN113753049B (en) Social preference-based automatic driving overtaking decision determination method and system
CN116895158A (en) Urban road network traffic signal control method based on multi-agent Actor-Critic and GRU

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CB03 Change of inventor or designer information

Inventor after: Zheng Peiyu

Inventor after: Zhang Daoyang

Inventor after: Tao Gang

Inventor after: Chen Bo

Inventor after: Li Zhibin

Inventor after: Chen Bing

Inventor after: Yang Guang

Inventor before: Zheng Peiyu

Inventor before: Tao Gang

Inventor before: Chen Bo

Inventor before: Li Zhibin

Inventor before: Chen Bing

Inventor before: Yang Guang