CN112614343A — Traffic signal control method and system based on stochastic policy gradient, and electronic device — Google Patents

Traffic signal control method and system based on stochastic policy gradient, and electronic device

Publication number: CN112614343A (application CN202011459044.2A); granted publication CN112614343B
Authority: CN (China)
Original language: Chinese (zh)
Inventors: 郑培余, 陶刚, 陈波, 李志斌, 陈冰, 杨光
Assignee (original and current): Duolun Technology Co Ltd
Priority: CN202011459044.2A; PCT application PCT/CN2021/124593 (WO2022121510A1)
Legal status: Granted; Active

Classifications

    • G08G1/0125 — Traffic control systems for road vehicles; detecting movement of traffic to be counted or controlled; measuring and analyzing of parameters relative to traffic conditions; traffic data processing
    • G06F30/18 — Computer-aided design [CAD]; geometric CAD; network design, e.g. design based on topological or interconnect aspects of utility systems, piping, heating ventilation air conditioning [HVAC] or cabling
    • G06F30/20 — Computer-aided design [CAD]; design optimisation, verification or simulation
    • G06N3/045 — Neural networks; architecture, e.g. interconnection topology; combinations of networks
    • G06N3/088 — Neural networks; learning methods; non-supervised learning, e.g. competitive learning
    • G08G1/0137 — Measuring and analyzing of parameters relative to traffic conditions for specific applications
    • G08G1/08 — Controlling traffic signals according to detected number or speed of vehicles
    • Y02T10/40 — Climate change mitigation technologies related to transportation; engine management systems

Abstract

The invention discloses a traffic signal control method, system, and electronic device based on a stochastic policy gradient. The method comprises the following steps: obtaining static road network data of at least one controlled signalized intersection; visually drawing a traffic simulation road network from the static road network data; acquiring real-time traffic operation state data of the at least one controlled signalized intersection; calibrating the simulation parameters of the traffic simulation road network against the traffic operation state data to obtain an optimized traffic simulation road network; inputting the traffic state observed from the optimized traffic simulation road network into a value network to obtain an evaluation value for each signal control scheme under that traffic state; inputting the traffic state into a policy network to obtain a probability value for each signal control scheme; and updating the parameters of the policy network through stochastic policy gradients based on the evaluation value of each signal control scheme under the traffic state and one sampled signal control scheme. The method can solve the dimension-explosion problem of signal control.

Description

Traffic signal control method and system based on stochastic policy gradient, and electronic device
Technical Field
The invention relates to the technical field of intelligent transportation, and in particular to a traffic signal control method and system based on a stochastic policy gradient, and an electronic device.
Background
To meet rapidly growing urban traffic demand, cities must not only build new road infrastructure to raise overall traffic capacity but also improve their existing traffic infrastructure, raising the efficiency of existing roads through intelligent traffic management and control technology. As key nodes of the urban road network, intersections are one of the research hotspots of urban intelligent traffic control. An urban intelligent traffic control system treats the intersection as a real-time controllable system and performs continuous real-time monitoring, diagnosis, modeling, and control of it. However, a traditional intersection signal control system based on a fixed timing scheme cannot adapt to the nonlinearity, randomness, fuzziness, and uncertainty of a traffic system.
An adaptive traffic signal control system can respond in time to dynamic changes in traffic flow and optimize the signal timing scheme for real-time control. However, existing adaptive traffic signal control systems have the following limitations: (1) optimizing several intersections simultaneously leads to a dimension-explosion problem; (2) they lack an accurate traffic model framework representing the dynamics and randomness of traffic flow in response to changes in signal control; (3) detector failures and communication failures greatly affect system stability.
Reinforcement learning is an unsupervised machine learning method that can learn a control policy by interacting directly with a traffic simulation road network. The agent observes the simulated road network to obtain a state and selects an action from the action set according to its policy function; after the action is executed, the simulated road network feeds back a reward signal evaluating the quality of the selected action and transitions to the next state. The agent repeats this process until an episode ends, seeking the maximum cumulative reward. Adaptive traffic signal control based on reinforcement learning can therefore adapt to the dynamics and randomness of a traffic system, and has clear advantages over traditional fixed-timing signal control and actuated signal control. However, in traditional reinforcement learning such as Q-learning, actions are selected according to Q values stored in a Q table, which can handle only a limited number of state-action pairs and cannot handle a huge state space; an oversized state space causes a dimension-explosion problem, leading to low policy-learning efficiency and low accuracy.
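The observe-act-reward loop described above can be sketched generically. The environment below is a hypothetical stand-in (a single queue shrunk by a "green" action), not the patent's simulator; it only illustrates how an agent accumulates reward over one episode.

```python
import random

class ToyTrafficEnv:
    """Hypothetical stand-in for the simulated road network: the state is a
    queue length; action 1 (give green) shortens it, action 0 lets it grow."""
    def __init__(self, horizon=20):
        self.horizon = horizon

    def reset(self):
        self.t, self.queue = 0, 5
        return self.queue

    def step(self, action):
        self.queue = max(0, self.queue - 2) if action == 1 else self.queue + 1
        self.t += 1
        reward = -self.queue              # fewer queued vehicles = better
        done = self.t >= self.horizon     # one episode ends after `horizon` steps
        return self.queue, reward, done

def run_episode(env, policy, seed=0):
    """Agent loop: observe state, choose action from the policy, collect reward."""
    random.seed(seed)
    state, total, done = env.reset(), 0.0, False
    while not done:
        action = policy(state)
        state, reward, done = env.step(action)
        total += reward                   # cumulative reward over the episode
    return total

greedy = lambda s: 1 if s > 0 else 0      # serve the queue whenever it is non-empty
```

For example, `run_episode(ToyTrafficEnv(), greedy)` returns the cumulative reward of one episode; a policy that never gives green scores far worse, which is exactly the reward signal the learning agent exploits.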
Disclosure of Invention
The invention provides a traffic signal control method, system, and electronic device based on a stochastic policy gradient, aiming to solve the problems in the prior art that traditional reinforcement learning for traffic signal control can handle only a limited number of state-action pairs and cannot handle a huge state space, and that an oversized state space causes a dimension explosion, leading to low policy-learning efficiency and low accuracy.
According to a first aspect, the invention provides a traffic signal control method based on a stochastic policy gradient, comprising the following steps:
obtaining static road network data of at least one controlled signalized intersection;
visually drawing a traffic simulation road network according to the static road network data;
acquiring real-time traffic operation state data of the at least one controlled signalized intersection;
calibrating the simulation parameters of the traffic simulation road network according to the traffic operation state data to obtain an optimized traffic simulation road network;
inputting the traffic state observed from the optimized traffic simulation road network into a value network to obtain an evaluation value for each signal control scheme under that traffic state, and updating the parameters of the value network with a temporal-difference algorithm; the value network is a pre-constructed convolutional neural network used to approximate the action value function;
inputting the traffic state into a policy network to obtain a probability value for each signal control scheme, and randomly sampling according to these probability values to obtain one signal control scheme; the policy network is a pre-constructed convolutional neural network used to approximate the policy function;
and updating the parameters of the policy network through stochastic policy gradients based on the evaluation value of each signal control scheme under the traffic state and the one sampled signal control scheme.
Optionally, the traffic operation state data includes the headway and the acceleration/deceleration of vehicles, and the step of calibrating the simulation parameters of the traffic simulation road network according to the traffic operation state data to obtain the optimized traffic simulation road network comprises:
obtaining the value ranges of the actual headway parameter and the vehicle acceleration/deceleration parameter, and preliminarily calibrating the headway and vehicle acceleration/deceleration parameters in the traffic simulation road network according to those ranges;
observing the traffic simulation road network, acquiring the simulated headway and vehicle acceleration/deceleration, and comparing them with the actual headway and vehicle acceleration/deceleration;
if the differences between the simulated and actual headway and between the simulated and actual vehicle acceleration/deceleration are within a preset range, finishing the calibration to obtain the optimized traffic simulation road network; otherwise, repeating the above steps until the differences are within the preset range.
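The calibration loop above can be sketched as follows. The parameter names, the tolerance, and the simple nudge-toward-measurement update rule are illustrative assumptions of this sketch, not the patent's procedure.

```python
def calibrate(sim_params, observed, bounds, tol=0.05, max_iter=100):
    """Iteratively adjust simulation parameters (e.g. headway, accel/decel)
    until each simulated value is within `tol` of the field measurement.

    sim_params / observed: dicts like {"headway": 2.4, "accel": 2.0}
    bounds: {"headway": (lo, hi), ...} -- the preliminary value-range check
    """
    # Preliminary check: clip each parameter into its plausible real-world range
    for k, (lo, hi) in bounds.items():
        sim_params[k] = min(max(sim_params[k], lo), hi)

    for _ in range(max_iter):
        # Compare simulated values against field data
        diffs = {k: sim_params[k] - observed[k] for k in observed}
        if all(abs(d) <= tol for d in diffs.values()):
            return sim_params          # calibration finished
        # Otherwise nudge each parameter toward the measurement and repeat
        for k, d in diffs.items():
            sim_params[k] -= 0.5 * d
    raise RuntimeError("calibration did not converge")
```

In practice the "simulated value" would come from re-running the simulation after each adjustment; here the dict stands in for that observation step.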
Optionally, the static road network data includes some or all of: road grade, number of lanes, lane width, lane function division, road length, road markings, intersection type, adjacent-intersection information, signal equipment number, phase information, and phase sequence information;
the traffic operation state data may further include some or all of: device ID, detection time, traffic volume, vehicle type distribution, vehicle time occupancy, vehicle space occupancy, vehicle speed, vehicle length, headway distance, queue length, and number of stops.
Optionally, the traffic state is expressed as the maximum number of queued vehicles for each phase j at each signalized intersection, specifically:

o_t^i = [ max_{l ∈ L_j} q_{t,l} ], j = 1, …, n

where o_t^i represents the observed traffic state of signalized intersection i at decision time t; i is the number of the signalized intersection, i ∈ {1, 2, …, N}; j is the phase number, j ∈ {1, 2, …, n}; t is the decision time; l is the lane number; L_j denotes the set of lanes that may pass during phase j; and q_{t,l} denotes the number of queued vehicles on lane l at decision time t.

The number of queued vehicles on lane l at decision time t equals the number of queued vehicles on lane l at decision time t−1 plus or minus the vehicles that join or leave the queue at decision time t, specifically:

q_{t,l} = q_{t−1,l} + Σ_{v ∈ V_{t,l}} δ_{t,v}

where q_{t−1,l} is the number of queued vehicles on lane l at decision time t−1; V_{t,l} is the set of vehicles entering lane l at decision time t; and δ_{t,v} indicates whether vehicle v joins or leaves the queue at decision time t, judged from the vehicle's speed:

δ_{t,v} = +1 if sp_{t−1,v} ≥ SpThr and sp_{t,v} < SpThr (the vehicle joins the queue); δ_{t,v} = −1 if sp_{t−1,v} < SpThr and sp_{t,v} ≥ SpThr (the vehicle leaves the queue); δ_{t,v} = 0 otherwise

where sp_{t−1,v} and sp_{t,v} are the speeds of vehicle v at decision times t−1 and t, and SpThr is the speed threshold for judging whether a vehicle has joined the queue.

The joint state of the signalized intersections is expressed as the vector of the observations of all signalized intersections, specifically:

s_t = [o_t^1, o_t^2, …, o_t^N]

where o_t^i is the observation of the i-th signalized intersection at decision time t.
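The state construction above can be computed directly from per-lane queue counts. The speed threshold value below is an illustrative assumption; the phase-to-lane mapping and lane IDs are likewise hypothetical.

```python
def queue_delta(speed_prev, speed_now, sp_thr=0.3):
    """+1 if the vehicle drops below the speed threshold (joins the queue),
    -1 if it rises back above it (leaves the queue), 0 otherwise."""
    if speed_prev >= sp_thr and speed_now < sp_thr:
        return 1
    if speed_prev < sp_thr and speed_now >= sp_thr:
        return -1
    return 0

def observation(queues, phase_lanes):
    """o_t^i: for each phase j, the maximum queue length over its lane set L_j.

    queues: {lane_id: queued-vehicle count q_{t,l}}
    phase_lanes: {phase number j: [lane ids passable during phase j]}
    """
    return [max(queues[l] for l in lanes) for j, lanes in sorted(phase_lanes.items())]

def joint_state(observations):
    """s_t: the observations of all N signalized intersections stacked into
    one vector."""
    return [o for obs in observations for o in obs]
```

For example, an intersection with lanes a, b open in phase 1 and lane c in phase 2 yields a two-element observation, and two such intersections yield a four-element joint state.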
Optionally, the signal control schemes are divided into fixed-phase-order action selection and variable-phase-order action selection according to whether the order of the phases may change.

For a fixed phase order, the policy network at signalized intersection i has two optional actions at the decision time t of each iteration: a_t^i = 0 indicates continuing the current phase, and a_t^i = 1 indicates ending the current phase and switching to the next phase, specifically:

a_t^i ∈ {0, 1}

For a variable phase order, the policy network at signalized intersection i has n optional actions at the decision time t of each iteration (one per phase), specifically:

a_t^i ∈ {1, 2, …, n}

If, at decision time t, the policy network at signalized intersection i decides to continue the current phase, i.e. a_t^i = a_{p(t)}^i — the action at time t is the same as the action at the previous decision time p(t) — then the duration of the current phase is extended by m seconds, with m = 1 to 5 seconds, and the next decision on whether to switch phases is made at time t + m. If, at time t, it decides to end the current phase and switch to the next phase, i.e. a_t^i ≠ a_{p(t)}^i — the action at time t differs from the action at the previous decision time p(t) — then a minimum green time G_min of the phase is first enforced, the intermediate yellow time Y is then released, the phase is switched to the next phase, and the next decision on whether to switch phases is made at time t + G_min + Y + m.
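The decision-timing rule above (extend by m seconds when the action repeats; otherwise minimum green plus yellow before the next decision) can be sketched as follows. The default values of m, G_min, and Y are illustrative only; the patent specifies m = 1 to 5 seconds and leaves G_min and Y to the signal plan.

```python
def next_decision_time(t, action, prev_action, m=3, g_min=10, yellow=3):
    """Return (next decision time, whether the phase switches).

    t: current decision time; prev_action: the action at p(t).
    """
    if action == prev_action:
        # Continue the current phase: extend it by m seconds
        return t + m, False
    # Switch: enforce the minimum green G_min, then the yellow interval Y,
    # then allow the next switching decision m seconds later
    return t + g_min + yellow + m, True
```

This keeps the controller from oscillating: a switch decision always commits to at least G_min + Y seconds of the new phase before the next decision point.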
Optionally, the evaluation value (reward) function of the value network at signalized intersection i is defined as the reduction in the maximum number of queued vehicles, specifically:

r_t^i = max_l L_{p(t),l} − max_l L_{t,l}

where L_{p(t),l} and L_{t,l} are the numbers of vehicles queued on lane l at decision time p(t) and decision time t, respectively.

Alternatively, the evaluation value function of the value network at signalized intersection i is defined as the reduction in the maximum total delay, specifically:

r_t^i = Σ_v Cd_{p(t),v} − Σ_v Cd_{t,v}

where Cd_{p(t),v} and Cd_{t,v} are the accumulated total delays of the queued vehicles at decision time p(t) and decision time t, respectively.

The joint evaluation value of the value networks at multiple signalized intersections is expressed as a coupling of the evaluation value of the value network at each signalized intersection, specifically:

R_t^i = n · r_t^i + Σ_{j ∈ J(i)} r_t^j

where J(i) is the set of value networks other than the value network of signalized intersection i, and n is a non-negative weighting constant of the joint reward function: when n = 0, the value network of signalized intersection i considers only the evaluation values of the other value networks J(i); as n grows, it increasingly considers only the local evaluation value of signalized intersection i itself.
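A minimal sketch of the two reward definitions and the joint coupling follows. The linear form of the coupling (local reward scaled by the weighting constant, plus the neighbours' rewards) is an assumption of this sketch, reconstructed from the limiting behaviour described above.

```python
def queue_reward(queues_prev, queues_now):
    """Reduction in the maximum number of queued vehicles between the
    previous decision time p(t) and the current decision time t."""
    return max(queues_prev) - max(queues_now)

def delay_reward(delays_prev, delays_now):
    """Reduction in the accumulated total delay of the queued vehicles."""
    return sum(delays_prev) - sum(delays_now)

def joint_reward(local, neighbours, weight):
    """Couple intersection i's local reward with its neighbours' rewards.
    With weight = 0 only the neighbours count; as the weight grows the
    local term dominates."""
    return weight * local + sum(neighbours)
```

Positive rewards thus mean the chosen signal scheme shortened queues (or delay) since the last decision, which is what the value network learns to predict.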
Optionally, the value network is a twin-delayed ("double delay") deep Q-network, comprising an action-value network Q(s, a; ω) used to select actions and a target-value network Q′(s, a; ω′) used to calculate the target Q value. The parameter vector ω = [ω_1, …, ω_i, …, ω_N] collects the N parameters of the action-value network, and ω′ = [ω′_1, …, ω′_i, …, ω′_N] the N parameters of the target-value network.
Training the value network and the policy network comprises the following steps:
1) Input the reinforcement-learning parameters: experience-pool capacity max_size, mini-batch size batch_size, discount rate γ, action-value-network learning rate α, target-value-network learning rate β, policy-network learning rate η, and terminating iteration number N.
2) Initialize the elements of the experience pool E, the parameters ω of the action-value network Q(s, a; ω), the parameters ω′ of the target-value network Q′(s, a; ω′), and the parameters θ of the policy network π(a|s; θ).
3) Obtain the joint traffic state s_t formed by the observations o_t^i of each signalized intersection i at time t, together with the current phase current_phase.
4) While the iteration number is less than the terminating iteration number N, execute the following steps:
41) Compute the probability distribution output by the policy network π(a|s_t; θ) and randomly sample from it to obtain a signal control scheme a_t^i.
42) If the signal control scheme a_t^i keeps the current phase current_phase, extend the duration of the current phase by m seconds; if it does not, release the minimum green time G_min of the current phase, wait for the intermediate yellow time Y to finish, and switch to the j-th phase.
43) Compute the evaluation value r_t^i of the value network at signalized intersection i, construct the joint evaluation value R_t^i, and compute the joint traffic state s_{t+1} formed by the observations o_{t+1}^i of each signalized intersection i at time t+1.
44) If the experience pool E has reached its maximum capacity max_size, remove the oldest experience from E; then put the experience (s_t, a_t, R_t, s_{t+1}) into the experience pool E.
45) When the experience pool holds more than batch_size experiences, execute the following steps:
451) Randomly sample a mini-batch from the experience pool E according to the experience priority values.
452) For each experience sample (s_t, a_t, R_t, s_{t+1}) in the mini-batch, compute the value of the action-value network Q(s_t, a_t; ω_i) and of the target-value network Q′(s_{t+1}, a; ω′_i) at signalized intersection i, and obtain the baseline value b.
453) Compute the value of the loss function from the temporal-difference error, L(ω) = E[(R_t + γ Q′(s_{t+1}, a; ω′) − Q(s_t, a_t; ω))²], and minimize it with the gradient-descent method of the Adam optimizer to update the parameters ω.
454) Update the parameters ω′ of the target-value network according to the soft update ω′ = βω′ + (1 − β)ω.
455) For each experience sample in the mini-batch, compute the stochastic policy gradient of the policy network π(a|s; θ) by the Monte Carlo approximation g(θ) ≈ (Q(s_t, a_t; ω) − b) ∇_θ log π(a_t|s_t; θ), and update the parameters θ with the gradient-ascent step θ ← θ + η g(θ).
46) Assign the traffic state s_{t+1} at time t+1 to s_t and repeat steps 451) to 455).
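The training procedure of steps 1) to 46) can be sketched end-to-end on a toy problem. The tiny linear networks, the one-dimensional state, and the uniform (rather than prioritized) replay sampling are simplifications of this sketch, not the patent's architecture; what it preserves is the structure: replay pool, TD target from a soft-updated target network, and a baselined policy-gradient ascent step.

```python
import math, random
from collections import deque

def softmax(prefs):
    mx = max(prefs)
    exps = [math.exp(p - mx) for p in prefs]
    s = sum(exps)
    return [e / s for e in exps]

class ActorCritic:
    """Toy actor-critic: linear Q(s,a;w), soft-updated target copy w',
    and a softmax policy with parameters theta."""
    def __init__(self, n_actions=2, gamma=0.9, alpha=0.05, beta=0.9, eta=0.05):
        self.w = [0.0] * n_actions       # action-value network parameters (omega)
        self.w_tgt = [0.0] * n_actions   # target-value network parameters (omega')
        self.theta = [0.0] * n_actions   # policy network parameters (theta)
        self.gamma, self.alpha, self.beta, self.eta = gamma, alpha, beta, eta
        self.replay = deque(maxlen=500)  # experience pool E

    def act(self, state):
        probs = softmax([t * state for t in self.theta])
        return random.choices(range(len(probs)), weights=probs)[0]

    def update(self, batch_size=16):
        if len(self.replay) < batch_size:
            return
        batch = random.sample(self.replay, batch_size)  # uniform, not prioritized
        for s, a, r, s2 in batch:
            # TD target from the target network (steps 452-453)
            y = r + self.gamma * max(wt * s2 for wt in self.w_tgt)
            q = self.w[a] * s
            self.w[a] += self.alpha * (y - q) * s       # gradient step on the loss
            # Soft target update w' = beta*w' + (1-beta)*w (step 454)
            self.w_tgt = [self.beta * wt + (1 - self.beta) * w
                          for wt, w in zip(self.w_tgt, self.w)]
            # Policy gradient with baseline b (step 455)
            probs = softmax([t * s for t in self.theta])
            b = sum(p * (self.w[i] * s) for i, p in enumerate(probs))
            for i in range(len(self.theta)):
                grad = ((1 if i == a else 0) - probs[i]) * s
                self.theta[i] += self.eta * (q - b) * grad
```

Feeding this loop experiences in which one action is consistently better drives the policy's probability mass toward that action, which is the qualitative behaviour the patent relies on.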
According to a second aspect, the invention provides a traffic signal control system based on a stochastic policy gradient, comprising:
a first data acquisition module for obtaining static road network data of at least one controlled signalized intersection;
a simulation drawing module for visually drawing a traffic simulation road network according to the static road network data;
a second data acquisition module for acquiring real-time traffic operation state data of the at least one controlled signalized intersection;
a simulation calibration module for calibrating the simulation parameters of the traffic simulation road network according to the traffic operation state data to obtain an optimized traffic simulation road network;
an action evaluation module for inputting the traffic state observed from the optimized traffic simulation road network into the value network to obtain the evaluation value of each signal control scheme under the traffic state, and updating the parameters of the value network with a temporal-difference algorithm; the value network is a pre-constructed neural network used to approximate the action value function;
an action sampling module for inputting the traffic state into the policy network to obtain the probability value of each signal control scheme and randomly sampling according to these probability values to obtain one signal control scheme; the policy network is a pre-constructed neural network used to approximate the policy function;
and a signal control module for updating the parameters of the policy network through stochastic policy gradients based on the evaluation value of each signal control scheme under the traffic state and the one sampled signal control scheme.
According to a third aspect, the invention provides an electronic device comprising a memory and a processor communicatively connected to each other, wherein the memory stores computer instructions and the processor executes the computer instructions so as to perform the traffic signal control method based on a stochastic policy gradient of the first aspect.
According to a fourth aspect, the invention provides a computer-readable storage medium having stored thereon computer instructions for causing a computer to execute the traffic signal control method based on a stochastic policy gradient of the first aspect.
The invention has the following beneficial effects:
1. Compared with the parameterized representation of a Q-value function in the prior art, the traffic signal control method based on the stochastic policy gradient parameterizes the policy function more simply, converges better, learns more efficiently and accurately, and generally does not suffer from the dimension-explosion problem.
2. By continuously monitoring, diagnosing, modeling, and controlling the intersection in real time, the traffic signal control method based on the stochastic policy gradient can adapt to the nonlinearity, randomness, fuzziness, and uncertainty of a traffic system.
3. By adopting a convolutional neural network from deep learning, the traffic signal control method based on the stochastic policy gradient handles the problem of overly large raw traffic data and traffic states: the convolutional neural network takes raw high-dimensional data as input, combines low-level features into more abstract high-level features, and captures the hidden features in the high-dimensional traffic state, so control can be performed directly from the high-dimensional input; this improves the feature-representation capability of the state input matrix and strengthens the method's generalization over different traffic states.
4. Compared with traditional fixed-time control and actuated control, the traffic signal control method based on the stochastic policy gradient can respond in time to dynamic changes in traffic flow and optimize the signal timing scheme for real-time control, ultimately reducing travel delay in the road network and improving the road network's traffic efficiency.
Drawings
FIG. 1 is a flow chart of a traffic signal control method based on a stochastic policy gradient provided by the invention;
FIG. 2 is a schematic illustration of an exemplary traffic network in an embodiment of the invention;
FIG. 3 is a flowchart illustrating the detailed steps of step S400 in FIG. 1;
FIG. 4 is a functional block diagram of a traffic signal control system based on a stochastic policy gradient provided by the invention;
FIG. 5 is a schematic diagram of a hardware structure of an electronic device according to the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the description of the present invention, it should be noted that the terms "first", "second", and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
Example 1
Fig. 1 shows a flow chart of a traffic signal control method based on a stochastic policy gradient according to an embodiment of the invention. As shown in fig. 1, the method may include the following steps:
step S100: and acquiring static road network data of at least one control signalized intersection.
In an embodiment of the invention, the static road network data includes some or all of: road grade, number of lanes, lane width, lane function division, road length, road markings, intersection type, adjacent-intersection information, signal equipment number, phase information, and phase sequence information.
Step S200: and visually drawing a traffic simulation road network according to the static road network data.
In the embodiment of the invention, microscopic traffic simulation software, such as SUMO, can be used for drawing a traffic simulation road network.
In an embodiment of the invention, the traffic simulation road network comprises at least one controlled signalized intersection. Specifically, as shown in fig. 2, reference numerals 1 to 9 denote 9 signalized intersections, and fig. 2 as a whole is a traffic road network.
Step S300: and acquiring real-time traffic running state data of at least one control signalized intersection.
In the embodiment of the invention, the traffic operation state data may further include some or all of: device ID, detection time, traffic flow, vehicle type distribution, vehicle time occupancy, headway, vehicle acceleration/deceleration, vehicle space occupancy, vehicle speed, vehicle length, headway distance, queue length, and number of stops.
Step S400: calibrate the simulation parameters of the traffic simulation road network according to the traffic operation state data to obtain the optimized traffic simulation road network.
In the embodiment of the invention, one or more items of the traffic state data may be used to calibrate the simulation parameters of the traffic simulation road network. Taking the headway and the vehicle acceleration/deceleration as an example, as shown in fig. 3, step S400 may include the following steps:
Step S401: obtain the value ranges of the actual headway parameter and the vehicle acceleration/deceleration parameter, and preliminarily calibrate the headway and vehicle acceleration/deceleration parameters in the traffic simulation road network according to those ranges.
Step S402: observe the traffic simulation road network, acquire the simulated headway and vehicle acceleration/deceleration, and compare them with the actual headway and vehicle acceleration/deceleration.
Step S403: if the differences between the simulated and actual headway and between the simulated and actual vehicle acceleration/deceleration are within a preset range, finish the calibration to obtain the optimized traffic simulation road network.
If the differences are not within the preset range, repeat steps S401 to S403 until they are.
Step S500: input the traffic state observed from the optimized traffic simulation road network into the value network to obtain the evaluation value of each signal control scheme under the traffic state, and update the parameters of the value network with a temporal-difference algorithm. In the embodiment of the invention, the value network is a pre-constructed convolutional neural network used to approximate the action value function.
In an embodiment of the invention, the value network may be a convolutional neural network for approximating the action value function, comprising an input layer, a convolutional layer, a fully-connected layer, and an output layer; it may also be a double-delay deep Q network, including an action value network Q_ω for selecting actions and a target value network Q_ω' for calculating the Q value, where the parameter ω = [ω_1, …, ω_i, …, ω_N] denotes the N parameters of the action value network, and the parameter ω' = [ω'_1, …, ω'_i, …, ω'_N] denotes the N parameters of the target value network.
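The relation between the action value network parameters ω and the target value network parameters ω' can be illustrated with a small numpy sketch. The linear Q function here is a placeholder for the convolutional network described above; only the soft-update rule ω' = βω' + (1-β)ω comes from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_actions = 4, 2

omega = rng.normal(size=(n_features, n_actions))   # action value network parameters
omega_target = omega.copy()                        # target value network parameters

def q_values(w, state):
    """Linear stand-in for Q(s, .) over the action set."""
    return state @ w

def soft_update(w_target, w, beta=0.001):
    """Target network update from the text: omega' = beta*omega' + (1-beta)*omega."""
    return beta * w_target + (1 - beta) * w
```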
Step S600: and inputting the traffic state into a policy network to obtain the probability value of each signal control scheme, and randomly sampling according to the probability value of each signal control scheme to obtain a signal control scheme. In the embodiment of the invention, the strategy network is a pre-constructed convolutional neural network and is used for approximating the strategy function. In an embodiment of the present invention, a policy network includes an input layer, a convolutional layer, a fully-connected layer, and an output layer.
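The random sampling in step S600 amounts to drawing one scheme from the probability distribution the policy network outputs. A minimal sketch, assuming the network's raw outputs are logits passed through a softmax (the logit values are illustrative):

```python
import numpy as np

def sample_scheme(logits, rng):
    """Turn policy network outputs into per-scheme probabilities (softmax)
    and draw one signal control scheme by random sampling."""
    z = logits - np.max(logits)            # subtract max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    action = rng.choice(len(probs), p=probs)
    return action, probs

rng = np.random.default_rng(42)
action, probs = sample_scheme(np.array([2.0, 1.0, 0.5]), rng)
```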
Step S700: and updating parameters of the strategy network through random strategy gradients based on the evaluation value of each signal control scheme in the traffic state and one signal control scheme.
In the embodiment of the present invention, the traffic state is represented as the maximum number of queued vehicles for each phase j at each signalized intersection, specifically:

    o_t^i = ( max_{l∈L_1} q_{t,l}, …, max_{l∈L_j} q_{t,l}, …, max_{l∈L_n} q_{t,l} )

where o_t^i represents the observed value of the traffic state of signalized intersection i at decision time t; i represents the number of each signalized intersection, i ∈ {1, 2, …, N}; j represents the phase number, j ∈ {1, 2, …, n}; t is the decision time; l is the lane number; L_j denotes the set of lanes that may pass during phase j; and q_{t,l} denotes the number of vehicles queued on lane l at decision time t.

The number of vehicles queued on lane l at decision time t equals the number queued at decision time t-1 plus or minus the number of vehicles joining or leaving the queue at decision time t, specifically:

    q_{t,l} = q_{t-1,l} + Σ_{v∈V_{t,l}} v

where q_{t-1,l} denotes the number of vehicles queued on lane l at decision time t-1; V_{t,l} denotes the set of vehicles driving into lane l at decision time t; and v indicates whether a given vehicle joins or leaves the queue at decision time t, judged as follows:

    v = 1,  if sp_{t-1,v} > Sp_Thr and sp_{t,v} ≤ Sp_Thr (the vehicle joins the queue)
    v = -1, if sp_{t-1,v} ≤ Sp_Thr and sp_{t,v} > Sp_Thr (the vehicle leaves the queue)
    v = 0,  otherwise

where sp_{t-1,v} and sp_{t,v} are the speeds of vehicle v at decision time t-1 and decision time t, respectively, and Sp_Thr is the speed threshold for judging whether a vehicle has joined the queue.

The joint state of the signalized intersections is expressed as the vector of the observed values of each signalized intersection, specifically:

    s_t = ( o_t^1, o_t^2, …, o_t^N )

where o_t^i is the observed value of the i-th signalized intersection at decision time t.
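The speed-threshold rule for counting queued vehicles can be sketched as follows. The threshold value is illustrative; the +1/-1 case split is one plausible reading of the join/leave judgment described above.

```python
SP_THR = 1.0  # m/s, illustrative speed threshold Sp_Thr

def queue_delta(prev_speed, speed, thr=SP_THR):
    """+1 if the vehicle drops below the threshold (joins the queue),
    -1 if it rises above it (leaves the queue), 0 otherwise."""
    if prev_speed > thr >= speed:
        return 1
    if prev_speed <= thr < speed:
        return -1
    return 0

def update_queue(q_prev, speed_pairs):
    """q_{t,l} = q_{t-1,l} plus the per-vehicle deltas on lane l;
    speed_pairs holds (speed at t-1, speed at t) for each vehicle."""
    return q_prev + sum(queue_delta(p, s) for p, s in speed_pairs)
```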
In the embodiment of the present invention, the signal control scheme is divided into action selection with a fixed phase order and action selection with a variable phase order, according to whether the order of the phases changes.

For a fixed phase order, the policy network at signalized intersection i has two optional actions at the decision time t of each iteration: a_t^i = 0 indicates continuing the current phase, and a_t^i = 1 indicates ending the current phase and switching to the next phase, specifically:

    a_t^i ∈ {0, 1}

For a variable phase order, the policy network at signalized intersection i has n optional actions at the decision time t of each iteration, one per phase, specifically:

    a_t^i ∈ {1, 2, …, n}

If at decision time t the policy network at signalized intersection i decides to continue the current phase, i.e. a_t^i = a_{p(t)}^i, that is, the action at time t is the same as the action at the last decision time p(t), the duration of the current phase is extended by m seconds, with m taking 1 to 5 seconds, and the next decision time for judging whether to switch the phase is t + m. If at time t it is decided to end the current phase and switch to the next phase, i.e. a_t^i ≠ a_{p(t)}^i, that is, the action at time t differs from the action at the last decision time p(t), a minimum green time G_min of the phase is first released, then the intermediate phase yellow time Y, after which the phase is switched to the next phase, and the next decision time for judging whether to switch the phase is t + G_min + Y + m.
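The decision-timing rule above reduces to two arithmetic cases. A minimal sketch, with illustrative default values for m, G_min, and Y:

```python
def next_decision_time(t, same_action, m=3, g_min=10, y=3):
    """If the action repeats the last decision (continue the phase),
    extend the phase by m seconds and decide again at t + m; otherwise
    run the minimum green g_min and the yellow time y before switching,
    deciding again at t + g_min + y + m."""
    if same_action:
        return t + m
    return t + g_min + y + m
```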
In the embodiment of the invention, the evaluation value calculation function of the value network at signalized intersection i is defined as the reduction in the maximum number of queued vehicles, specifically:

    r_t^i = max_{l} L_{p(t),l} - max_{l} L_{t,l}

where L_{p(t),l} and L_{t,l} are the numbers of vehicles queued on lane l at decision time p(t) and decision time t, respectively.
Alternatively, the evaluation value calculation function of the value network at signalized intersection i is defined as the reduction in the maximum total delay, specifically:

    r_t^i = max_{v} Cd_{p(t),v} - max_{v} Cd_{t,v}

where Cd_{p(t),v} and Cd_{t,v} are the accumulated total delays of the queued vehicles at decision time p(t) and decision time t, respectively.
The joint evaluation value calculation function of the value networks at a plurality of signalized intersections is expressed as a coupling of the evaluation value calculation function of the value network at each signalized intersection, specifically:

    R_t^i = ( n · r_t^i + Σ_{j∈J(i)} r_t^j ) / ( n + |J(i)| )

where J(i) is the set of value networks other than the value network of signalized intersection i. In the joint reward function, n is a non-negative constant: when n = 0, the value network of signalized intersection i considers only the evaluation values of the other value networks in J(i), and the larger n is, the more the value network of signalized intersection i considers only its own local evaluation value.
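The coupling can be sketched as a weighted average of the local reward and the other intersections' rewards. This is one plausible form, chosen only to be consistent with the stated limiting behaviour of the constant n, not necessarily the patent's exact formula:

```python
def joint_reward(local_r, neighbor_rs, n=1.0):
    """Couple intersection i's local reward with the rewards of the other
    value networks J(i): n = 0 keeps only the neighbours' rewards, while a
    large n keeps essentially only the local reward."""
    if not neighbor_rs:
        return local_r
    return (n * local_r + sum(neighbor_rs)) / (n + len(neighbor_rs))
```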
In the embodiment of the invention, when the value network is a double-delay deep Q network, the specific steps of training the value network and the strategy network are as follows:
1) inputting the reinforcement learning related parameters: the experience pool capacity max_size, the small batch size batch_size, the discount rate γ, the action value network learning rate α, the target value network learning rate β, the policy network learning rate η, and the termination iteration number N.
In the embodiment of the present invention, the specific values of the parameters may be set according to the needs of the actual application scenario and the experience of the user; here, a set of specific parameter values is provided to help those skilled in the art understand the technical solution: the experience pool capacity max_size is set to 100,000; the small batch size batch_size is set to 32; the discount rate γ is set to 0.75; the action value network learning rate α is set to 0.0002; the target value network learning rate β is set to 0.001; the policy network learning rate η is set to 0.0002; and the termination iteration number N is set to 450,000. Specifically, as shown in the following table:
    Parameter                                  | Value
    experience pool capacity max_size          | 100,000
    small batch size batch_size                | 32
    discount rate γ                            | 0.75
    action value network learning rate α       | 0.0002
    target value network learning rate β       | 0.001
    policy network learning rate η             | 0.0002
    termination iteration number N             | 450,000
2) initializing the elements in the experience pool E, the parameter ω of the action value network Q_ω, the parameter ω' of the target value network Q_ω', and the parameter θ of the policy network π_θ;
3) obtaining the joint traffic state s_t formed by the observed values o_t^i of each signalized intersection i at time t, and the current phase state current_phase;
4) when the iteration number i is less than the termination iteration number N, executing the following steps:
41) calculating the probability distribution according to the policy network π_θ, and randomly sampling according to the probability distribution to obtain the signal control scheme a_t^i;
42) when the signal control scheme a_t^i is the current phase state current_phase, extending the duration of the current phase by m seconds; when the signal control scheme a_t^i is not the current phase state current_phase, releasing a minimum green time G_min of the current phase and, after the intermediate phase yellow time Y has finished, switching to the j-th phase;
43) calculating the evaluation value r_t^i of the value network at signalized intersection i, constructing the joint evaluation value calculation function R_t^i, and calculating the joint traffic state s_{t+1} formed by the observed values o_{t+1}^i of each signalized intersection i at time t+1;
44) when the experience pool E is at its maximum capacity max_size, removing the earliest experience from the experience pool E; otherwise, putting the experience (s_t, a_t^i, R_t^i, s_{t+1}) into the experience pool E;
45) when the number of experiences in the experience pool is larger than the small batch size batch_size, executing the following steps:
451) randomly sampling a small batch from the experience pool E according to the experience priority values;
452) for each small-batch experience sample (s_t, a_t^i, R_t^i, s_{t+1}), respectively calculating the value of the action value network Q_ω and the value of the target value network Q_ω' at signalized intersection i, and obtaining the baseline value b;
453) calculating the value of the loss function (the mean squared error between the target Q value and the output of the action value network) and minimizing the loss function by the gradient descent method of the Adam optimizer to update the parameter ω;
454) updating the parameter ω' of the target value network Q_ω' according to ω' = βω' + (1-β)ω;
455) for each small-batch experience sample (s_t, a_t^i, R_t^i, s_{t+1}), calculating the stochastic policy gradient ∇_θ J(θ) of the policy network π_θ based on the Monte Carlo approximation method, and updating the parameter θ using a gradient ascent algorithm;
46) assigning the traffic state s_{t+1} at time t+1 to s_t, and repeating steps 451) to 455).
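Steps 1)-46) can be condensed into the following actor-critic training skeleton. This is a sketch only: the environment, the tabular Q table standing in for the convolutional networks, the reward, and all sizes are toy assumptions; what it shows is the control flow of the patent's loop (experience replay, TD target from the target network, soft update, policy-gradient ascent).

```python
import numpy as np
from collections import deque

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2
gamma, alpha, beta, eta = 0.75, 0.01, 0.001, 0.01   # discount and learning rates
max_size, batch_size = 1000, 32

Q = np.zeros((n_states, n_actions))       # action value network (tabular stand-in)
Q_target = Q.copy()                       # target value network
theta = np.zeros((n_states, n_actions))   # policy network logits
pool = deque(maxlen=max_size)             # experience pool E

def policy(s):
    """Softmax over the policy logits for state s."""
    z = theta[s] - theta[s].max()
    return np.exp(z) / np.exp(z).sum()

def step_env(s, a):
    """Toy environment stand-in for the traffic simulation road network."""
    s_next = rng.integers(n_states)
    reward = 1.0 if a == s % n_actions else 0.0
    return reward, s_next

s = rng.integers(n_states)
for it in range(2000):                              # 4) main loop over iterations
    a = rng.choice(n_actions, p=policy(s))          # 41) sample a control scheme
    r, s_next = step_env(s, a)                      # 42)-43) act, observe reward/state
    pool.append((s, a, r, s_next))                  # 44) store the experience
    if len(pool) > batch_size:                      # 45)
        idx = rng.choice(len(pool), batch_size, replace=False)   # 451) sample batch
        for k in idx:
            st, at, rt, st1 = pool[k]
            target = rt + gamma * Q_target[st1].max()            # 452)-453) TD target
            Q[st, at] += alpha * (target - Q[st, at])            # minimize TD error
            Q_target = beta * Q_target + (1 - beta) * Q          # 454) soft update
            adv = target - Q[st].mean()                          # baseline b
            grad = -policy(st)                                   # 455) grad of log pi
            grad[at] += 1.0
            theta[st] += eta * adv * grad                        # gradient ascent
    s = s_next                                      # 46) s_t <- s_{t+1}
```

The per-sample softmax policy gradient uses the identity ∇_θ log π(a|s) = onehot(a) - π(·|s); with the TD target minus a state baseline as the advantage, this is the stochastic policy gradient update the text describes.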
Example 2
Fig. 4 is a schematic block diagram of a traffic signal control system based on a random policy gradient according to an embodiment of the present invention, which may be used to implement the traffic signal control method based on a random policy gradient according to embodiment 1 or any alternative implementation manner thereof. As shown in fig. 4, the system includes: a first data acquisition module 10, a simulation drawing module 20, a second data acquisition module 30, a simulation checking module 40, an action evaluation module 50, an action sampling module 60, and a signal control module 70. Wherein:
the first data acquisition module 10 is configured to acquire static road network data of at least one control signalized intersection.
The simulation drawing module 20 is used for drawing the traffic simulation road network according to the static road network data in a visualized manner.
The second data acquiring module 30 is configured to acquire real-time traffic operation state data of at least one control signalized intersection.
The simulation checking module 40 is configured to perform parameter checking on simulation parameters in the traffic simulation road network according to the traffic operation state data, so as to obtain an optimized traffic simulation road network.
The action evaluation module 50 is configured to input the traffic state obtained by observing the optimized traffic simulation road network into the value network, obtain an evaluation value of each signal control scheme in the traffic state, and update parameters of the value network by using a time difference algorithm. In the embodiment of the invention, the value network is a pre-constructed neural network used for approximating the action value function.
The action sampling module 60 is configured to input the traffic status into the policy network, obtain a probability value of each signal control scheme, and perform random sampling according to the probability value of each signal control scheme to obtain one signal control scheme. In the embodiment of the invention, the strategy network is a pre-constructed neural network and is used for approximating the strategy function.
The signal control module 70 is configured to update parameters of the policy network through a random policy gradient based on the evaluation value of each signal control scheme in the traffic state and one signal control scheme.
An embodiment of the present invention further provides an electronic device, as shown in fig. 5, the electronic device may include a processor 51 and a memory 52, where the processor 51 and the memory 52 may be connected by a bus or in another manner, and fig. 5 takes the connection by the bus as an example.
The processor 51 may be a Central Processing Unit (CPU). The Processor 51 may also be other general purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, or combinations thereof.
The memory 52 is a non-transitory computer readable storage medium, and can be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules corresponding to the traffic signal control method based on the random strategy gradient in the embodiment of the present invention (e.g., the first data acquisition module 10, the simulation rendering module 20, the second data acquisition module 30, the simulation checking module 40, the action evaluation module 50, the action sampling module 60, and the signal control module 70 shown in fig. 4). The processor 51 executes various functional applications and data processing of the processor by running non-transitory software programs, instructions and modules stored in the memory 52, namely, implements the traffic signal control method based on the random strategy gradient in the above method embodiment.
The memory 52 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created by the processor 51, and the like. Further, the memory 52 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 52 may optionally include memory located remotely from the processor 51, and these remote memories may be connected to the processor 51 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The one or more modules are stored in the memory 52 and, when executed by the processor 51, perform a traffic signal control method based on random policy gradients as in the embodiments shown in fig. 1-3.
The details of the electronic device may be understood by referring to the corresponding descriptions and effects in the embodiments shown in fig. 1 to fig. 3, and are not described herein again.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic Disk, an optical Disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a Flash Memory (Flash Memory), a Hard Disk (Hard Disk Drive, abbreviated as HDD), a Solid State Drive (SSD), or the like; the storage medium may also comprise a combination of memories of the kind described above.
It should be understood that the above examples are given only for clarity of illustration and are not intended to limit the embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to exhaustively list all embodiments here, and obvious variations or modifications derived therefrom remain within the scope of the invention.

Claims (10)

1. A traffic signal control method based on a random strategy gradient is characterized by comprising the following steps:
obtaining static road network data of at least one control signalized intersection;
visually drawing a traffic simulation road network according to the static road network data;
acquiring real-time traffic running state data of the at least one control signalized intersection;
performing parameter checking on simulation parameters in the traffic simulation road network according to the traffic running state data to obtain an optimized traffic simulation road network;
inputting the traffic state obtained by observing the optimized traffic simulation road network into a value network to obtain an evaluation value of each signal control scheme under the traffic state, and updating parameters of the value network by adopting a time difference algorithm; the value network is a pre-constructed convolutional neural network and is used for approximating an action value function;
inputting the traffic state into a policy network to obtain a probability value of each signal control scheme, and randomly sampling according to the probability value of each signal control scheme to obtain a signal control scheme; the strategy network is a pre-constructed convolutional neural network and is used for approximating a strategy function;
updating parameters of the policy network by a random policy gradient based on the evaluation value of each signal control scheme in the traffic state and the one signal control scheme.
2. The traffic signal control method based on the random strategy gradient according to claim 1, wherein the traffic operation state data includes a headway and an acceleration and deceleration of a vehicle, and the step of performing parameter checking on simulation parameters in the traffic simulation road network according to the traffic operation state data to obtain an optimized traffic simulation road network comprises the steps of:
acquiring the value ranges of the actual headway parameter and the vehicle acceleration and deceleration parameter, and preliminarily checking the headway parameter and the vehicle acceleration and deceleration parameter in the traffic simulation road network according to the value ranges;
observing the traffic simulation road network, acquiring a simulation headway and a vehicle acceleration and deceleration, and comparing and analyzing the simulation headway and the vehicle acceleration and deceleration with an actual headway and an actual vehicle acceleration and deceleration;
if the differences between the simulated and actual headway and between the simulated and actual vehicle acceleration and deceleration are within a preset range, finishing the parameter checking to obtain the optimized traffic simulation road network; otherwise, repeating the above steps until the differences are within the preset range.
3. The random policy gradient-based traffic signal control method according to claim 2, wherein the static road network data includes part or all of road grade, number of lanes, lane width, lane functional division, road segment length, road marking, intersection type, adjacent intersection information, signal equipment number, phase information, and phase sequence information;
the traffic operation state data may further include a device ID, a detection time, a traffic flow, a vehicle type distribution, a vehicle time occupancy, a vehicle space occupancy, a vehicle speed, a vehicle length, a vehicle head interval, a queuing length, and a number of times of parking, in part or in whole.
4. A traffic signal control method based on a random strategy gradient according to any one of claims 1-3, characterized in that the traffic state is represented as the maximum number of queued vehicles for each phase j at each signalized intersection, specifically:

    o_t^i = ( max_{l∈L_1} q_{t,l}, …, max_{l∈L_j} q_{t,l}, …, max_{l∈L_n} q_{t,l} )

where o_t^i represents the observed value of the traffic state of signalized intersection i at decision time t; i represents the number of each signalized intersection, i ∈ {1, 2, …, N}; j represents the phase number, j ∈ {1, 2, …, n}; t is the decision time; l is the lane number; L_j denotes the set of lanes that may pass during phase j; and q_{t,l} denotes the number of vehicles queued on lane l at decision time t;

the number of vehicles queued on lane l at decision time t equals the number queued at decision time t-1 plus or minus the number of vehicles joining or leaving the queue at decision time t, specifically:

    q_{t,l} = q_{t-1,l} + Σ_{v∈V_{t,l}} v

where q_{t-1,l} denotes the number of vehicles queued on lane l at decision time t-1; V_{t,l} denotes the set of vehicles driving into lane l at decision time t; and v indicates whether a given vehicle joins or leaves the queue at decision time t, judged as follows:

    v = 1,  if sp_{t-1,v} > Sp_Thr and sp_{t,v} ≤ Sp_Thr (the vehicle joins the queue)
    v = -1, if sp_{t-1,v} ≤ Sp_Thr and sp_{t,v} > Sp_Thr (the vehicle leaves the queue)
    v = 0,  otherwise

where sp_{t-1,v} and sp_{t,v} are the speeds of vehicle v at decision time t-1 and decision time t, respectively, and Sp_Thr is the speed threshold for judging whether a vehicle has joined the queue;

the joint state of the signalized intersections is expressed as the vector of the observed values of each signalized intersection, specifically:

    s_t = ( o_t^1, o_t^2, …, o_t^N )

where o_t^i is the observed value of the i-th signalized intersection at decision time t.
5. The random strategy gradient-based traffic signal control method according to claim 4, wherein the signal control scheme is divided into action selection with a fixed phase order and action selection with a variable phase order, according to whether the order of the phases changes;

for a fixed phase order, the policy network at signalized intersection i has two optional actions at the decision time t of each iteration: a_t^i = 0 indicates continuing the current phase, and a_t^i = 1 indicates ending the current phase and switching to the next phase, specifically:

    a_t^i ∈ {0, 1}

for a variable phase order, the policy network at signalized intersection i has n optional actions at the decision time t of each iteration, specifically:

    a_t^i ∈ {1, 2, …, n}

if at decision time t the policy network at signalized intersection i decides to continue the current phase, i.e. a_t^i = a_{p(t)}^i, that is, the action at time t is the same as the action at the last decision time p(t), the duration of the current phase is extended by m seconds, with m taking 1 to 5 seconds, and the next decision time for judging whether to switch the phase is t + m; if at time t it is decided to end the current phase and switch to the next phase, i.e. a_t^i ≠ a_{p(t)}^i, that is, the action at time t differs from the action at the last decision time p(t), a minimum green time G_min of the phase is first released, then the intermediate phase yellow time Y, after which the phase is switched to the next phase, and the next decision time for judging whether to switch the phase is t + G_min + Y + m.
6. The traffic signal control method based on the random strategy gradient according to claim 5, wherein the evaluation value calculation function of the value network at signalized intersection i is defined as the reduction in the maximum number of queued vehicles, specifically:

    r_t^i = max_{l} L_{p(t),l} - max_{l} L_{t,l}

where L_{p(t),l} and L_{t,l} are the numbers of vehicles queued on lane l at decision time p(t) and decision time t, respectively;
or, the evaluation value calculation function of the value network at signalized intersection i is defined as the reduction in the maximum total delay, specifically:

    r_t^i = max_{v} Cd_{p(t),v} - max_{v} Cd_{t,v}

where Cd_{p(t),v} and Cd_{t,v} are the accumulated total delays of the queued vehicles at decision time p(t) and decision time t, respectively;
the joint evaluation value calculation function of the value networks at a plurality of signalized intersections is expressed as a coupling of the evaluation value calculation function of the value network at each signalized intersection, specifically:

    R_t^i = ( n · r_t^i + Σ_{j∈J(i)} r_t^j ) / ( n + |J(i)| )

where J(i) is the set of value networks other than the value network of signalized intersection i; in the joint reward function, n is a non-negative constant: when n = 0, the value network of signalized intersection i considers only the evaluation values of the other value networks in J(i), and the larger n is, the more the value network of signalized intersection i considers only its own local evaluation value.
7. The method of claim 6, wherein the value network is a double-delay deep Q network comprising an action value network Q_ω for selecting actions and a target value network Q_ω' for calculating the Q value; the parameter ω = [ω_1, …, ω_i, …, ω_N] denotes the N parameters of the action value network, and the parameter ω' = [ω'_1, …, ω'_i, …, ω'_N] denotes the N parameters of the target value network;

training the value network and the policy network comprises the following steps:
1) inputting the reinforcement learning related parameters: the experience pool capacity max_size, the small batch size batch_size, the discount rate γ, the action value network learning rate α, the target value network learning rate β, the policy network learning rate η, and the termination iteration number N;
2) initializing the elements in the experience pool E, the parameter ω of the action value network Q_ω, the parameter ω' of the target value network Q_ω', and the parameter θ of the policy network π_θ;
3) obtaining the joint traffic state s_t formed by the observed values o_t^i of each signalized intersection i at time t, and the current phase state current_phase;
4) when the iteration number i is less than the termination iteration number N, executing the following steps:
41) calculating the probability distribution according to the policy network π_θ, and randomly sampling according to the probability distribution to obtain the signal control scheme a_t^i;
42) when the signal control scheme a_t^i is the current phase state current_phase, extending the duration of the current phase by m seconds; when the signal control scheme a_t^i is not the current phase state current_phase, releasing a minimum green time G_min of the current phase and, after the intermediate phase yellow time Y has finished, switching to the j-th phase;
43) calculating the evaluation value r_t^i of the value network at signalized intersection i, constructing the joint evaluation value calculation function R_t^i, and calculating the joint traffic state s_{t+1} formed by the observed values o_{t+1}^i of each signalized intersection i at time t+1;
44) when the experience pool E is at its maximum capacity max_size, removing the earliest experience from the experience pool E; otherwise, putting the experience (s_t, a_t^i, R_t^i, s_{t+1}) into the experience pool E;
45) when the number of experiences in the experience pool is larger than the small batch size batch_size, executing the following steps:
451) randomly sampling a small batch from the experience pool E according to the experience priority values;
452) for each small-batch experience sample (s_t, a_t^i, R_t^i, s_{t+1}), respectively calculating the value of the action value network Q_ω and the value of the target value network Q_ω' at signalized intersection i, and obtaining the baseline value b;
453) calculating the value of the loss function (the mean squared error between the target Q value and the output of the action value network) and minimizing the loss function by the gradient descent method of the Adam optimizer to update the parameter ω;
454) updating the parameter ω' of the target value network Q_ω' according to ω' = βω' + (1-β)ω;
455) for each small-batch experience sample (s_t, a_t^i, R_t^i, s_{t+1}), calculating the stochastic policy gradient ∇_θ J(θ) of the policy network π_θ based on the Monte Carlo approximation method, and updating the parameter θ using a gradient ascent algorithm;
46) assigning the traffic state s_{t+1} at time t+1 to s_t, and repeating steps 451) to 455).
8. A traffic signal control system based on a stochastic strategy gradient, comprising:
the first data acquisition module is used for acquiring static road network data of at least one control signalized intersection;
the simulation drawing module is used for drawing a traffic simulation road network according to the static road network data in a visualized manner;
the second data acquisition module is used for acquiring real-time traffic running state data of the at least one control signalized intersection;
the simulation checking module is used for performing parameter checking on the simulation parameters in the traffic simulation road network according to the traffic running state data to obtain an optimized traffic simulation road network;
the action evaluation module is used for inputting the traffic state observed from the optimized traffic simulation road network into a value network to obtain the evaluation value of each signal control scheme under the traffic state, and updating the parameters of the value network by a temporal-difference algorithm; the value network is a pre-constructed neural network used for approximating an action value function;
the action sampling module is used for inputting the traffic state into a policy network to obtain the probability value of each signal control scheme, and randomly sampling according to the probability values to obtain a signal control scheme; the policy network is a pre-constructed neural network used for approximating a policy function;
and the signal control module is used for updating the parameters of the policy network through the random strategy gradient based on the evaluation value of each signal control scheme under the traffic state and the sampled signal control scheme.
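A minimal sketch, for illustration only, of the random sampling performed by the action sampling module above (the function name and the renormalization guard are illustrative assumptions; the probabilities would come from the policy network's output):

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so runs are reproducible

def sample_scheme(scheme_probabilities):
    # Draw the index of one signal control scheme at random, weighted by
    # the probability value the policy network assigns to each scheme.
    p = np.asarray(scheme_probabilities, dtype=float)
    p = p / p.sum()  # guard against floating-point rounding
    return int(rng.choice(len(p), p=p))
```

Sampling, rather than always taking the highest-probability scheme, is what makes the policy stochastic and preserves exploration while the policy network is being trained.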
9. An electronic device, comprising: a memory and a processor, the memory and the processor being communicatively connected to each other, the memory storing therein computer instructions, and the processor executing the computer instructions to perform the traffic signal control method based on random strategy gradients of any one of claims 1 to 7.
10. A computer-readable storage medium storing computer instructions for causing a computer to perform the random policy gradient-based traffic signal control method of any one of claims 1 to 7.
CN202011459044.2A 2020-12-11 2020-12-11 Traffic signal control method and system based on random strategy gradient and electronic equipment Active CN112614343B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011459044.2A CN112614343B (en) 2020-12-11 2020-12-11 Traffic signal control method and system based on random strategy gradient and electronic equipment
PCT/CN2021/124593 WO2022121510A1 (en) 2020-12-11 2021-10-19 Stochastic policy gradient-based traffic signal control method and system, and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011459044.2A CN112614343B (en) 2020-12-11 2020-12-11 Traffic signal control method and system based on random strategy gradient and electronic equipment

Publications (2)

Publication Number Publication Date
CN112614343A true CN112614343A (en) 2021-04-06
CN112614343B CN112614343B (en) 2022-08-19

Family

ID=75234428

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011459044.2A Active CN112614343B (en) 2020-12-11 2020-12-11 Traffic signal control method and system based on random strategy gradient and electronic equipment

Country Status (2)

Country Link
CN (1) CN112614343B (en)
WO (1) WO2022121510A1 (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115331428B (en) * 2022-07-05 2023-10-17 成利吉(厦门)智能股份有限公司 Traffic signal optimization method based on rule base
CN115171408B (en) * 2022-07-08 2023-05-30 华侨大学 Traffic signal optimization control method
CN115440042B (en) * 2022-09-02 2024-02-02 吉林大学 Multi-agent constraint strategy optimization-based signalless intersection cooperative control method
CN115762128B (en) * 2022-09-28 2024-03-29 南京航空航天大学 Deep reinforcement learning traffic signal control method based on self-attention mechanism
CN116597672B (en) * 2023-06-14 2024-02-13 南京云创大数据科技股份有限公司 Regional signal lamp control method based on multi-agent near-end strategy optimization algorithm
CN117151441B (en) * 2023-10-31 2024-01-30 长春工业大学 Permutation flow-shop scheduling method based on actor-critic algorithm
CN117173914B (en) * 2023-11-03 2024-01-26 中泰信合智能科技有限公司 Road network signal control unit decoupling method, device and medium for simplifying complex model
CN117671977A (en) * 2024-02-01 2024-03-08 银江技术股份有限公司 Signal lamp control method, system, device and medium for traffic trunk line

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE102008049568A1 (en) * 2008-09-30 2010-04-08 Siemens Aktiengesellschaft A method of optimizing traffic control at a traffic signal controlled node in a road traffic network
US8571743B1 (en) * 2012-04-09 2013-10-29 Google Inc. Control of vehicles based on auditory signals
CN105955930A (en) * 2016-05-06 2016-09-21 天津科技大学 Guidance-type policy search reinforcement learning algorithm
CN111311945B (en) * 2020-02-20 2021-07-09 南京航空航天大学 Driving decision system and method fusing vision and sensor information
CN111737826B (en) * 2020-07-17 2020-11-24 北京全路通信信号研究设计院集团有限公司 Rail transit automatic simulation modeling method and device based on reinforcement learning
CN112052456A (en) * 2020-08-31 2020-12-08 浙江工业大学 Deep reinforcement learning strategy optimization defense method based on multiple intelligent agents
CN112614343B (en) * 2020-12-11 2022-08-19 多伦科技股份有限公司 Traffic signal control method and system based on random strategy gradient and electronic equipment

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109215355A (en) * 2018-08-09 2019-01-15 北京航空航天大学 Single-point intersection signal timing optimization method based on deep reinforcement learning
CN109472984A (en) * 2018-12-27 2019-03-15 苏州科技大学 Signal control method, system and storage medium based on deep reinforcement learning
CN109559530A (en) * 2019-01-07 2019-04-02 大连理工大学 Multi-intersection signal lamp cooperative control method based on Q-value-transfer deep reinforcement learning
CN110047278A (en) * 2019-03-30 2019-07-23 北京交通大学 Adaptive traffic signal control system and method based on deep reinforcement learning
CN111833590A (en) * 2019-04-15 2020-10-27 北京京东尚科信息技术有限公司 Traffic signal lamp control method and device and computer readable storage medium
CN110428615A (en) * 2019-07-12 2019-11-08 中国科学院自动化研究所 Isolated intersection traffic signal control method, system and device based on deep reinforcement learning
CN110673602A (en) * 2019-10-24 2020-01-10 驭势科技(北京)有限公司 Reinforcement learning model, automatic vehicle driving decision method and vehicle-mounted device
CN110930734A (en) * 2019-11-30 2020-03-27 天津大学 Intelligent idle-time traffic indicator lamp control method based on reinforcement learning
CN111696370A (en) * 2020-06-16 2020-09-22 西安电子科技大学 Traffic light control method based on heuristic deep Q network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
柳爽 (Liu Shuang): "Single-point signal control method based on non-end-to-end reinforcement learning", China Excellent Master's Theses Database *
龙忠顺 (Long Zhongshun) et al.: "Single-intersection traffic signal control optimization based on deep reinforcement learning in a connected environment", Industrial Control Computer *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022121510A1 (en) * 2020-12-11 2022-06-16 多伦科技股份有限公司 Stochastic policy gradient-based traffic signal control method and system, and electronic device
CN113362618B (en) * 2021-06-03 2022-08-09 东南大学 Multi-mode traffic adaptive signal control method and device based on strategy gradient
CN113362618A (en) * 2021-06-03 2021-09-07 东南大学 Multi-mode traffic adaptive signal control method and device based on strategy gradient
CN114038217A (en) * 2021-10-28 2022-02-11 李迎 Traffic signal configuration and control method
CN114038217B (en) * 2021-10-28 2023-11-17 李迎 Traffic signal configuration and control method
CN114446066A (en) * 2021-12-30 2022-05-06 银江技术股份有限公司 Road signal control method and device
CN114613159A (en) * 2022-02-10 2022-06-10 北京箩筐时空数据技术有限公司 Traffic signal lamp control method, device and equipment based on deep reinforcement learning
CN114613159B (en) * 2022-02-10 2023-07-28 北京箩筐时空数据技术有限公司 Traffic signal lamp control method, device and equipment based on deep reinforcement learning
CN114743388A (en) * 2022-03-22 2022-07-12 中山大学·深圳 Multi-intersection signal self-adaptive control method based on reinforcement learning
CN115100850A (en) * 2022-04-21 2022-09-23 浙江省交通投资集团有限公司智慧交通研究分公司 Hybrid traffic flow control method, medium, and apparatus based on deep reinforcement learning
CN114898576A (en) * 2022-05-10 2022-08-12 阿波罗智联(北京)科技有限公司 Traffic control signal generation method and target network model training method
CN114898576B (en) * 2022-05-10 2023-12-19 阿波罗智联(北京)科技有限公司 Traffic control signal generation method and target network model training method
CN117275259A (en) * 2023-11-20 2023-12-22 北京航空航天大学 Multi-intersection cooperative signal control method based on field information backtracking
CN117275259B (en) * 2023-11-20 2024-02-06 北京航空航天大学 Multi-intersection cooperative signal control method based on field information backtracking

Also Published As

Publication number Publication date
CN112614343B (en) 2022-08-19
WO2022121510A1 (en) 2022-06-16

Similar Documents

Publication Publication Date Title
CN112614343B (en) Traffic signal control method and system based on random strategy gradient and electronic equipment
CN111696370B (en) Traffic light control method based on heuristic deep Q network
CN112669629B (en) Real-time traffic signal control method and device based on deep reinforcement learning
CN112289034A (en) Deep neural network robust traffic prediction method based on multi-mode space-time data
CN112907970B (en) Variable lane steering control method based on vehicle queuing length change rate
CN113223305B (en) Multi-intersection traffic light control method and system based on reinforcement learning and storage medium
WO2021051930A1 (en) Signal adjustment method and apparatus based on action prediction model, and computer device
CN114758497B (en) Adaptive parking lot variable entrance and exit control method, device and storage medium
CN114463997A (en) Signal-light-free intersection vehicle cooperative control method and system
CN114419884B (en) Self-adaptive signal control method and system based on reinforcement learning and phase competition
CN115862322A (en) Vehicle variable speed limit control optimization method, system, medium and equipment
CN115951587A (en) Automatic driving control method, device, equipment, medium and automatic driving vehicle
CN113724507A (en) Traffic control and vehicle guidance cooperation method and system based on deep reinforcement learning
CN113515892A (en) Multi-agent traffic simulation parallel computing method and device
US20230162539A1 (en) Driving decision-making method and apparatus and chip
Chentoufi et al. A hybrid particle swarm optimization and tabu search algorithm for adaptive traffic signal timing optimization
Zhang et al. PlanLight: learning to optimize traffic signal control with planning and iterative policy improvement
CN115472023B (en) Intelligent traffic light control method and device based on deep reinforcement learning
CN115019523B (en) Deep reinforcement learning traffic signal coordination optimization control method based on minimized pressure difference
CN115981302A (en) Vehicle following lane change behavior decision-making method and device and electronic equipment
CN115743168A (en) Model training method for lane change decision, target lane determination method and device
Chen et al. Traffic signal optimization control method based on adaptive weighted averaged double deep Q network
CN114299714B (en) Multi-turning-lane coordination control method based on off-policy reinforcement learning
CN114639255B (en) Traffic signal control method, device, equipment and medium
CN116564078A (en) Method and equipment for controlling intersection without signal lamp based on deep reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CB03 Change of inventor or designer information

Inventor after: Zheng Peiyu
Inventor after: Zhang Daoyang
Inventor after: Tao Gang
Inventor after: Chen Bo
Inventor after: Li Zhibin
Inventor after: Chen Bing
Inventor after: Yang Guang
Inventor before: Zheng Peiyu
Inventor before: Tao Gang
Inventor before: Chen Bo
Inventor before: Li Zhibin
Inventor before: Chen Bing
Inventor before: Yang Guang