Disclosure of Invention
The invention provides a traffic signal control method, system, and electronic device based on a random strategy gradient. It aims to solve the problems of the prior art, in which traditional reinforcement learning for traffic signal control can handle only a limited set of state-action pairs, cannot cope with a huge state space, and suffers dimension explosion when the state space grows too large, resulting in low policy-learning efficiency and low accuracy.
According to a first aspect, the invention provides a traffic signal control method based on a random strategy gradient, comprising the following steps:
obtaining static road network data of at least one control signalized intersection;
visually drawing a traffic simulation road network according to the static road network data;
acquiring real-time traffic running state data of at least one control signalized intersection;
performing parameter checking on simulation parameters in the traffic simulation road network according to the traffic running state data to obtain an optimized traffic simulation road network;
inputting the traffic state obtained by observing the optimized traffic simulation road network into a value network to obtain an evaluation value of each signal control scheme under the traffic state, and updating the parameters of the value network by adopting a temporal-difference algorithm; the value network is a pre-constructed convolutional neural network used for approximating an action value function;
inputting the traffic state into a policy network to obtain a probability value of each signal control scheme, and randomly sampling according to the probability values to obtain one signal control scheme; the policy network is a pre-constructed convolutional neural network used for approximating a policy function;
and updating the parameters of the policy network through the random strategy gradient based on the evaluation value of each signal control scheme under the traffic state and the sampled signal control scheme.
Optionally, the traffic operation state data includes the headway and the acceleration/deceleration of vehicles, and the step of performing parameter checking on the simulation parameters in the traffic simulation road network according to the traffic operation state data to obtain the optimized traffic simulation road network includes:
acquiring the value ranges of the actual headway parameter and the vehicle acceleration/deceleration parameter, and preliminarily checking the headway parameter and the vehicle acceleration/deceleration parameter in the traffic simulation road network according to the value ranges;
observing the traffic simulation road network, acquiring the simulated headway and vehicle acceleration/deceleration, and comparing them with the actual headway and vehicle acceleration/deceleration;
if the differences between the simulated and actual headway and between the simulated and actual vehicle acceleration/deceleration are within a preset range, finishing the parameter checking to obtain the optimized traffic simulation road network; otherwise, repeating the above steps until the differences are within the preset range.
Optionally, the static road network data includes part or all of road grade, number of lanes, lane width, lane function division, road length, road marking, intersection type, adjacent intersection information, signal equipment number, phase information, and phase sequence information;
the traffic operation state data may include part or all of a device ID, a detection time, a traffic volume, a vehicle type distribution, a vehicle time occupancy, a vehicle space occupancy, a vehicle speed, a vehicle length, a vehicle spacing, a queuing length, and a number of stops.
Optionally, the traffic state is expressed as the maximum number of queued vehicles for each phase j at each signalized intersection, specifically:

o_t^i = ( max_{l∈L_j} q_{t,l} ), j ∈ {1, 2, …, n}

In the formula, o_t^i represents the observed value of the traffic state of signalized intersection i at decision time t, where i is the number of the signalized intersection, i ∈ {1, 2, …, N}; j is the phase number, j ∈ {1, 2, …, n}; t is the decision time; l is the lane number; L_j denotes the set of lanes that may be traversed during phase j; and q_{t,l} denotes the number of queued vehicles on lane l at decision time t.
The number of queued vehicles on lane l at decision time t equals the number of queued vehicles on lane l at decision time t-1 plus or minus the number of vehicles joining or leaving the queue at decision time t, specifically:

q_{t,l} = q_{t-1,l} + Σ_{v∈V_{t,l}} δ_{t,v}

In the formula, q_{t-1,l} denotes the number of queued vehicles on lane l at decision time t-1; V_{t,l} denotes the set of vehicles driving into lane l at decision time t; and δ_{t,v} indicates whether vehicle v joins or leaves the queue at decision time t. Whether a vehicle joins or leaves the queue is judged as follows:
δ_{t,v} = +1, if sp_{t-1,v} > Sp_Thr and sp_{t,v} ≤ Sp_Thr (vehicle v joins the queue);
δ_{t,v} = -1, if sp_{t-1,v} ≤ Sp_Thr and sp_{t,v} > Sp_Thr (vehicle v leaves the queue);
δ_{t,v} = 0, otherwise.

In the formula, sp_{t-1,v} and sp_{t,v} are the speeds of vehicle v at decision time t-1 and decision time t, respectively; and Sp_Thr is the speed threshold for determining whether a vehicle has joined the queue.
The joint state of the signalized intersections is expressed as the vector of the observed values of the signalized intersections, specifically:

s_t = [o_t^1, …, o_t^i, …, o_t^N]

In the formula, o_t^i is the observed value of the ith signalized intersection at decision time t.
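The observation defined above can be sketched in code: per-lane queue membership is decided with the speed-threshold rule, and the per-phase maximum is taken over each phase's lane set. This is a minimal sketch; the threshold value, the lane/phase data, and all names are illustrative assumptions.

```python
# Illustrative computation of the observation o_t^i: delta_{t,v} per vehicle,
# then the maximum queued-vehicle count over the lanes of each phase.

SP_THR = 1.0  # assumed speed threshold (m/s) below which a vehicle is queued

def queue_delta(speed_prev, speed_now, thr=SP_THR):
    """delta_{t,v}: +1 joins the queue, -1 leaves the queue, 0 unchanged."""
    if speed_prev > thr >= speed_now:
        return 1
    if speed_prev <= thr < speed_now:
        return -1
    return 0

def observe(queues, phase_lanes):
    """o_t^i: maximum queued-vehicle count over the lanes L_j of each phase j."""
    return [max(queues[l] for l in lanes) for lanes in phase_lanes]

# Two phases over four lanes; q_{t,l} per lane:
queues = {0: 3, 1: 7, 2: 2, 3: 5}
phase_lanes = [[0, 1], [2, 3]]
print(observe(queues, phase_lanes))  # [7, 5]
```

The joint state s_t is then simply the list of such observations over all N intersections.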
Optionally, the signal control scheme is divided into action selection with a fixed phase order and action selection with a variable phase order, according to whether the order of the phases changes.

For a fixed phase order, the policy network at signalized intersection i has two optional actions at the decision time t of each iteration: when a_t^i = 0, the current phase is continued; when a_t^i = 1, the current phase is ended and switched to the next phase, specifically:

a_t^i ∈ {0, 1}

For a variable phase order, the policy network at signalized intersection i has n optional actions at the decision time t of each iteration, one for each phase, specifically:

a_t^i ∈ {1, 2, …, n}

If, at decision time t, the policy network at signalized intersection i decides to continue the current phase, i.e., a_t^i = a_{p(t)}^i, that is, the action at time t is the same as the action at the previous decision time p(t), the duration of the current phase is extended by m seconds, where m is 1 to 5 seconds, and the next decision time for judging whether to switch the phase is t + m. If, at time t, it decides to end the current phase and switch to the next phase, i.e., a_t^i ≠ a_{p(t)}^i, that is, the action at time t differs from the action at the previous decision time p(t), a minimum green time G_min of the phase is first released, then the intermediate yellow time Y elapses and the phase is switched to the next phase, and the next decision time for judging whether to switch the phase is t + G_min + Y + m.
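The decision-timing rule above can be sketched as follows; the function name and the numeric defaults for m, G_min, and Y are illustrative assumptions.

```python
# Sketch of the decision-timing rule: continuing a phase schedules the next
# decision m seconds later; switching first spends G_min plus the yellow time Y.

def next_decision_time(t, same_action, m=5, g_min=10, yellow=3):
    """Return the next time at which phase switching is reconsidered."""
    if same_action:                       # a_t^i == a_{p(t)}^i: extend by m s
        return t + m
    return t + g_min + yellow + m         # switch: release G_min, then Y, then m

print(next_decision_time(100, True))   # 105
print(next_decision_time(100, False))  # 118
```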
Optionally, the evaluation value calculation function of the value network at signalized intersection i is defined as the reduction in the maximum number of queued vehicles, specifically:

r_t^i = max_l L_{p(t),l} − max_l L_{t,l}

In the formula, L_{p(t),l} and L_{t,l} are the numbers of vehicles queued on lane l at decision time p(t) and decision time t, respectively.

Alternatively, the evaluation value calculation function of the value network at signalized intersection i is defined as the reduction in the maximum total delay, specifically:

r_t^i = max_v Cd_{p(t),v} − max_v Cd_{t,v}

In the formula, Cd_{p(t),v} and Cd_{t,v} are the accumulated total delays of the queued vehicles at decision time p(t) and decision time t, respectively.

The joint evaluation value calculation function of the value networks at a plurality of signalized intersections is expressed as the coupling of the evaluation value calculation functions of the value network at each signalized intersection, specifically:

R_t^i = n · r_t^i + Σ_{j∈J(i)} r_t^j

In the formula, J(i) is the set of value networks other than the value network of signalized intersection i. In the joint reward function, n is a non-negative constant: when n = 0, the value network of signalized intersection i considers only the evaluation values of the other value networks in J(i); the larger n is, the more the value network of signalized intersection i weights its own local evaluation value.
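Assuming a linear coupling consistent with the limiting behaviour described above (only the other intersections count when n = 0, and the local term dominates as n grows), the joint evaluation value can be sketched as:

```python
# Illustrative joint reward: the local reward of intersection i, weighted by
# the non-negative constant n, plus the rewards of the other intersections J(i).
# The linear form itself is an assumption, not quoted from the text.

def joint_reward(i, local_rewards, n=1.0):
    others = sum(r for j, r in enumerate(local_rewards) if j != i)
    return n * local_rewards[i] + others

rewards = [4.0, -1.0, 2.0]               # r_t^i per intersection
print(joint_reward(0, rewards, n=0))     # 1.0 -> only the other intersections
print(joint_reward(0, rewards, n=2.0))   # 9.0 -> local term dominates as n grows
```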
Optionally, the value network is a twin-delayed deep Q-network, comprising an action value network Q(s, a; ω) for selecting actions and a target value network Q(s, a; ω′) for calculating the target Q value, where the parameter vector ω = [ω_1, …, ω_i, …, ω_N] collects the N parameters of the action value network and the parameter vector ω′ = [ω′_1, …, ω′_i, …, ω′_N] collects the N parameters of the target value network.
Training the value network and the policy network comprises the following steps:
1) Input the reinforcement-learning parameters: the experience pool capacity max_size, the mini-batch size batch_size, the discount rate γ, the action value network learning rate α, the target value network learning rate β, the policy network learning rate η, and the termination iteration number N.
2) Initialize the elements in the experience pool E, the parameter ω of the action value network Q(s, a; ω), the parameter ω′ of the target value network Q(s, a; ω′), and the parameter θ of the policy network π(a | s; θ).
3) Obtain the joint traffic state s_t formed by the observed values o_t^i of each signalized intersection i at time t, together with the current phase current_phase.
4) While the iteration number i is less than the termination iteration number N, execute the following steps:
41) Calculate the probability distribution according to the policy network π(a | s_t; θ), and randomly sample a signal control scheme a_t from this distribution.
42) When the signal control scheme a_t is the current phase state current_phase, extend the current phase duration by m seconds; when the signal control scheme a_t is not the current phase state current_phase, release a minimum green time G_min of the current phase and, after the intermediate yellow time Y ends, switch to the jth phase.
43) Calculate the evaluation value r_t^i of the value network at each signalized intersection i, construct the joint evaluation value calculation function, and calculate the joint traffic state s_{t+1} formed by the observed values o_{t+1}^i of each signalized intersection i at time t+1.
44) When the experience pool E has reached its maximum capacity max_size, remove the oldest experience from the experience pool E; otherwise, put the experience (s_t, a_t, r_t, s_{t+1}) into the experience pool E.
45) When the size of the experience pool is larger than the mini-batch size batch_size, execute the following steps:
451) Randomly sample a mini-batch from the experience pool E according to the experience priority values.
452) For each experience sample in the mini-batch, calculate the value of the action value network Q(s, a; ω) and the value of the target value network Q(s, a; ω′) at signalized intersection i, and obtain the baseline value b.
453) Calculate the value of the loss function from the temporal-difference error, and minimize the loss function by the gradient-descent method of the Adam optimizer to update the parameter ω.
454) Update the parameter ω′ of the target value network according to ω′ = βω′ + (1 − β)ω.
455) For each experience sample in the mini-batch, compute the random strategy gradient of the policy network π(a | s; θ) based on the Monte Carlo approximation method, and update the parameter θ using a gradient-ascent algorithm.
46) Assign the traffic state s_{t+1} at time t+1 to s_t, and repeat steps 41) to 455).
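The loop of steps 44) to 455) can be sketched on a toy problem. This is an illustrative sketch only: the convolutional networks are replaced by tables, priority-based sampling is simplified to uniform sampling, the target-network update of step 454) is omitted, and the environment, seed, and hyperparameters are all assumptions.

```python
import math
import random
from collections import deque

random.seed(0)
N_STATES, N_ACTIONS = 4, 2
GAMMA, ALPHA, ETA, BATCH = 0.75, 0.1, 0.05, 8

Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]      # action values (omega)
theta = [[0.0] * N_ACTIONS for _ in range(N_STATES)]  # policy parameters

pool = deque(maxlen=32)   # experience pool E with max_size eviction (step 44)

def policy(s):
    """Softmax action probabilities pi(a|s; theta)."""
    z = [math.exp(x) for x in theta[s]]
    tot = sum(z)
    return [p / tot for p in z]

def step_env(s, a):
    """Stand-in environment: rewards the action matching s % 2."""
    return (s + a + 1) % N_STATES, 1.0 if a == s % 2 else -1.0

s = 0
for _ in range(500):
    a = random.choices(range(N_ACTIONS), weights=policy(s))[0]  # step 41
    s2, r = step_env(s, a)
    pool.append((s, a, r, s2))                                  # step 44
    if len(pool) > BATCH:                                       # step 45
        for (bs, ba, br, bs2) in random.sample(list(pool), BATCH):  # step 451
            td = br + GAMMA * max(Q[bs2]) - Q[bs][ba]   # TD error (step 453)
            Q[bs][ba] += ALPHA * td                     # value update
            probs = policy(bs)
            # advantage relative to a baseline b = probability-weighted mean Q
            adv = Q[bs][ba] - sum(p * q for p, q in zip(probs, Q[bs]))
            for a2 in range(N_ACTIONS):                 # gradient ascent (455)
                grad = (1.0 if a2 == ba else 0.0) - probs[a2]
                theta[bs][a2] += ETA * adv * grad
    s = s2                                              # step 46

# Indices of the preferred action per state after training.
print([max(range(N_ACTIONS), key=lambda a: theta[s_][a]) for s_ in range(N_STATES)])
```

After training, the policy prefers the rewarded action in each state, and the value table reflects the same ordering.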
According to a second aspect, the present invention provides a traffic signal control system based on a stochastic strategy gradient, comprising:
the first data acquisition module is used for acquiring static road network data of at least one control signalized intersection;
the simulation drawing module is used for drawing the traffic simulation road network according to the static road network data in a visualized manner;
the second data acquisition module is used for acquiring real-time traffic running state data of at least one control signalized intersection;
the simulation checking module is used for performing parameter checking on simulation parameters in the traffic simulation road network according to the traffic running state data to obtain an optimized traffic simulation road network;
the action evaluation module is used for inputting the traffic state obtained by observing the optimized traffic simulation road network into the value network to obtain the evaluation value of each signal control scheme under the traffic state, and updating the parameters of the value network by adopting a temporal-difference algorithm; the value network is a pre-constructed neural network used for approximating an action value function;
the action sampling module is used for inputting the traffic state into the policy network to obtain the probability value of each signal control scheme, and randomly sampling according to the probability values to obtain one signal control scheme; the policy network is a pre-constructed neural network used for approximating a policy function;
and the signal control module is used for updating the parameters of the policy network through the random strategy gradient based on the evaluation value of each signal control scheme under the traffic state and the sampled signal control scheme.
According to a third aspect, the invention provides an electronic device comprising a memory and a processor communicatively connected with each other, wherein the memory stores computer instructions, and the processor executes the computer instructions so as to perform the traffic signal control method based on the random strategy gradient of the first aspect.
According to a fourth aspect, the present invention provides a computer-readable storage medium having stored thereon computer instructions for causing a computer to execute the traffic signal control method based on the random strategy gradient of the first aspect.
The invention has the beneficial effects that:
1. Compared with the parameterized representation of the Q value function in the prior art, the traffic signal control method based on the random strategy gradient provided by the invention parameterizes the policy function more simply, converges better, achieves higher learning efficiency and accuracy, and generally does not suffer from the dimension-explosion problem.
2. The traffic signal control method based on the random strategy gradient provided by the invention can adapt to the nonlinearity, randomness, fuzziness and uncertainty of a traffic system by continuously monitoring, diagnosing, modeling and controlling the intersection in real time.
3. In the traffic signal control method based on the random strategy gradient, adopting a convolutional neural network from deep learning handles the problem that the original traffic data and traffic states are too large: the convolutional neural network takes original high-dimensional data as input, combines bottom-layer features into more abstract high-layer features, and captures hidden features in the high-dimensional traffic state, so that control can be performed directly from the input high-dimensional data, which improves the feature representation capability of the state input matrix and enhances the generalization of the method over different traffic states.
4. Compared with the traditional timing control and induction control, the traffic signal control method based on the random strategy gradient can respond to the dynamic change of the traffic flow in time, optimize the signal timing scheme for real-time control, finally reduce the driving delay of the road network and improve the traffic efficiency of the road network.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the description of the present invention, it should be noted that the terms "first", "second", and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
Example 1
Fig. 3 shows a flow chart of a traffic signal control method based on a random strategy gradient according to an embodiment of the present invention, and as shown in fig. 3, the method may include the following steps:
step S100: and acquiring static road network data of at least one control signalized intersection.
In an embodiment of the present invention, the static road network data includes part or all of road grade, number of lanes, lane width, lane function division, road length, road marking, intersection type, adjacent intersection information, signal device number, phase information, and phase sequence information.
Step S200: and visually drawing a traffic simulation road network according to the static road network data.
In the embodiment of the invention, microscopic traffic simulation software, such as SUMO, can be used for drawing a traffic simulation road network.
In an embodiment of the invention, a traffic simulation road network comprises at least one control signalized intersection. Specifically, as shown in fig. 2, reference numerals 1 to 9 in the drawing denote 9 signal intersections, and the whole shown in fig. 2 is a traffic network.
Step S300: and acquiring real-time traffic running state data of at least one control signalized intersection.
In the embodiment of the present invention, the traffic operation state data may include part or all of the device ID, the detection time, the traffic flow, the vehicle type distribution, the vehicle time occupancy, the headway, the vehicle acceleration and deceleration, the vehicle space occupancy, the vehicle speed, the vehicle length, the vehicle spacing, the queuing length, and the number of stops.
Step S400: and performing parameter checking on the simulation parameters in the traffic simulation road network according to the traffic running state data to obtain the optimized traffic simulation road network.
In the embodiment of the present invention, one or more items of the traffic operation state data may be used to perform parameter checking on the simulation parameters in the traffic simulation road network. Taking the headway and the vehicle acceleration/deceleration as an example, as shown in fig. 3, step S400 may include the following steps:
Step S401: acquiring the value ranges of the actual headway parameter and the vehicle acceleration/deceleration parameter, and preliminarily checking the headway parameter and the vehicle acceleration/deceleration parameter in the traffic simulation road network according to the value ranges.
Step S402: observing the traffic simulation road network, acquiring the simulated headway and vehicle acceleration/deceleration, and comparing them with the actual headway and vehicle acceleration/deceleration.
Step S403: if the differences between the simulated and actual headway and between the simulated and actual vehicle acceleration/deceleration are within a preset range, finishing the parameter checking to obtain the optimized traffic simulation road network.
If the differences are not within the preset range, repeating steps S401 to S403 until they are.
Step S500: inputting the traffic state obtained by observing the optimized traffic simulation road network into the value network to obtain the evaluation value of each signal control scheme under the traffic state, and updating the parameters of the value network by adopting a temporal-difference algorithm. In the embodiment of the invention, the value network is a pre-constructed convolutional neural network used for approximating the action value function.
In an embodiment of the invention, the value network may be a convolutional neural network that only approximates the action value function, comprising an input layer, convolutional layers, fully-connected layers, and an output layer; or it may be a twin-delayed deep Q-network, comprising an action value network Q(s, a; ω) for selecting actions and a target value network Q(s, a; ω′) for calculating the target Q value, where the parameter vector ω = [ω_1, …, ω_i, …, ω_N] collects the N parameters of the action value network and the parameter vector ω′ = [ω′_1, …, ω′_i, …, ω′_N] collects the N parameters of the target value network.
Step S600: inputting the traffic state into the policy network to obtain the probability value of each signal control scheme, and randomly sampling according to the probability values to obtain one signal control scheme. In the embodiment of the invention, the policy network is a pre-constructed convolutional neural network used for approximating the policy function, and includes an input layer, convolutional layers, fully-connected layers, and an output layer.
Step S700: updating the parameters of the policy network through the random strategy gradient based on the evaluation value of each signal control scheme under the traffic state and the sampled signal control scheme.
In the embodiment of the present invention, the traffic state is represented as the maximum number of queued vehicles for each phase j at each signalized intersection, specifically:

o_t^i = ( max_{l∈L_j} q_{t,l} ), j ∈ {1, 2, …, n}

In the formula, o_t^i represents the observed value of the traffic state of signalized intersection i at decision time t, where i is the number of the signalized intersection, i ∈ {1, 2, …, N}; j is the phase number, j ∈ {1, 2, …, n}; t is the decision time; l is the lane number; L_j denotes the set of lanes that may be traversed during phase j; and q_{t,l} denotes the number of queued vehicles on lane l at decision time t.
The number of queued vehicles on lane l at decision time t equals the number of queued vehicles on lane l at decision time t-1 plus or minus the number of vehicles joining or leaving the queue at decision time t, specifically:

q_{t,l} = q_{t-1,l} + Σ_{v∈V_{t,l}} δ_{t,v}

In the formula, q_{t-1,l} denotes the number of queued vehicles on lane l at decision time t-1; V_{t,l} denotes the set of vehicles driving into lane l at decision time t; and δ_{t,v} indicates whether vehicle v joins or leaves the queue at decision time t. Whether a vehicle joins or leaves the queue is judged as follows:
δ_{t,v} = +1, if sp_{t-1,v} > Sp_Thr and sp_{t,v} ≤ Sp_Thr (vehicle v joins the queue);
δ_{t,v} = -1, if sp_{t-1,v} ≤ Sp_Thr and sp_{t,v} > Sp_Thr (vehicle v leaves the queue);
δ_{t,v} = 0, otherwise.

In the formula, sp_{t-1,v} and sp_{t,v} are the speeds of vehicle v at decision time t-1 and decision time t, respectively; and Sp_Thr is the speed threshold for determining whether a vehicle has joined the queue.
The joint state of the signalized intersections is expressed as the vector of the observed values of the signalized intersections, specifically:

s_t = [o_t^1, …, o_t^i, …, o_t^N]

In the formula, o_t^i is the observed value of the ith signalized intersection at decision time t.
In the embodiment of the present invention, the signal control scheme is divided into action selection with a fixed phase order and action selection with a variable phase order, according to whether the order of the phases changes.

For a fixed phase order, the policy network at signalized intersection i has two optional actions at the decision time t of each iteration: when a_t^i = 0, the current phase is continued; when a_t^i = 1, the current phase is ended and switched to the next phase, specifically:

a_t^i ∈ {0, 1}

For a variable phase order, the policy network at signalized intersection i has n optional actions at the decision time t of each iteration, one for each phase, specifically:

a_t^i ∈ {1, 2, …, n}

If, at decision time t, the policy network at signalized intersection i decides to continue the current phase, i.e., a_t^i = a_{p(t)}^i, that is, the action at time t is the same as the action at the previous decision time p(t), the duration of the current phase is extended by m seconds, where m is 1 to 5 seconds, and the next decision time for judging whether to switch the phase is t + m. If, at time t, it decides to end the current phase and switch to the next phase, i.e., a_t^i ≠ a_{p(t)}^i, that is, the action at time t differs from the action at the previous decision time p(t), a minimum green time G_min of the phase is first released, then the intermediate yellow time Y elapses and the phase is switched to the next phase, and the next decision time for judging whether to switch the phase is t + G_min + Y + m.
In the embodiment of the invention, the evaluation value calculation function of the value network at signalized intersection i is defined as the reduction in the maximum number of queued vehicles, specifically:

r_t^i = max_l L_{p(t),l} − max_l L_{t,l}

In the formula, L_{p(t),l} and L_{t,l} are the numbers of vehicles queued on lane l at decision time p(t) and decision time t, respectively.

Alternatively, the evaluation value calculation function of the value network at signalized intersection i is defined as the reduction in the maximum total delay, specifically:

r_t^i = max_v Cd_{p(t),v} − max_v Cd_{t,v}

In the formula, Cd_{p(t),v} and Cd_{t,v} are the accumulated total delays of the queued vehicles at decision time p(t) and decision time t, respectively.

The joint evaluation value calculation function of the value networks at a plurality of signalized intersections is expressed as the coupling of the evaluation value calculation functions of the value network at each signalized intersection, specifically:

R_t^i = n · r_t^i + Σ_{j∈J(i)} r_t^j

In the formula, J(i) is the set of value networks other than the value network of signalized intersection i. In the joint reward function, n is a non-negative constant: when n = 0, the value network of signalized intersection i considers only the evaluation values of the other value networks in J(i); the larger n is, the more the value network of signalized intersection i weights its own local evaluation value.
In the embodiment of the invention, when the value network is a twin-delayed deep Q-network, the specific steps of training the value network and the policy network are as follows:
1) Input the reinforcement-learning parameters: the experience pool capacity max_size, the mini-batch size batch_size, the discount rate γ, the action value network learning rate α, the target value network learning rate β, the policy network learning rate η, and the termination iteration number N.
In the embodiment of the present invention, the specific parameter values may be set according to the needs of the actual application scenario and the experience of the user. Here, one set of specific parameter values is provided to help those skilled in the art understand the technical solution: the experience pool capacity max_size is set to 100,000; the mini-batch size batch_size is set to 32; the discount rate γ is set to 0.75; the action value network learning rate α is set to 0.0002; the target value network learning rate β is set to 0.001; the policy network learning rate η is set to 0.0002; and the termination iteration number N is set to 450,000, specifically as shown in the following table:
2) Initialize the elements in the experience pool E, the parameter ω of the action value network Q(s, a; ω), the parameter ω′ of the target value network Q(s, a; ω′), and the parameter θ of the policy network π(a | s; θ).
3) Obtain the joint traffic state s_t formed by the observed values o_t^i of each signalized intersection i at time t, together with the current phase current_phase.
4) While the iteration number i is less than the termination iteration number N, execute the following steps:
41) Calculate the probability distribution according to the policy network π(a | s_t; θ), and randomly sample a signal control scheme a_t from this distribution.
42) When the signal control scheme a_t is the current phase state current_phase, extend the current phase duration by m seconds; when the signal control scheme a_t is not the current phase state current_phase, release a minimum green time G_min of the current phase and, after the intermediate yellow time Y ends, switch to the jth phase.
43) Calculate the evaluation value r_t^i of the value network at each signalized intersection i, construct the joint evaluation value calculation function, and calculate the joint traffic state s_{t+1} formed by the observed values o_{t+1}^i of each signalized intersection i at time t+1.
44) When the experience pool E has reached its maximum capacity max_size, remove the oldest experience from the experience pool E; otherwise, put the experience (s_t, a_t, r_t, s_{t+1}) into the experience pool E.
45) When the size of the experience pool is larger than the mini-batch size batch_size, execute the following steps:
451) Randomly sample a mini-batch from the experience pool E according to the experience priority values.
452) For each experience sample in the mini-batch, calculate the value of the action value network Q(s, a; ω) and the value of the target value network Q(s, a; ω′) at signalized intersection i, and obtain the baseline value b.
453) Calculate the value of the loss function from the temporal-difference error, and minimize the loss function by the gradient-descent method of the Adam optimizer to update the parameter ω.
454) Update the parameter ω′ of the target value network according to ω′ = βω′ + (1 − β)ω.
455) For each experience sample in the mini-batch, compute the random strategy gradient of the policy network π(a | s; θ) based on the Monte Carlo approximation method, and update the parameter θ using a gradient-ascent algorithm.
46) Assign the traffic state s_{t+1} at time t+1 to s_t, and repeat steps 41) to 455).
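The example parameter values listed in step 1) above can be gathered into a single configuration mapping; the variable name and structure below are illustrative, not part of the invention.

```python
# Hyperparameter values from the embodiment, collected for training code.
hyperparams = {
    "max_size": 100_000,   # experience pool capacity
    "batch_size": 32,      # mini-batch size
    "gamma": 0.75,         # discount rate
    "alpha": 2e-4,         # action value network learning rate
    "beta": 1e-3,          # target value network learning rate
    "eta": 2e-4,           # policy network learning rate
    "N": 450_000,          # termination iteration number
}

# Basic sanity checks on the configuration.
assert 0.0 < hyperparams["gamma"] < 1.0
assert hyperparams["batch_size"] <= hyperparams["max_size"]
```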
Example 2
Fig. 4 is a schematic block diagram of a traffic signal control system based on a random strategy gradient according to an embodiment of the present invention, which may be used to implement the traffic signal control method based on a random strategy gradient of embodiment 1 or any alternative implementation thereof. As shown in fig. 4, the system includes: a first data acquisition module 10, a simulation drawing module 20, a second data acquisition module 30, a simulation checking module 40, an action evaluation module 50, an action sampling module 60, and a signal control module 70. Wherein:
the first data acquisition module 10 is configured to acquire static road network data of at least one control signalized intersection.
The simulation drawing module 20 is configured to visually draw the traffic simulation road network according to the static road network data.
The second data acquiring module 30 is configured to acquire real-time traffic operation state data of at least one control signalized intersection.
The simulation checking module 40 is configured to perform parameter checking on simulation parameters in the traffic simulation road network according to the traffic operation state data, so as to obtain an optimized traffic simulation road network.
The action evaluation module 50 is configured to input the traffic state obtained by observing the optimized traffic simulation road network into the value network, obtain an evaluation value of each signal control scheme in the traffic state, and update the parameters of the value network by using a time difference algorithm. In the embodiment of the invention, the value network is a pre-constructed convolutional neural network used to approximate the action value function.
The action sampling module 60 is configured to input the traffic state into the policy network, obtain a probability value of each signal control scheme, and perform random sampling according to the probability value of each signal control scheme to obtain one signal control scheme. In the embodiment of the invention, the policy network is a pre-constructed convolutional neural network used to approximate the policy function.
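The random sampling performed by the action sampling module 60 amounts to drawing one scheme index according to the policy network's probability output. A minimal Python sketch follows; the function name and the probability values are hypothetical, used only for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_signal_scheme(probs):
    """Draw one signal control scheme index at random according to the
    probability value of each scheme (as output by the policy network).
    Hypothetical helper, not from the original."""
    probs = np.asarray(probs, dtype=float)
    probs = probs / probs.sum()          # guard against rounding drift
    return int(rng.choice(len(probs), p=probs))

# e.g. three candidate schemes with probabilities 0.1, 0.6, 0.3
scheme = sample_signal_scheme([0.1, 0.6, 0.3])
```

Sampling rather than always taking the argmax keeps the policy stochastic, which is what makes the stochastic policy gradient of the signal control module well defined.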
The signal control module 70 is configured to update the parameters of the policy network through a random policy gradient based on the evaluation value of each signal control scheme in the traffic state and the sampled signal control scheme.
An embodiment of the present invention further provides an electronic device, as shown in fig. 5, the electronic device may include a processor 51 and a memory 52, where the processor 51 and the memory 52 may be connected by a bus or in another manner, and fig. 5 takes the connection by the bus as an example.
The processor 51 may be a central processing unit (CPU). The processor 51 may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or a combination thereof.
The memory 52 is a non-transitory computer-readable storage medium that can be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions/modules corresponding to the traffic signal control method based on the random policy gradient in the embodiment of the present invention (e.g., the first data acquisition module 10, the simulation drawing module 20, the second data acquisition module 30, the simulation checking module 40, the action evaluation module 50, the action sampling module 60, and the signal control module 70 shown in fig. 4). The processor 51 executes the various functional applications and data processing by running the non-transitory software programs, instructions and modules stored in the memory 52, thereby implementing the traffic signal control method based on the random policy gradient in the above method embodiment.
The memory 52 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created by the processor 51, and the like. Further, the memory 52 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 52 may optionally include memory located remotely from the processor 51, and these remote memories may be connected to the processor 51 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The one or more modules are stored in the memory 52 and, when executed by the processor 51, perform a traffic signal control method based on random policy gradients as in the embodiments shown in fig. 1-3.
The details of the electronic device may be understood by referring to the corresponding descriptions and effects in the embodiments shown in fig. 1 to fig. 3, and are not described herein again.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic Disk, an optical Disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a Flash Memory (Flash Memory), a Hard Disk (Hard Disk Drive, abbreviated as HDD), a Solid State Drive (SSD), or the like; the storage medium may also comprise a combination of memories of the kind described above.
It should be understood that the above embodiments are only for clarity of illustration and are not intended to limit the implementations. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to exhaustively list all embodiments here; obvious variations or modifications derived therefrom remain within the scope of the invention.