CN112614343B - Traffic signal control method and system based on stochastic policy gradient, and electronic device - Google Patents


Info

Publication number
CN112614343B
CN112614343B (application CN202011459044.2A)
Authority
CN
China
Prior art keywords: network, value, time, traffic, phase
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011459044.2A
Other languages
Chinese (zh)
Other versions
CN112614343A (en)
Inventor
郑培余
陶刚
陈波
李志斌
陈冰
杨光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Duolun Technology Corp ltd
Original Assignee
Duolun Technology Corp ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Duolun Technology Corp ltd
Priority to CN202011459044.2A
Publication of CN112614343A
Priority to PCT/CN2021/124593
Application granted
Publication of CN112614343B
Legal status: Active


Classifications

    • G: PHYSICS
    • G08: SIGNALLING
    • G08G: TRAFFIC CONTROL SYSTEMS
    • G08G1/00: Traffic control systems for road vehicles
    • G08G1/01: Detecting movement of traffic to be counted or controlled
    • G08G1/0104: Measuring and analyzing of parameters relative to traffic conditions
    • G08G1/0125: Traffic data processing
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00: Computer-aided design [CAD]
    • G06F30/10: Geometric CAD
    • G06F30/18: Network design, e.g. design based on topological or interconnect aspects of utility systems, piping, heating ventilation air conditioning [HVAC] or cabling
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00: Computer-aided design [CAD]
    • G06F30/20: Design optimisation, verification or simulation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06N3/088: Non-supervised learning, e.g. competitive learning
    • G: PHYSICS
    • G08: SIGNALLING
    • G08G: TRAFFIC CONTROL SYSTEMS
    • G08G1/00: Traffic control systems for road vehicles
    • G08G1/01: Detecting movement of traffic to be counted or controlled
    • G08G1/0104: Measuring and analyzing of parameters relative to traffic conditions
    • G08G1/0137: Measuring and analyzing of parameters relative to traffic conditions for specific applications
    • G: PHYSICS
    • G08: SIGNALLING
    • G08G: TRAFFIC CONTROL SYSTEMS
    • G08G1/00: Traffic control systems for road vehicles
    • G08G1/07: Controlling traffic signals
    • G08G1/08: Controlling traffic signals according to detected number or speed of vehicles
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00: Road transport of goods or passengers
    • Y02T10/10: Internal combustion engine [ICE] based vehicles
    • Y02T10/40: Engine management systems


Abstract

The invention discloses a traffic signal control method, system and electronic device based on a stochastic policy gradient. The method comprises the following steps: obtaining static road network data of at least one controlled signalized intersection; visually drawing a traffic simulation road network from the static road network data; acquiring real-time traffic operation state data of the at least one controlled signalized intersection; calibrating the simulation parameters of the traffic simulation road network against the traffic operation state data to obtain an optimized traffic simulation road network; inputting the traffic state observed from the optimized traffic simulation road network into a value network to obtain an evaluation value for each signal control scheme under that traffic state; inputting the traffic state into a policy network to obtain a probability value for each signal control scheme; and updating the parameters of the policy network through the stochastic policy gradient based on the evaluation values of the signal control schemes under the traffic state and one sampled signal control scheme. The method provided by the invention can alleviate the dimension-explosion problem of signal control.

Description

Traffic signal control method and system based on stochastic policy gradient, and electronic device
Technical Field
The invention relates to the technical field of intelligent transportation, and in particular to a traffic signal control method and system based on a stochastic policy gradient, and an electronic device.
Background
To meet the rapid growth of urban traffic demand, cities must not only build new road infrastructure to raise overall traffic capacity but also improve existing traffic infrastructure, increasing the throughput of existing roads through intelligent traffic management and control technology. As key nodes of the urban road network, intersections are among the research hotspots of urban intelligent traffic control. An urban intelligent traffic control system treats the intersection as a real-time controllable system that is continuously monitored, diagnosed, modeled and controlled in real time. However, a traditional intersection signal control system based on a fixed timing scheme cannot adapt to the nonlinearity, randomness, fuzziness and uncertainty of a traffic system.
An adaptive traffic signal control system can respond in time to dynamic changes in traffic flow and optimize the signal timing scheme for real-time control. However, existing adaptive traffic signal control systems have the following limitations: (1) controlling multiple intersections simultaneously raises the dimension-explosion problem; (2) there is no accurate traffic model framework that represents the dynamics and randomness of traffic flow in response to changes in signal control; (3) detector failures and communication failures greatly affect system stability.
Reinforcement learning is a machine learning method that requires no labeled supervision and can learn a control strategy by interacting directly with a traffic simulation road network. The agent obtains a state by observing the simulation road network and selects an action from the action set based on the policy function; after the action is executed, the simulation road network feeds back a reward signal that evaluates the quality of the chosen action and transitions to the next state, and the agent repeats this process until one episode ends, seeking the maximum cumulative reward. Adaptive traffic signal control based on reinforcement learning can therefore adapt to the dynamics and randomness of a traffic system, with clear advantages over traditional fixed-timing signal control and actuated signal control. However, in traditional reinforcement learning such as Q-learning, actions are selected from the Q values in a Q table, which can only handle a limited number of state-action pairs and cannot cope with a huge state space; an oversized state space causes dimension explosion, resulting in low policy-learning efficiency and low accuracy.
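The agent-environment loop described above can be sketched as follows. `ToyTrafficEnv` is a hypothetical stand-in for the traffic simulation road network, not the simulator used in the patent; the reward is the per-step reduction of the queue, matching the spirit of the evaluation values defined later.

```python
import random

class ToyTrafficEnv:
    """Hypothetical stand-in for the traffic simulation road network."""
    def __init__(self, episode_len=5):
        self.episode_len = episode_len
        self.t = 0
        self.queue = 10

    def reset(self):
        self.t, self.queue = 0, 10
        return self.queue

    def step(self, action):
        # action 0 = keep the current phase, 1 = switch phase;
        # reward = reduction of the queue length at this step
        served = 3 if action == 1 else 1
        prev = self.queue
        self.queue = max(0, self.queue - served + random.randint(0, 2))
        self.t += 1
        reward = prev - self.queue
        done = self.t >= self.episode_len
        return self.queue, reward, done

def run_episode(env, policy, seed=0):
    """One episode: observe state, pick action, collect cumulative reward."""
    random.seed(seed)
    state, total, done = env.reset(), 0, False
    while not done:
        action = policy(state)
        state, reward, done = env.step(action)
        total += reward
    return total
```

Because the reward telescopes, the cumulative reward of an episode equals the initial queue minus the final queue, which is exactly the "maximize cumulative reward = minimize residual queue" intuition of the text.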
Disclosure of Invention
The invention provides a traffic signal control method, system and electronic device based on a stochastic policy gradient, and aims to solve the problems of the prior art in which traditional reinforcement learning is used for traffic signal control: only a limited number of state-action pairs can be handled, a huge state space cannot be processed, and an oversized state space causes dimension explosion, resulting in low policy-learning efficiency and low accuracy.
According to a first aspect, the invention provides a traffic signal control method based on a stochastic policy gradient, comprising the following steps:
obtaining static road network data of at least one controlled signalized intersection;
visually drawing a traffic simulation road network from the static road network data;
acquiring real-time traffic operation state data of the at least one controlled signalized intersection;
calibrating the simulation parameters of the traffic simulation road network against the traffic operation state data to obtain an optimized traffic simulation road network;
inputting the traffic state observed from the optimized traffic simulation road network into a value network to obtain the evaluation value of each signal control scheme under that traffic state, and updating the parameters of the value network with a temporal difference algorithm, the value network being a pre-constructed convolutional neural network used to approximate the action value function;
inputting the traffic state into a policy network to obtain the probability value of each signal control scheme, and randomly sampling a signal control scheme according to those probability values, the policy network being a pre-constructed convolutional neural network used to approximate the policy function;
and updating the parameters of the policy network through the stochastic policy gradient based on the evaluation values of the signal control schemes under the traffic state and the sampled signal control scheme.
Optionally, when the traffic operation state data includes the headway and the vehicle acceleration/deceleration, the step of calibrating the simulation parameters of the traffic simulation road network against the traffic operation state data to obtain the optimized traffic simulation road network comprises:
acquiring the value ranges of the measured headway and vehicle acceleration/deceleration parameters, and preliminarily calibrating the headway and vehicle acceleration/deceleration parameters of the traffic simulation road network within those ranges;
observing the traffic simulation road network to obtain the simulated headway and vehicle acceleration/deceleration, and comparing them with the measured headway and vehicle acceleration/deceleration;
if the differences between the simulated and measured headways and between the simulated and measured accelerations/decelerations are within a preset range, ending the calibration to obtain the optimized traffic simulation road network; otherwise, repeating the above steps until the differences fall within the preset range.
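A minimal sketch of the calibration loop above. For brevity the simulated statistics are nudged directly toward the field values, whereas a real calibration would re-run the micro-simulation after each adjustment; parameter names (`headway`, `accel`) are illustrative.

```python
def calibrate(sim_params, field_stats, bounds, tol, max_iter=50):
    """Iteratively adjust simulation parameters (e.g. headway,
    acceleration/deceleration) toward observed field statistics
    until every difference is within tolerance."""
    params = dict(sim_params)
    # preliminary check: clip to the plausible field value ranges
    for k, (lo, hi) in bounds.items():
        params[k] = min(max(params[k], lo), hi)
    for _ in range(max_iter):
        diffs = {k: field_stats[k] - params[k] for k in params}
        if all(abs(d) <= tol for d in diffs.values()):
            return params, True          # calibration converged
        for k, d in diffs.items():
            params[k] += 0.5 * d         # move halfway toward the field value
    return params, False                 # gave up after max_iter rounds
```

Because each round halves the remaining gap, the loop converges quickly whenever the field values lie inside the clipped ranges.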
Optionally, the static road network data includes some or all of: road grade, number of lanes, lane width, lane function division, road length, road markings, intersection type, adjacent intersection information, signal equipment number, phase information, and phase sequence information;
the traffic operation state data may further include some or all of: device ID, detection time, traffic volume, vehicle type distribution, vehicle time occupancy, vehicle space occupancy, vehicle speed, vehicle length, headway, vehicle spacing, queue length, and number of stops.
Optionally, the traffic state is expressed as the maximum number of queued vehicles in each phase j at each signalized intersection, specifically:
o_t^i = [ max_{l ∈ L_j} q_{t,l} ]_{j = 1, …, n}
where o_t^i denotes the observed traffic state of signalized intersection i at decision time t; i is the index of the signalized intersection, i ∈ {1, 2, …, N}; j is the phase index, j ∈ {1, 2, …, n}; t is the decision time; l is the lane index; L_j denotes the set of lanes that may move during phase j; and q_{t,l} denotes the number of queued vehicles on lane l at decision time t.
The number of queued vehicles on lane l at decision time t equals the number at decision time t−1 plus the vehicles that joined the queue and minus those that left it by decision time t, specifically:
q_{t,l} = q_{t−1,l} + Σ_{v ∈ V_{t,l}} δ_{t,v}
where q_{t−1,l} is the number of queued vehicles on lane l at decision time t−1, V_{t,l} is the set of vehicles that entered lane l by decision time t, and δ_{t,v} indicates whether vehicle v joined or left the queue at decision time t:
δ_{t,v} = +1, if sp_{t,v} ≤ Sp_Thr < sp_{t−1,v} (vehicle v joined the queue);
δ_{t,v} = −1, if sp_{t−1,v} ≤ Sp_Thr < sp_{t,v} (vehicle v left the queue);
δ_{t,v} = 0, otherwise;
where sp_{t−1,v} and sp_{t,v} are the speeds of vehicle v at decision times t−1 and t, and Sp_Thr is the speed threshold used to decide whether a vehicle is queued.
The joint state of several signalized intersections is expressed as the vector of the per-intersection observations, specifically:
s_t = (o_t^1, …, o_t^i, …, o_t^N)
where o_t^i is the observation of the i-th signalized intersection at decision time t.
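The queue update and the per-phase observation above can be sketched as follows; function and variable names are illustrative, not from the patent.

```python
def update_queue(q_prev, speeds_prev, speeds_now, sp_thr):
    """q_{t,l} from q_{t-1,l}: +1 when a vehicle's speed drops to or
    below the threshold (it joins the queue), -1 when it rises above
    the threshold (it leaves the queue)."""
    q = q_prev
    for v, after in speeds_now.items():
        # vehicles unseen before are treated as previously moving
        before = speeds_prev.get(v, sp_thr + 1.0)
        if after <= sp_thr < before:       # joined the queue
            q += 1
        elif before <= sp_thr < after:     # left the queue
            q -= 1
    return q

def observe_intersection(queues_by_lane, lanes_by_phase):
    """o_t^i: for each phase j, the maximum queue over its lane set L_j."""
    return [max(queues_by_lane[l] for l in lanes_by_phase[j])
            for j in sorted(lanes_by_phase)]
```

The joint state s_t is then simply the tuple of these per-intersection observation vectors.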
Optionally, the signal control schemes are divided into fixed-phase-order action selection and variable-phase-order action selection according to whether the order of the phases may change.
For a fixed phase order, the policy network at signalized intersection i has two candidate actions at the decision time t of each iteration: a_t^i = 0 indicates continuing the current phase, and a_t^i = 1 indicates ending the current phase and switching to the next one, specifically:
a_t^i ∈ {0, 1}
For a variable phase order, the policy network at signalized intersection i has n candidate actions at decision time t, one per phase, specifically:
a_t^i ∈ {1, 2, …, n}
If at decision time t the policy network at signalized intersection i decides to continue the current phase, i.e. the action at time t is the same as the action at the previous decision time p(t), the duration of the current phase is extended by m seconds, with m between 1 and 5 seconds, and the next decision time for judging whether to switch the phase is t + m. If at time t it decides to end the current phase and switch to the next one, i.e. the action at time t differs from the action at the previous decision time p(t), a minimum green time G_min of the phase is enforced, followed by the inter-phase yellow time Y, after which the switch to the next phase begins; the next decision time for judging whether to switch the phase is then t + G_min + Y + m.
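Under the decision-timing rule above, the next decision time can be computed as in this small sketch; names are illustrative.

```python
def next_decision_time(t, action, prev_action, m, g_min, y):
    """When the sampled action keeps the current phase, the phase is
    extended by m seconds and the next decision falls at t + m; when
    it switches, a minimum green g_min and a yellow interval y must
    elapse first, so the next decision falls at t + g_min + y + m."""
    if action == prev_action:          # continue the current phase
        return t + m
    return t + g_min + y + m           # switch: min green + yellow first
```

For example, with m = 5 s, G_min = 10 s and Y = 3 s, continuing a phase at t = 100 s schedules the next decision at 105 s, while switching schedules it at 118 s.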
Optionally, the evaluation value function of the value network at signalized intersection i is defined as the reduction of the maximum number of queued vehicles, specifically:
r_t^i = max_l L_{p(t),l} − max_l L_{t,l}
where L_{p(t),l} and L_{t,l} are the numbers of vehicles queued on lane l at decision times p(t) and t, respectively.
Alternatively, the evaluation value function of the value network at signalized intersection i is defined as the reduction of the maximum total delay, specifically:
r_t^i = Σ_v Cd_{p(t),v} − Σ_v Cd_{t,v}
where Cd_{p(t),v} and Cd_{t,v} are the accumulated total delays of the queued vehicles at decision times p(t) and t, respectively.
The joint evaluation value function of the value networks at several signalized intersections is expressed as the coupling of the per-intersection evaluation value functions, specifically:
R_t^i = n · r_t^i + Σ_{j ∈ J(i)} r_t^j
where J(i) is the set of value networks other than the value network of signalized intersection i, and n is a non-negative weighting constant of the joint reward function: when n = 0, the value network of signalized intersection i considers only the evaluation values of the other value networks J(i), and the larger n is, the more the local evaluation value of signalized intersection i dominates.
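A minimal sketch of the local and joint evaluation values above, writing the non-negative weighting constant as `weight` to avoid confusion with the phase count:

```python
def local_reward_queue(max_queue_prev, max_queue_now):
    """r_t^i: reduction of the maximum number of queued vehicles
    between the previous decision time p(t) and time t."""
    return max_queue_prev - max_queue_now

def joint_reward(local, others, weight):
    """R_t^i = weight * r_t^i + sum of the other intersections'
    rewards; weight = 0 ignores the local term entirely, and a
    large weight makes the local term dominate."""
    return weight * local + sum(others)
```

For instance, a local reward of 4 with neighbouring rewards [1, 2] gives a purely cooperative joint reward of 3 at weight 0, and a locally dominated 43 at weight 10.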
Optionally, the value network is a twin delayed deep Q network comprising an action value network Q(s, a; ω) for selecting actions and a target value network Q(s, a; ω′) for computing the target Q value, where the parameters ω = [ω_1, …, ω_i, …, ω_N] are the N parameters of the action value network and the parameters ω′ = [ω′_1, …, ω′_i, …, ω′_N] are the N parameters of the target value network.
Training the value network and the policy network comprises the following steps:
1) input the reinforcement learning hyperparameters: experience pool capacity max_size, minibatch size batch_size, discount rate γ, action value network learning rate α, target value network learning rate β, policy network learning rate η, and termination iteration number N;
2) initialize the contents of the experience pool E, the parameters ω of the action value network Q(s, a; ω), the parameters ω′ of the target value network Q(s, a; ω′), and the parameters θ of the policy network π(a | s; θ);
3) obtain the joint traffic state s_t formed by the observations o_t^i of each signalized intersection i at time t, together with the current phase current_phase;
4) while the iteration count is less than the termination iteration number N, execute the following steps:
41) compute the probability distribution with the policy network π(a | s_t; θ), and randomly sample a signal control scheme a_t^i according to that distribution;
42) if the signal control scheme a_t^i keeps the current phase current_phase, extend the current phase by m seconds; if it does not keep the current phase, release the minimum green time G_min of the current phase, let the inter-phase yellow time Y elapse, and then switch to the j-th phase;
43) compute the evaluation value r_t^i of the value network at each signalized intersection i, build the joint evaluation value R_t^i, and compute the joint traffic state s_{t+1} formed by the observations o_{t+1}^i of each signalized intersection i at time t + 1;
44) if the experience pool E has reached its maximum capacity max_size, remove the oldest experiences from it; then put the experience (s_t, a_t^i, R_t^i, s_{t+1}) into the experience pool E;
45) when the experience pool holds more than batch_size experiences, execute the following steps:
451) randomly sample a minibatch from the experience pool E according to the experience priority values;
452) for each minibatch experience sample (s_t, a_t^i, R_t^i, s_{t+1}), compute the value of the action value network Q(s_t, a_t^i; ω_i) and the value of the target value network, y_t = R_t^i + γ max_a Q(s_{t+1}, a; ω′_i), at signalized intersection i, and obtain the baseline value b;
453) compute the loss L(ω) = (1 / batch_size) Σ (y_t − Q(s_t, a_t^i; ω_i))² and minimize it with the Adam optimizer's gradient descent to update the parameters ω;
454) update the parameters ω′ of the target value network by the soft update ω′ = βω′ + (1 − β)ω;
455) for each minibatch experience sample, compute the stochastic policy gradient of the policy network π(a | s; θ) with a Monte Carlo approximation, g(θ) ≈ Σ (Q(s_t, a_t^i; ω_i) − b) ∇_θ log π(a_t^i | s_t; θ), and update the parameters θ by gradient ascent, θ ← θ + η g(θ);
46) assign the traffic state s_{t+1} at time t + 1 to s_t and repeat from step 41).
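Steps 41) to 455) can be sketched with a tabular actor-critic. This toy version replaces the convolutional networks with lookup tables and leaves the traffic environment and the experience pool abstract, so it only illustrates the update rules (a TD target from a target network, a soft target update, and a policy gradient weighted by Q minus a baseline), not the patented system itself.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                     # numerical stability
    e = np.exp(z)
    return e / e.sum()

class ActorCritic:
    """Tabular sketch of the training loop: sample an action from the
    policy, update the critic toward a TD target, soft-update a target
    critic, and ascend (Q - baseline) * grad log pi for the actor.
    Hyperparameter names (gamma, alpha, beta, eta) mirror the text."""
    def __init__(self, n_states, n_actions, gamma=0.9,
                 alpha=0.1, beta=0.99, eta=0.05, seed=0):
        self.q = np.zeros((n_states, n_actions))          # action value net
        self.q_target = np.zeros((n_states, n_actions))   # target value net
        self.theta = np.zeros((n_states, n_actions))      # policy parameters
        self.gamma, self.alpha, self.beta, self.eta = gamma, alpha, beta, eta
        self.rng = np.random.default_rng(seed)

    def act(self, s):
        # step 41): sample a scheme from the policy's distribution
        p = softmax(self.theta[s])
        return int(self.rng.choice(len(p), p=p))

    def learn(self, s, a, r, s2):
        # steps 452)-453): TD target from the target network, then
        # move Q(s, a) toward it
        y = r + self.gamma * self.q_target[s2].max()
        self.q[s, a] += self.alpha * (y - self.q[s, a])
        # step 454): soft update of the target network
        self.q_target = self.beta * self.q_target + (1 - self.beta) * self.q
        # step 455): policy gradient with a value baseline;
        # grad log softmax = one_hot(a) - p
        p = softmax(self.theta[s])
        baseline = float(p @ self.q[s])
        advantage = self.q[s, a] - baseline
        grad = -p * advantage
        grad[a] += advantage
        self.theta[s] += self.eta * grad
```

Training this sketch on a state where one action is consistently rewarded drives both the critic values and the policy probabilities toward that action.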
According to a second aspect, the invention provides a traffic signal control system based on a stochastic policy gradient, comprising:
a first data acquisition module, for obtaining static road network data of at least one controlled signalized intersection;
a simulation drawing module, for visually drawing a traffic simulation road network from the static road network data;
a second data acquisition module, for acquiring real-time traffic operation state data of the at least one controlled signalized intersection;
a simulation calibration module, for calibrating the simulation parameters of the traffic simulation road network against the traffic operation state data to obtain an optimized traffic simulation road network;
an action evaluation module, for inputting the traffic state observed from the optimized traffic simulation road network into the value network to obtain the evaluation value of each signal control scheme under that traffic state, and updating the parameters of the value network with a temporal difference algorithm, the value network being a pre-constructed neural network used to approximate the action value function;
an action sampling module, for inputting the traffic state into the policy network to obtain the probability value of each signal control scheme and randomly sampling a signal control scheme according to those probability values, the policy network being a pre-constructed neural network used to approximate the policy function;
and a signal control module, for updating the parameters of the policy network through the stochastic policy gradient based on the evaluation values of the signal control schemes under the traffic state and the sampled signal control scheme.
According to a third aspect, the invention provides an electronic device comprising a memory and a processor communicatively connected to each other, wherein the memory stores computer instructions and the processor executes those instructions so as to perform the traffic signal control method based on a stochastic policy gradient of the first aspect.
According to a fourth aspect, the invention provides a computer-readable storage medium storing computer instructions for causing a computer to execute the traffic signal control method based on a stochastic policy gradient of the first aspect.
The beneficial effects of the invention are as follows:
1. Compared with the prior art, which parameterizes the Q value function, parameterizing the policy function is simpler, converges better, learns more efficiently and accurately, and generally avoids the dimension-explosion problem.
2. The traffic signal control method based on the stochastic policy gradient can adapt to the nonlinearity, randomness, fuzziness and uncertainty of a traffic system by continuously monitoring, diagnosing, modeling and controlling the intersection in real time.
3. By adopting a convolutional neural network from deep learning, the method copes with very large raw traffic data and traffic states: the convolutional neural network takes raw high-dimensional data as input, combines low-level features into more abstract high-level features, captures hidden features in the high-dimensional traffic state, and can control directly from the high-dimensional input, improving the feature representation of the state input matrix and strengthening the method's generalization across different traffic states.
4. Compared with traditional fixed-time control and actuated control, the traffic signal control method based on the stochastic policy gradient can respond in time to dynamic changes in traffic flow and optimize the signal timing scheme for real-time control, ultimately reducing travel delay in the road network and improving its traffic efficiency.
Drawings
FIG. 1 is a flow chart of the traffic signal control method based on a stochastic policy gradient provided by the present invention;
FIG. 2 is a schematic illustration of an exemplary traffic network in an embodiment of the present invention;
FIG. 3 is a flowchart illustrating the detailed sub-steps of step S400 in FIG. 1;
FIG. 4 is a functional block diagram of the traffic signal control system based on a stochastic policy gradient provided by the present invention;
FIG. 5 is a schematic diagram of the hardware structure of an electronic device according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the description of the present invention, it should be noted that the terms "first", "second", and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
Example 1
Fig. 1 shows a flow chart of a traffic signal control method based on a stochastic policy gradient according to an embodiment of the present invention. As shown in fig. 1, the method may include the following steps:
step S100: and acquiring static road network data of at least one control signalized intersection.
In an embodiment of the present invention, the static road network data includes part or all of road grade, number of lanes, lane width, lane function division, road length, road marking, intersection type, adjacent intersection information, signal device number, phase information, and phase sequence information.
Step S200: and visually drawing a traffic simulation road network according to the static road network data.
In the embodiment of the invention, microscopic traffic simulation software, such as SUMO, can be used for drawing a traffic simulation road network.
In an embodiment of the invention, a traffic simulation road network comprises at least one control signalized intersection. Specifically, as shown in fig. 2, reference numerals 1 to 9 in the drawing denote 9 signal intersections, and the whole shown in fig. 2 is a traffic network.
Step S300: and acquiring real-time traffic running state data of at least one control signalized intersection.
In the embodiment of the present invention, the traffic operation state data may further include some or all of: device ID, detection time, traffic flow, vehicle type distribution, vehicle time occupancy, headway, vehicle acceleration and deceleration, vehicle space occupancy, vehicle speed, vehicle length, vehicle spacing, queue length, and number of stops.
Step S400: and performing parameter checking on simulation parameters in the traffic simulation road network according to the traffic operation state data to obtain the optimized traffic simulation road network.
In the embodiment of the present invention, one or more items of the traffic operation state data may be used to calibrate the simulation parameters of the traffic simulation road network; the example here uses the headway and the vehicle acceleration/deceleration. As shown in fig. 3, step S400 may include the following steps:
step S401: and acquiring the value ranges of the actual headway time distance parameter and the vehicle acceleration and deceleration parameter, and preliminarily checking the headway time distance parameter and the vehicle acceleration and deceleration parameter in the traffic simulation road network according to the value ranges.
Step S402: and observing the traffic simulation road network, acquiring the simulation headway and the vehicle acceleration and deceleration, and comparing and analyzing the simulation headway and the vehicle acceleration and deceleration with the actual headway and the actual acceleration and deceleration.
Step S403: and if the difference between the simulated headway and the vehicle acceleration and deceleration and the difference between the actual headway and the vehicle acceleration and deceleration are within a preset range, finishing parameter checking to obtain the optimized traffic simulation road network.
And if the difference between the simulated headway and the vehicle acceleration and deceleration and the difference between the actual headway and the vehicle acceleration and deceleration are not within the preset range, repeating the steps S401 to S403 until the difference is within the preset range.
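The calibration loop of steps S401 to S403 can be sketched as follows. The proportional correction rule, the statistic names, and the tolerance values are illustrative assumptions and are not prescribed by the embodiment:

```python
def within_tolerance(simulated, observed, tol):
    """True when the simulated statistic differs from the field-observed
    one by no more than the preset range."""
    return abs(simulated - observed) <= tol

def calibrate(run_simulation, params, observed, tols, max_rounds=50):
    """Repeat steps S401-S403: run the simulation, compare the simulated
    headway/acceleration statistics with field data, and nudge the
    parameters until every difference falls inside its preset range."""
    for _ in range(max_rounds):
        simulated = run_simulation(params)
        if all(within_tolerance(simulated[k], observed[k], tols[k])
               for k in observed):
            break
        for k in observed:  # simple proportional correction (assumed)
            params[k] -= 0.5 * (simulated[k] - observed[k])
    return params
```

With an identity simulator this converges geometrically toward the observed statistics; a real simulator would re-run the traffic model each round.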
Step S500: and inputting the traffic state obtained by observing the optimized traffic simulation road network into the value network to obtain the evaluation value of each signal control scheme under the traffic state, and updating the parameters of the value network by adopting a time difference algorithm. In the embodiment of the invention, the value network is a pre-constructed convolutional neural network and is used for approximating the action value function.
In embodiments of the invention, the value network may be a convolutional neural network used only for approximating the action value function, comprising an input layer, convolutional layers, fully connected layers, and an output layer. It may also be a double-delay deep Q network, including an action value network Q_ω(s, a) for selecting actions and a target value network Q_ω′(s, a) for calculating the Q value, where the parameter ω = [ω_1, …, ω_i, …, ω_N] denotes the N parameters of the action value network, and the parameter ω′ = [ω′_1, …, ω′_i, …, ω′_N] denotes the N parameters of the target value network.
Step S600: and inputting the traffic state into a policy network to obtain the probability value of each signal control scheme, and randomly sampling according to the probability value of each signal control scheme to obtain a signal control scheme. In the embodiment of the invention, the strategy network is a pre-constructed convolutional neural network and is used for approximating the strategy function. In an embodiment of the present invention, a policy network includes an input layer, a convolutional layer, a fully-connected layer, and an output layer.
Step S700: and updating parameters of the strategy network through random strategy gradients based on the evaluation value of each signal control scheme in the traffic state and one signal control scheme.
In the embodiment of the present invention, the traffic state is represented as the maximum number of queued vehicles for each phase j at each signalized intersection, specifically:

o_t^i = [ max_{l ∈ L_j} q_{t,l} ], j = 1, …, n

where o_t^i represents the observed value of the traffic state of signalized intersection i at decision time t; i is the number of each signalized intersection, i ∈ {1, 2, …, N}; j is the phase number, j ∈ {1, 2, …, n}; t is the decision time; l is the lane number; L_j is the set of lanes that may pass during phase j; and q_{t,l} is the number of queued vehicles on lane l at decision time t.
The number of queued vehicles on lane l at decision time t equals the number of queued vehicles on lane l at decision time t−1, plus or minus the vehicles that join or leave the queue at decision time t, specifically:

q_{t,l} = q_{t−1,l} + Σ_{v ∈ V_{t,l}} δ_{t,v}

where q_{t−1,l} is the number of queued vehicles on lane l at decision time t−1; V_{t,l} is the set of vehicles on lane l at decision time t; and δ_{t,v} indicates whether vehicle v joins or leaves the queue at decision time t, specifically:

δ_{t,v} = +1 if sp_{t−1,v} > Sp_Thr and sp_{t,v} ≤ Sp_Thr (vehicle v joins the queue); −1 if sp_{t−1,v} ≤ Sp_Thr and sp_{t,v} > Sp_Thr (vehicle v leaves the queue); 0 otherwise

where sp_{t−1,v} and sp_{t,v} are the speeds of vehicle v at decision time t−1 and decision time t, and Sp_Thr is the speed threshold for judging whether a vehicle has joined the queue.
The joint state of the signalized intersections is expressed as the vector of every intersection's observation, specifically:

s_t = [o_t^1, o_t^2, …, o_t^N]

where o_t^i is the observed value of the i-th signalized intersection at decision time t.
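The state definition above can be illustrated with a minimal sketch; the lane and phase data structures are assumptions for illustration, not prescribed by the embodiment:

```python
def phase_observation(queue_counts, phase_lanes):
    """o_t^i: for each phase j, the maximum number of queued vehicles
    over the set of lanes L_j that may pass during that phase."""
    return [max(queue_counts[l] for l in lanes) for lanes in phase_lanes]

def joint_state(observations):
    """s_t: the vector formed from every intersection's observation."""
    return [x for obs in observations for x in obs]
```

For example, with lane queues {0: 3, 1: 5, 2: 1} and phase lane sets [[0, 1], [2]], the observation is [5, 1].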
In the embodiment of the present invention, the signal control scheme is divided into action selection with a fixed phase order and action selection with a variable phase order, according to whether the order of the phases may change.

For a fixed phase order, the policy network at signalized intersection i has two optional actions at the decision time t of each iteration: a_t^i = 0 indicates continuing the current phase, and a_t^i = 1 indicates ending the current phase and switching to the next phase, specifically:

a_t^i ∈ {0, 1}
For a variable phase order, the policy network at signalized intersection i has n optional actions at the decision time t of each iteration, one per phase, specifically:

a_t^i ∈ {1, 2, …, n}

If at decision time t the policy network at signalized intersection i decides to continue the current phase, i.e. a_t^i = a_{p(t)}^i (the action at time t is the same as the action at the previous decision time p(t)), the duration of the current phase is extended by m seconds, where m is 1 to 5 seconds, and the next decision on whether to switch phases is made at time t + m. If at time t it decides to end the current phase and switch to the next phase, i.e. a_t^i ≠ a_{p(t)}^i (the action at time t differs from the action at the previous decision time p(t)), a minimum green time G_min of the phase is set, the intermediate yellow time Y is released, the phase is switched to the next phase, and the next decision on whether to switch phases is made at time t + G_min + Y + m.
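The decision-time bookkeeping just described can be sketched as follows; the default values of m, G_min, and Y are placeholders within the ranges the text allows:

```python
def next_decision_time(t, action, prev_action, m=3, g_min=10, yellow=3):
    """Return the next decision time: continuing the current phase
    (same action as at p(t)) re-decides after m seconds; ending it
    waits out the minimum green G_min and the yellow time Y before
    the next m-second decision window."""
    if action == prev_action:        # continue current phase
        return t + m
    return t + g_min + yellow + m    # end phase and switch to the next
```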
In the embodiment of the invention, an evaluation value calculation function of the value network at signalized intersection i is defined as the reduction in the maximum number of queued vehicles, specifically:

r_t^i = max_l L_{p(t),l} − max_l L_{t,l}

where L_{p(t),l} and L_{t,l} are the numbers of vehicles queued on lane l at decision time p(t) and decision time t, respectively.
Alternatively, the evaluation value calculation function of the value network at signalized intersection i is defined as the reduction in the accumulated total delay, specifically:

r_t^i = Σ_v Cd_{p(t),v} − Σ_v Cd_{t,v}

where Cd_{p(t),v} and Cd_{t,v} are the accumulated delays of queued vehicle v at decision time p(t) and decision time t, respectively.
The joint evaluation value calculation function of the value networks at multiple signalized intersections is expressed as a coupling of the evaluation value calculation function of the value network at each signalized intersection, specifically:

R_t^i = n · r_t^i + Σ_{j ∈ J(i)} r_t^j

where J(i) is the set of value networks other than the value network of signalized intersection i, and n is a non-negative constant of the joint reward function: when n = 0, the value network of signalized intersection i considers only the evaluation values of the other value networks in J(i); as n grows, it increasingly weights the local evaluation value of signalized intersection i.
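A sketch of the queue-based evaluation value and its coupling across intersections; the exact weighted-sum form of the joint function is an assumption consistent with the described behaviour of the constant n:

```python
def local_reward(max_queue_prev, max_queue_now):
    """r_t^i: reduction of the maximum number of queued vehicles
    between decision time p(t) and decision time t."""
    return max_queue_prev - max_queue_now

def joint_reward(i, local_rewards, n=1.0):
    """R_t^i: couple intersection i's own reward, weighted by the
    non-negative constant n, with the rewards of the other value
    networks J(i).  n = 0 ignores the local term entirely."""
    others = sum(r for j, r in enumerate(local_rewards) if j != i)
    return n * local_rewards[i] + others
```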
In the embodiment of the invention, when the value network is a double-delay deep Q network, the specific steps of training the value network and the strategy network are as follows:
1) inputting the reinforcement-learning-related parameters: the experience pool capacity max_size, the mini-batch size batch_size, the discount rate γ, the action value network learning rate α, the target value network learning rate β, the policy network learning rate η, and the termination iteration number N.
In the embodiment of the present invention, the specific values of the parameters may be set according to the needs of the actual application scenario and the experience of the user. A set of specific parameter values is provided here to help those skilled in the art understand the technical solution: the experience pool capacity max_size is set to 100,000; the mini-batch size batch_size is set to 32; the discount rate γ is set to 0.75; the action value network learning rate α is set to 0.0002; the target value network learning rate β is set to 0.001; the policy network learning rate η is set to 0.0002; and the termination iteration number N is set to 450,000.
2) initializing the elements of the experience pool E, the parameter ω of the action value network Q_ω, the parameter ω′ of the target value network Q_ω′, and the parameter θ of the policy network π_θ;
3) obtaining the joint traffic state s_t formed by the observed values o_t^i of each signalized intersection i at time t, together with the current phase current_phase;
4) while the iteration count is less than the termination iteration number N, the following steps are executed:
41) calculating a probability distribution over actions according to the policy network π_θ, and randomly sampling a signal control scheme a_t^i from that distribution;
42) when the signal control scheme a_t^i keeps the current phase state current_phase, extending the current phase duration by m seconds; when a_t^i does not keep the current phase state current_phase, releasing the minimum green time G_min of the current phase and, after the intermediate phase yellow time Y has elapsed, switching to the j-th phase;
43) calculating the evaluation value r_t^i of the value network at each signalized intersection i, constructing the joint evaluation value calculation function R_t^i, and calculating the joint traffic state s_{t+1} formed by the observed values o_{t+1}^i of each signalized intersection i at time t+1;
44) when the experience pool E has reached its maximum capacity max_size, removing the oldest experiences from E; otherwise, putting the experience (s_t, a_t^i, R_t^i, s_{t+1}) into the experience pool E;
45) when the number of experiences in the pool is larger than the mini-batch size batch_size, the following steps are executed:
451) randomly sampling a mini-batch of experiences from the experience pool E according to the experience priority values;
452) for each mini-batch experience sample (s_t, a_t^i, R_t^i, s_{t+1}), respectively calculating the value of the action value network Q_ω and the value of the target value network Q_ω′ at signalized intersection i, and obtaining the value of the baseline b;
453) calculating the value of the loss function L(ω) = (1/batch_size) Σ (R_t^i + γ Q_ω′(s_{t+1}, a′) − Q_ω(s_t, a_t^i))², and minimizing it with the Adam optimizer's gradient descent to update the parameter ω;
454) updating the parameter ω′ of the target value network Q_ω′ according to ω′ = βω′ + (1 − β)ω;
455) for each mini-batch experience sample (s_t, a_t^i, R_t^i, s_{t+1}), calculating the random strategy gradient of the policy network π_θ based on the Monte Carlo approximation, ∇_θ J(θ) ≈ (1/batch_size) Σ (Q_ω(s_t, a_t^i) − b) ∇_θ log π_θ(a_t^i | s_t), and updating the parameter θ with a gradient ascent algorithm;
46) assigning the traffic state s_{t+1} at time t+1 to s_t, and repeating steps 451) to 455).
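The numeric core of steps 453) to 455) can be sketched with NumPy. The TD-target form and the mini-batch averaging are standard constructions assumed to match the loss and gradient the text references, not a verbatim transcription of the embodiment:

```python
import numpy as np

def td_targets(rewards, next_q, gamma=0.75):
    """Step 453): TD targets y_t = R_t + gamma * Q_omega'(s_{t+1}, a'),
    with the bootstrap term supplied by the target value network."""
    return rewards + gamma * next_q

def soft_update(target_w, online_w, beta=0.001):
    """Step 454): omega' <- beta * omega' + (1 - beta) * omega."""
    return beta * target_w + (1.0 - beta) * online_w

def policy_gradient(grad_log_pi, q_values, baseline):
    """Step 455), Monte Carlo form: mini-batch average of
    (Q(s, a) - b) * grad_theta log pi(a | s)."""
    advantage = q_values - baseline
    return (grad_log_pi * advantage[:, None]).mean(axis=0)
```

The parameter θ would then be updated by gradient ascent, θ ← θ + η · ∇_θ J(θ), with the learning rate η from step 1).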
Example 2
Fig. 4 is a schematic block diagram of a traffic signal control system based on a random policy gradient according to an embodiment of the present invention, which may be used to implement the traffic signal control method based on a random policy gradient of embodiment 1 or any optional implementation thereof. As shown in fig. 4, the system includes: a first data acquisition module 10, a simulation drawing module 20, a second data acquisition module 30, a simulation checking module 40, an action evaluation module 50, an action sampling module 60, and a signal control module 70.
the first data acquisition module 10 is configured to acquire static road network data of at least one control signalized intersection.
The simulation drawing module 20 is used for visually drawing the traffic simulation road network according to the static road network data.
The second data acquiring module 30 is configured to acquire real-time traffic operation state data of at least one control signalized intersection.
The simulation checking module 40 is configured to perform parameter checking on simulation parameters in the traffic simulation road network according to the traffic operation state data, so as to obtain an optimized traffic simulation road network.
The action evaluation module 50 is configured to input the traffic state obtained by observing the optimized traffic simulation road network into the value network, obtain an evaluation value of each signal control scheme in the traffic state, and update parameters of the value network by using a time difference algorithm. In the embodiment of the invention, the value network is a pre-constructed neural network used for approximating the action value function.
The action sampling module 60 is configured to input the traffic status into the policy network, obtain a probability value of each signal control scheme, and perform random sampling according to the probability value of each signal control scheme to obtain one signal control scheme. In the embodiment of the invention, the strategy network is a pre-constructed neural network and is used for approximating the strategy function.
The signal control module 70 is configured to update parameters of the policy network through a random policy gradient based on the evaluation value of each signal control scheme in the traffic state and one signal control scheme.
An embodiment of the present invention further provides an electronic device, as shown in fig. 5, the electronic device may include a processor 51 and a memory 52, where the processor 51 and the memory 52 may be connected by a bus or in another manner, and fig. 5 takes the connection by the bus as an example.
The processor 51 may be a Central Processing Unit (CPU). The Processor 51 may also be other general purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, or combinations thereof.
The memory 52, as a non-transitory computer-readable storage medium, may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as program instructions/modules corresponding to the traffic signal control method based on the stochastic strategy gradient in the embodiment of the present invention (for example, the first data acquisition module 10, the simulation plotting module 20, the second data acquisition module 30, the simulation checking module 40, the action evaluation module 50, the action sampling module 60, and the signal control module 70 shown in fig. 4). The processor 51 executes various functional applications and data processing of the processor by running non-transitory software programs, instructions and modules stored in the memory 52, namely, implements the traffic signal control method based on the random strategy gradient in the above method embodiment.
The memory 52 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created by the processor 51, and the like. Further, the memory 52 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 52 may optionally include memory located remotely from the processor 51, and these remote memories may be connected to the processor 51 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The one or more modules are stored in the memory 52 and, when executed by the processor 51, perform a traffic signal control method based on random policy gradients as in the embodiments shown in fig. 1-3.
The specific details of the electronic device may be understood by referring to the corresponding descriptions and effects in the embodiments shown in fig. 1 to fig. 3, which are not described herein again.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic Disk, an optical Disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a Flash Memory (Flash Memory), a Hard Disk (Hard Disk Drive, abbreviated as HDD), a Solid State Drive (SSD), or the like; the storage medium may also comprise a combination of memories of the kind described above.
It should be understood that the above examples are only for clarity of illustration and are not intended to limit the embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to exhaust all embodiments here, and obvious variations or modifications derived therefrom remain within the scope of the invention.

Claims (6)

1. A traffic signal control method based on a random strategy gradient is characterized by comprising the following steps:
obtaining static road network data of at least one control signalized intersection;
visually drawing a traffic simulation road network according to the static road network data;
acquiring real-time traffic running state data of the at least one control signalized intersection;
performing parameter checking on simulation parameters in the traffic simulation road network according to the traffic operation state data to obtain an optimized traffic simulation road network;
inputting the traffic state obtained by observing the optimized traffic simulation road network into a value network to obtain an evaluation value of each signal control scheme under the traffic state, and updating parameters of the value network by adopting a time difference algorithm; the value network is a pre-constructed convolutional neural network and is used for approximating an action value function;
inputting the traffic state into a policy network to obtain a probability value of each signal control scheme, and randomly sampling according to the probability value of each signal control scheme to obtain a signal control scheme; the strategy network is a pre-constructed convolutional neural network and is used for approximating a strategy function;
updating parameters of the policy network by a random policy gradient based on the evaluation value of each signal control scheme in the traffic state and the one signal control scheme;
the traffic running state data comprises a time headway and vehicle acceleration and deceleration, and the step of performing parameter checking on simulation parameters in the traffic simulation road network according to the traffic running state data to obtain an optimized traffic simulation road network comprises the following steps:
acquiring the value ranges of the actual headway time distance parameter and the vehicle acceleration and deceleration parameter, and preliminarily checking the headway time distance parameter and the vehicle acceleration and deceleration parameter in the traffic simulation road network according to the value ranges;
observing the traffic simulation road network, acquiring a simulation headway time and a vehicle acceleration and deceleration, and comparing and analyzing the simulation headway time and the vehicle acceleration and deceleration with an actual headway time and a vehicle acceleration and deceleration;
if the difference between the simulated headway and the vehicle acceleration and deceleration and the difference between the actual headway and the vehicle acceleration and deceleration are within a preset range, finishing parameter checking to obtain the optimized traffic simulation road network; otherwise, repeating the steps until the difference is within a preset range;
the traffic state is represented as the maximum number of queued vehicles for each phase j at each signalized intersection, specifically:

o_t^i = [ max_{l ∈ L_j} q_{t,l} ], j = 1, …, n

wherein o_t^i represents the observed value of the traffic state of signalized intersection i at decision time t; i is the number of each signalized intersection, i ∈ {1, 2, …, N}; j is the phase number, j ∈ {1, 2, …, n}; t is the decision time; l is the lane number; L_j is the set of lanes that may pass during phase j; and q_{t,l} is the number of queued vehicles on lane l at decision time t;
the number of queued vehicles on lane l at decision time t equals the number of queued vehicles on lane l at decision time t−1, plus or minus the vehicles that join or leave the queue at decision time t, specifically:

q_{t,l} = q_{t−1,l} + Σ_{v ∈ V_{t,l}} δ_{t,v}

wherein q_{t−1,l} is the number of queued vehicles on lane l at decision time t−1; V_{t,l} is the set of vehicles on lane l at decision time t; and δ_{t,v} indicates whether vehicle v joins or leaves the queue at decision time t, specifically:

δ_{t,v} = +1 if sp_{t−1,v} > Sp_Thr and sp_{t,v} ≤ Sp_Thr (vehicle v joins the queue); −1 if sp_{t−1,v} ≤ Sp_Thr and sp_{t,v} > Sp_Thr (vehicle v leaves the queue); 0 otherwise

wherein sp_{t−1,v} and sp_{t,v} are the speeds of vehicle v at decision time t−1 and decision time t, and Sp_Thr is the speed threshold for judging whether a vehicle has joined the queue;
the joint state of the signalized intersections is expressed as the vector of every intersection's observation, specifically:

s_t = [o_t^1, o_t^2, …, o_t^N]

wherein o_t^i is the observed value of the i-th signalized intersection at decision time t;
the signal control scheme is divided into action selection with a fixed phase order and action selection with a variable phase order, according to whether the phase order may change;

for a fixed phase order, the policy network at signalized intersection i has two optional actions at the decision time t of each iteration: a_t^i = 0 indicates continuing the current phase, and a_t^i = 1 indicates ending the current phase and switching to the next phase, specifically:

a_t^i ∈ {0, 1}
for a variable phase order, the policy network at signalized intersection i has n optional actions at the decision time t of each iteration, specifically:

a_t^i ∈ {1, 2, …, n}

if at decision time t the policy network at signalized intersection i decides to continue the current phase, i.e. a_t^i = a_{p(t)}^i (the action at time t is the same as the action at the previous decision time p(t)), the duration of the current phase is extended by m seconds, where m is 1 to 5 seconds, and the next decision on whether to switch phases is made at time t + m; if at time t it decides to end the current phase and switch to the next phase, i.e. a_t^i ≠ a_{p(t)}^i (the action at time t differs from the action at the previous decision time p(t)), a minimum green time G_min of the phase is set, the intermediate yellow time Y is released, the phase is switched to the next phase, and the next decision on whether to switch phases is made at time t + G_min + Y + m;
an evaluation value calculation function of the value network at signalized intersection i is defined as the reduction in the maximum number of queued vehicles, specifically:

r_t^i = max_l L_{p(t),l} − max_l L_{t,l}

wherein L_{p(t),l} and L_{t,l} are the numbers of vehicles queued on lane l at decision time p(t) and decision time t, respectively;
or, the evaluation value calculation function of the value network at signalized intersection i is defined as the reduction in the accumulated total delay, specifically:

r_t^i = Σ_v Cd_{p(t),v} − Σ_v Cd_{t,v}

wherein Cd_{p(t),v} and Cd_{t,v} are the accumulated delays of queued vehicle v at decision time p(t) and decision time t, respectively;
the joint evaluation value calculation function of the value networks at multiple signalized intersections is expressed as a coupling of the evaluation value calculation function of the value network at each signalized intersection, specifically:

R_t^i = n · r_t^i + Σ_{j ∈ J(i)} r_t^j

wherein J(i) is the set of value networks other than the value network of signalized intersection i, and n is a non-negative constant of the joint reward function: when n = 0, the value network of signalized intersection i considers only the evaluation values of the other value networks in J(i); as n grows, it increasingly weights the local evaluation value of signalized intersection i.
2. The random policy gradient-based traffic signal control method according to claim 1, wherein the static road network data comprises part or all of road classes, number of lanes, lane widths, lane functional divisions, road segment lengths, road markings, intersection types, adjacent intersection information, signal device numbers, phase information, and phase sequence information;
the traffic running state data further comprises part or all of the device ID, the detection time, the traffic flow, the vehicle type distribution, the vehicle time occupancy, the vehicle space occupancy, the vehicle speed, the vehicle length, the distance headway, the queuing length, and the number of stops.
3. The method of claim 1, wherein the value network is a double-delay deep Q network comprising an action value network Q_ω(s, a) for selecting actions and a target value network Q_ω′(s, a) for calculating the Q value; the parameter ω = [ω_1, …, ω_i, …, ω_N] denotes the N parameters of the action value network, and the parameter ω′ = [ω′_1, …, ω′_i, …, ω′_N] denotes the N parameters of the target value network;
training the value network and the strategy network, comprising the steps of:
1) inputting reinforcement learning related parameters: experience pool capacity max _ size, small batch size batch _ size, discount rate gamma, action value network learning rate alpha, target value network learning rate beta, strategy network learning rate eta and termination iteration number N;
2) initializing the elements of the experience pool E, the parameter ω of the action value network Q_ω, the parameter ω′ of the target value network Q_ω′, and the parameter θ of the policy network π_θ;

3) obtaining the joint traffic state s_t formed by the observed values o_t^i of each signalized intersection i at time t, together with the current phase current_phase;
4) while the iteration count is less than the termination iteration number N, executing the following steps:
41) calculating a probability distribution over actions according to the policy network π_θ, and randomly sampling a signal control scheme a_t^i based on the probability distribution;
42) when the signal control scheme a_t^i keeps the current phase state current_phase, extending the current phase duration by m seconds; when a_t^i does not keep the current phase state current_phase, releasing the minimum green time G_min of the current phase and, after the intermediate phase yellow time Y has elapsed, switching to the j-th phase;
43) calculating the evaluation value r_t^i of the value network at each signalized intersection i, constructing the joint evaluation value calculation function R_t^i, and calculating the joint traffic state s_{t+1} formed by the observed values o_{t+1}^i of each signalized intersection i at time t+1;
44) when the experience pool E has reached its maximum capacity max_size, removing the oldest experiences from E; otherwise, putting the experience (s_t, a_t^i, R_t^i, s_{t+1}) into the experience pool E;
45) when the number of experiences in the pool is larger than the mini-batch size batch_size, executing the following steps:
451) randomly sampling a mini-batch of experiences from the experience pool E according to the experience priority values;
452) for each mini-batch experience sample (s_t, a_t^i, R_t^i, s_{t+1}), respectively calculating the value of the action value network Q_ω and the value of the target value network Q_ω′ at signalized intersection i, and obtaining the value of the baseline b;
453) calculating the value of the loss function L(ω) = (1/batch_size) Σ (R_t^i + γ Q_ω′(s_{t+1}, a′) − Q_ω(s_t, a_t^i))², and minimizing it with the Adam optimizer's gradient descent to update the parameter ω;
454) updating the parameter ω′ of the target value network Q_ω′ according to ω′ = βω′ + (1 − β)ω;
455) for each mini-batch experience sample (s_t, a_t^i, R_t^i, s_{t+1}), calculating the random strategy gradient of the policy network π_θ based on the Monte Carlo approximation, ∇_θ J(θ) ≈ (1/batch_size) Σ (Q_ω(s_t, a_t^i) − b) ∇_θ log π_θ(a_t^i | s_t), and updating the parameter θ with a gradient ascent algorithm;
46) assigning the traffic state s_{t+1} at time t+1 to s_t, and repeating steps 451) to 455).
4. A traffic signal control system based on a stochastic strategy gradient, comprising:
the first data acquisition module is used for acquiring static road network data of at least one control signalized intersection;
the simulation drawing module is used for drawing a traffic simulation road network according to the static road network data in a visualized manner;
the second data acquisition module is used for acquiring real-time traffic running state data of the at least one control signalized intersection;
the simulation checking module is used for performing parameter checking on the simulation parameters in the traffic simulation road network according to the traffic running state data to obtain an optimized traffic simulation road network;
the action evaluation module is used for inputting the traffic state observed from the optimized traffic simulation road network into a value network to obtain the evaluation value of each signal control scheme under the traffic state, and updating the parameters of the value network by a temporal-difference algorithm; the value network is a pre-constructed neural network used for approximating the action value function;
the action sampling module is used for inputting the traffic state into a policy network to obtain the probability value of each signal control scheme, and randomly sampling according to these probability values to obtain a signal control scheme; the policy network is a pre-constructed neural network used for approximating the policy function;
the signal control module is used for updating the parameters of the policy network through the random strategy gradient, based on the evaluation value of each signal control scheme under the traffic state and the selected signal control scheme;
the traffic running state data comprises the headway and the vehicle acceleration/deceleration, and the simulation checking module is used for:
acquiring the value ranges of the actual headway parameter and the vehicle acceleration/deceleration parameter, and preliminarily checking the headway parameter and the vehicle acceleration/deceleration parameter in the traffic simulation road network according to these value ranges;
observing the traffic simulation road network, acquiring the simulated headway and vehicle acceleration/deceleration, and comparing them with the actual headway and vehicle acceleration/deceleration;
if the differences between the simulated and actual headway and between the simulated and actual vehicle acceleration/deceleration are within a preset range, ending the parameter checking to obtain the optimized traffic simulation road network; otherwise, repeating the above steps until the differences are within the preset range;
the traffic state is expressed as the maximum number of queued vehicles in each phase j at each signalized intersection, specifically:
o_t^i = [ max_{l∈L_j} q_{t,l} ], j ∈ {1, 2, ..., n}
where o_t^i represents the observed value of the traffic state of signalized intersection i at decision time t; i is the number of each signalized intersection, i ∈ {1, 2, ..., N}; j is the phase number, j ∈ {1, 2, ..., n}; t is the decision time; l is the lane number; L_j is the group of lanes passable in phase j; q_{t,l} is the number of queued vehicles on lane l at decision time t;
the number of queued vehicles on lane l at decision time t equals the number of queued vehicles on lane l at decision time t−1, plus or minus the number of vehicles joining or leaving the queue at decision time t, specifically:
q_{t,l} = q_{t-1,l} + Σ_{v∈V_{t,l}} J_{t,v}
where q_{t-1,l} is the number of queued vehicles on lane l at decision time t−1, and V_{t,l} is the set of vehicles entering lane l at decision time t; J_{t,v} indicates whether vehicle v joins or leaves the queue at decision time t, specifically:
J_{t,v} = 1 if sp_{t-1,v} > Sp_Thr and sp_{t,v} ≤ Sp_Thr (the vehicle joins the queue); J_{t,v} = −1 if sp_{t-1,v} ≤ Sp_Thr and sp_{t,v} > Sp_Thr (the vehicle leaves the queue); J_{t,v} = 0 otherwise
where sp_{t-1,v} and sp_{t,v} are the speeds of vehicle v at decision time t−1 and decision time t, and Sp_Thr is the speed threshold for judging whether a vehicle has joined the queue;
the joint state of a plurality of signalized intersections is expressed as the vector of the observed values of the individual signalized intersections, specifically:
s_t = (o_t^1, o_t^2, ..., o_t^N)
where o_t^i represents the observed value of the i-th signalized intersection at decision time t;
the signal control scheme is divided into fixed-phase-order action selection and variable-phase-order action selection according to whether the phase order is changed;
for a fixed phase order, the policy network at signalized intersection i has two optional actions at the decision time t of each iteration: when a_t^i = 0, the current phase is continued; when a_t^i = 1, the current phase is ended and switched to the next phase, specifically:
a_t^i ∈ {0, 1}
for a variable phase order, the policy network at signalized intersection i has n optional actions at the decision time t of each iteration, one per phase, specifically:
a_t^i ∈ {1, 2, ..., n}
if at decision time t the policy network at signalized intersection i decides to continue the current phase, i.e. a_t^i = a_{p(t)}^i (the action at time t is the same as the action at the previous decision time p(t)), the duration of the current phase is extended by m seconds, where m is 1 to 5 seconds, and the next decision time for judging whether to switch the phase is t + m; if at time t it decides to end the current phase and switch to the next phase, i.e. a_t^i ≠ a_{p(t)}^i (the action at time t differs from the action at the previous decision time p(t)), a minimum green time G_min of the phase is first enforced, the yellow time Y of the intermediate phase then elapses before the phase is switched, and the next decision time for judging whether to switch the phase is t + G_min + Y + m;
an evaluation value calculation function of the value network at signalized intersection i is defined as the reduction of the maximum number of queued vehicles, specifically:
r_t^i = max_l L_{p(t),l} − max_l L_{t,l}
where L_{p(t),l} and L_{t,l} are the numbers of queued vehicles on lane l at decision time p(t) and decision time t, respectively;
or, the evaluation value calculation function of the value network at signalized intersection i is defined as the reduction of the maximum total delay, specifically:
r_t^i = Σ_v Cd_{p(t),v} − Σ_v Cd_{t,v}
where Cd_{p(t),v} and Cd_{t,v} are the accumulated total delays of the queued vehicles at decision time p(t) and decision time t, respectively;
the joint evaluation value calculation function of the value networks at a plurality of signalized intersections is expressed as the coupling of the evaluation value calculation functions of the value networks at the individual signalized intersections, specifically:
Figure FDA0003629056780000084
where J(i) is the set of value networks other than the value network of signalized intersection i; in the joint reward function, n is a non-negative constant: when n = 0, the value network of signalized intersection i considers only the evaluation values of the other value networks J(i), and as n becomes larger, the value network of signalized intersection i considers mainly its own local evaluation value.
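The state and reward definitions above — the maximum queue per phase as the observation, queue join/leave detection by a speed threshold, and the reward as the reduction of the maximum queue — can be sketched as plain functions. This is an illustrative reading of the claim, not the patent's implementation; the threshold value and all names are assumptions.

```python
SP_THR = 5.0  # speed threshold Sp_Thr for queue membership (illustrative value)

def join_leave(speed_prev, speed_now, thr=SP_THR):
    """+1 if the vehicle drops below the threshold at time t (joins the queue),
    -1 if it rises above it (leaves the queue), 0 otherwise."""
    if speed_prev > thr >= speed_now:
        return 1
    if speed_prev <= thr < speed_now:
        return -1
    return 0

def update_lane_queue(q_prev, speed_pairs, thr=SP_THR):
    """q_{t,l} = q_{t-1,l} plus the net number of joins among the vehicles on
    lane l, each given as a (speed at t-1, speed at t) pair."""
    return q_prev + sum(join_leave(s0, s1, thr) for s0, s1 in speed_pairs)

def observed_state(queue_by_lane, phase_lanes):
    """Observed value o_t^i: the maximum queued-vehicle count over the
    passable lanes L_j of each phase j."""
    return [max(queue_by_lane[l] for l in lanes) for lanes in phase_lanes]

def queue_reduction_reward(queue_prev_by_lane, queue_now_by_lane):
    """Evaluation value r_t^i: reduction of the maximum number of queued
    vehicles between decision times p(t) and t."""
    return max(queue_prev_by_lane.values()) - max(queue_now_by_lane.values())
```

A positive reward then means the chosen signal scheme shortened the worst queue at the intersection, which is exactly what the value network is trained to predict.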
5. An electronic device, comprising: a memory and a processor, the memory and the processor being communicatively connected to each other, the memory storing therein computer instructions, and the processor executing the computer instructions to perform the traffic signal control method based on random strategy gradients of any one of claims 1-3.
6. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer instructions for causing the computer to execute the method for stochastic policy gradient based traffic signal control according to any of claims 1 to 3.
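The decision-timing rule in the claims — extend the phase by m seconds when the sampled action repeats, otherwise enforce a minimum green G_min plus the yellow time Y before the next decision — reduces to a small scheduling function. The sketch below is illustrative; the constant values are assumptions, not taken from the patent.

```python
def next_decision_time(t, action_now, action_prev, m=5, g_min=10, y=3):
    """Next time at which the policy network is queried again.
    - same action: the current phase is extended by m seconds;
    - different action: the new phase runs at least g_min seconds of green,
      after y seconds of intermediate yellow, before the next m-second step."""
    if action_now == action_prev:
        return t + m          # keep the phase, decide again at t + m
    return t + g_min + y + m  # switch: decide again at t + G_min + Y + m
```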
CN202011459044.2A 2020-12-11 2020-12-11 Traffic signal control method and system based on random strategy gradient and electronic equipment Active CN112614343B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011459044.2A CN112614343B (en) 2020-12-11 2020-12-11 Traffic signal control method and system based on random strategy gradient and electronic equipment
PCT/CN2021/124593 WO2022121510A1 (en) 2020-12-11 2021-10-19 Stochastic policy gradient-based traffic signal control method and system, and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011459044.2A CN112614343B (en) 2020-12-11 2020-12-11 Traffic signal control method and system based on random strategy gradient and electronic equipment

Publications (2)

Publication Number Publication Date
CN112614343A CN112614343A (en) 2021-04-06
CN112614343B true CN112614343B (en) 2022-08-19

Family

ID=75234428

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011459044.2A Active CN112614343B (en) 2020-12-11 2020-12-11 Traffic signal control method and system based on random strategy gradient and electronic equipment

Country Status (2)

Country Link
CN (1) CN112614343B (en)
WO (1) WO2022121510A1 (en)

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112614343B (en) * 2020-12-11 2022-08-19 多伦科技股份有限公司 Traffic signal control method and system based on random strategy gradient and electronic equipment
CN113362618B (en) * 2021-06-03 2022-08-09 东南大学 Multi-mode traffic adaptive signal control method and device based on strategy gradient
CN114038217B (en) * 2021-10-28 2023-11-17 李迎 Traffic signal configuration and control method
CN114446066B (en) * 2021-12-30 2023-05-16 银江技术股份有限公司 Road signal control method and device
CN114613159B (en) * 2022-02-10 2023-07-28 北京箩筐时空数据技术有限公司 Traffic signal lamp control method, device and equipment based on deep reinforcement learning
CN114743388B (en) * 2022-03-22 2023-06-20 中山大学·深圳 Multi-intersection signal self-adaptive control method based on reinforcement learning
CN115100850A (en) * 2022-04-21 2022-09-23 浙江省交通投资集团有限公司智慧交通研究分公司 Hybrid traffic flow control method, medium, and apparatus based on deep reinforcement learning
CN114898576B (en) * 2022-05-10 2023-12-19 阿波罗智联(北京)科技有限公司 Traffic control signal generation method and target network model training method
CN115331428B (en) * 2022-07-05 2023-10-17 成利吉(厦门)智能股份有限公司 Traffic signal optimization method based on rule base
CN115171408B (en) * 2022-07-08 2023-05-30 华侨大学 Traffic signal optimization control method
CN115310278A (en) * 2022-07-28 2022-11-08 东南大学 Simulation method and verification method for large-scale road network online micro traffic
CN115440042B (en) * 2022-09-02 2024-02-02 吉林大学 Multi-agent constraint strategy optimization-based signalless intersection cooperative control method
CN115762128B (en) * 2022-09-28 2024-03-29 南京航空航天大学 Deep reinforcement learning traffic signal control method based on self-attention mechanism
CN116153065A (en) * 2022-12-29 2023-05-23 山东大学 Intersection traffic signal refined optimization method and device under vehicle-road cooperative environment
CN116597672B (en) * 2023-06-14 2024-02-13 南京云创大数据科技股份有限公司 Regional signal lamp control method based on multi-agent near-end strategy optimization algorithm
CN117151441B (en) * 2023-10-31 2024-01-30 长春工业大学 Replacement flow workshop scheduling method based on actor-critique algorithm
CN117173914B (en) * 2023-11-03 2024-01-26 中泰信合智能科技有限公司 Road network signal control unit decoupling method, device and medium for simplifying complex model
CN117275259B (en) * 2023-11-20 2024-02-06 北京航空航天大学 Multi-intersection cooperative signal control method based on field information backtracking
CN117671977A (en) * 2024-02-01 2024-03-08 银江技术股份有限公司 Signal lamp control method, system, device and medium for traffic trunk line

Family Cites Families (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE102008049568A1 (en) * 2008-09-30 2010-04-08 Siemens Aktiengesellschaft A method of optimizing traffic control at a traffic signal controlled node in a road traffic network
US8571743B1 (en) * 2012-04-09 2013-10-29 Google Inc. Control of vehicles based on auditory signals
CN105955930A (en) * 2016-05-06 2016-09-21 天津科技大学 Guidance-type policy search reinforcement learning algorithm
CN109215355A (en) * 2018-08-09 2019-01-15 北京航空航天大学 A kind of single-point intersection signal timing optimization method based on deeply study
CN109472984A (en) * 2018-12-27 2019-03-15 苏州科技大学 Signalized control method, system and storage medium based on deeply study
CN109559530B (en) * 2019-01-07 2020-07-14 大连理工大学 Multi-intersection signal lamp cooperative control method based on Q value migration depth reinforcement learning
CN110047278B (en) * 2019-03-30 2021-06-08 北京交通大学 Adaptive traffic signal control system and method based on deep reinforcement learning
CN111833590B (en) * 2019-04-15 2021-12-07 北京京东尚科信息技术有限公司 Traffic signal lamp control method and device and computer readable storage medium
CN110428615B (en) * 2019-07-12 2021-06-22 中国科学院自动化研究所 Single intersection traffic signal control method, system and device based on deep reinforcement learning
CN110673602B (en) * 2019-10-24 2022-11-25 驭势科技(北京)有限公司 Reinforced learning model, vehicle automatic driving decision method and vehicle-mounted equipment
CN110930734A (en) * 2019-11-30 2020-03-27 天津大学 Intelligent idle traffic indicator lamp control method based on reinforcement learning
CN111311945B (en) * 2020-02-20 2021-07-09 南京航空航天大学 Driving decision system and method fusing vision and sensor information
CN111696370B (en) * 2020-06-16 2021-09-03 西安电子科技大学 Traffic light control method based on heuristic deep Q network
CN111737826B (en) * 2020-07-17 2020-11-24 北京全路通信信号研究设计院集团有限公司 Rail transit automatic simulation modeling method and device based on reinforcement learning
CN112052456A (en) * 2020-08-31 2020-12-08 浙江工业大学 Deep reinforcement learning strategy optimization defense method based on multiple intelligent agents
CN112614343B (en) * 2020-12-11 2022-08-19 多伦科技股份有限公司 Traffic signal control method and system based on random strategy gradient and electronic equipment

Also Published As

Publication number Publication date
WO2022121510A1 (en) 2022-06-16
CN112614343A (en) 2021-04-06

Similar Documents

Publication Publication Date Title
CN112614343B (en) Traffic signal control method and system based on random strategy gradient and electronic equipment
CN111696370B (en) Traffic light control method based on heuristic deep Q network
CN112669629B (en) Real-time traffic signal control method and device based on deep reinforcement learning
CN114758497B (en) Adaptive parking lot variable entrance and exit control method, device and storage medium
WO2021051930A1 (en) Signal adjustment method and apparatus based on action prediction model, and computer device
CN114463997A (en) Lantern-free intersection vehicle cooperative control method and system
CN110942627A (en) Road network coordination signal control method and device for dynamic traffic
CN115019523B (en) Deep reinforcement learning traffic signal coordination optimization control method based on minimized pressure difference
CN114419884B (en) Self-adaptive signal control method and system based on reinforcement learning and phase competition
CN113515892B (en) Multi-agent traffic simulation parallel computing method and device
CN113724507A (en) Traffic control and vehicle induction cooperation method and system based on deep reinforcement learning
GB2607880A (en) Traffic control system
Chentoufi et al. A hybrid particle swarm optimization and tabu search algorithm for adaptive traffic signal timing optimization
US20230162539A1 (en) Driving decision-making method and apparatus and chip
CN115472023B (en) Intelligent traffic light control method and device based on deep reinforcement learning
CN115083149B (en) Reinforced learning variable duration signal lamp control method for real-time monitoring
CN116189454A (en) Traffic signal control method, device, electronic equipment and storage medium
Huang et al. A modified cell transmission model considering queuing characteristics for channelized zone at signalized intersections
CN115981302A (en) Vehicle following lane change behavior decision-making method and device and electronic equipment
Chen et al. Traffic signal optimization control method based on adaptive weighted averaged double deep Q network
CN113268857A (en) Urban expressway intersection area micro traffic simulation method and device based on multiple intelligent agents
CN114299714B (en) Multi-turn-channel coordination control method based on different strategy reinforcement learning
CN114639255B (en) Traffic signal control method, device, equipment and medium
CN113753049B (en) Social preference-based automatic driving overtaking decision determination method and system
CN116895158A (en) Urban road network traffic signal control method based on multi-agent Actor-Critic and GRU

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CB03 Change of inventor or designer information

Inventor after: Zheng Peiyu

Inventor after: Zhang Daoyang

Inventor after: Tao Gang

Inventor after: Chen Bo

Inventor after: Li Zhibin

Inventor after: Chen Bing

Inventor after: Yang Guang

Inventor before: Zheng Peiyu

Inventor before: Tao Gang

Inventor before: Chen Bo

Inventor before: Li Zhibin

Inventor before: Chen Bing

Inventor before: Yang Guang