CN116758767B - Traffic signal lamp control method based on multi-strategy reinforcement learning - Google Patents

Traffic signal lamp control method based on multi-strategy reinforcement learning

Info

Publication number
CN116758767B
CN116758767B (Application No. CN202311050477.6A)
Authority
CN
China
Prior art keywords
information
training
signal lamp
learning system
current
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311050477.6A
Other languages
Chinese (zh)
Other versions
CN116758767A (en)
Inventor
邓晓衡
尹顺梦
桂劲松
万少华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University filed Critical Central South University
Priority to CN202311050477.6A priority Critical patent/CN116758767B/en
Publication of CN116758767A publication Critical patent/CN116758767A/en
Application granted granted Critical
Publication of CN116758767B publication Critical patent/CN116758767B/en
Legal status: Active

Classifications

    • G08G1/07 Traffic control systems for road vehicles; controlling traffic signals
    • G06N20/00 Computing arrangements based on specific computational models; machine learning
    • G08G1/0125 Measuring and analyzing of parameters relative to traffic conditions; traffic data processing
    • G08G1/0137 Measuring and analyzing of parameters relative to traffic conditions for specific applications
    • G08G1/0145 Measuring and analyzing of parameters relative to traffic conditions for specific applications; for active traffic flow control


Abstract

The invention discloses a traffic signal lamp control method based on multi-strategy reinforcement learning, which comprises the steps of: obtaining traffic data information of a target traffic signal lamp at the current moment; performing complexity judgment with a classification width learning system; calculating the optimal action value at the next moment with the current evaluation width learning system; acquiring state information and control strategies of the current moment and historical moments; training the evaluation width learning system; and repeating the above traffic signal lamp control based on multi-strategy reinforcement learning at the target traffic signal lamp in real time. By combining a width learning system, the invention provides a novel traffic signal lamp control method that not only realizes the control of traffic signal lamps at urban intersections, but also has high reliability, good real-time performance and good accuracy.

Description

Traffic signal lamp control method based on multi-strategy reinforcement learning
Technical Field
The invention belongs to the technical field of traffic control systems, and particularly relates to a traffic signal lamp control method based on multi-strategy reinforcement learning.
Background
With economic and technological development and the continuous improvement of living standards, traffic congestion has become an increasingly serious problem. Methods for alleviating traffic congestion are therefore of great significance.
At present there are two main approaches to relieving traffic congestion: first, building new roads and improving infrastructure; second, controlling traffic signal lamps with artificial intelligence schemes.
Researchers have proposed a large number of traffic signal control schemes based on artificial intelligence to optimize traffic signal control strategies. Schemes based on Model Predictive Control (MPC) monitor road traffic flow in real time, feed the data into a predictive model, and adjust the signal period and phase according to the prediction result; although such schemes improve traffic efficiency to a certain extent, the dynamic and stochastic nature of urban road traffic leaves them with poor accuracy and reliability. There are also methods that control traffic lights from real-time traffic information, such as dynamic control based on deep reinforcement learning; an agent interacts with the environment and is trained with a deep neural network so that an optimized control strategy is gradually learned; however, the learning process of this type of scheme is slow and cannot meet real-time control requirements.
Disclosure of Invention
The invention aims to provide a traffic signal lamp control method based on multi-strategy reinforcement learning, which has high reliability, good real-time performance and good accuracy.
The traffic signal lamp control method based on multi-strategy reinforcement learning provided by the invention comprises the following steps:
s1, acquiring traffic data information at a target traffic signal lamp at the current moment;
s2, according to the data information obtained in the step S1, complexity judgment is carried out by adopting a classification width learning system:
if the system is judged to be a simple system, calculating a control strategy of the traffic signal lamp at the next moment according to the acquired data information, ending the control process of the traffic signal lamp at the current moment, and jumping to the step S4;
if the complex system is determined, continuing the subsequent steps;
s3, calculating an optimal action value at the next moment by adopting the current evaluation width learning system according to the current state information; the traffic signal lamp control process at the current moment is ended, and the process jumps to step S4;
s4, acquiring state information, control strategies and reward information of the current moment and the historical moments;
s5, extracting a plurality of pieces of information from the data information obtained in the step S4 to train the evaluation width learning system, and taking the trained evaluation width learning system as the current evaluation width learning system;
and S6, repeating the steps S1-S5 in real time to complete the traffic signal lamp control based on multi-strategy reinforcement learning at the target traffic signal lamp.
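For illustration only, the following Python sketch shows one pass through steps S1 to S5 (step S6 repeats it in real time); every object and function name in it (control_step, get_traffic_data, webster_plan, best_action and so on) is a hypothetical placeholder and not part of the patented implementation.

```python
# Illustrative sketch of one control cycle (steps S1-S5); S6 repeats it in real time.
# All object and function names are hypothetical placeholders.

def control_step(junction, classifier_bls, eval_bls, webster_plan,
                 replay_buffer, trainer):
    # S1: obtain traffic data at the target traffic signal lamp
    data = junction.get_traffic_data()

    # S2: complexity judgment with the classification width learning system
    if classifier_bls.predict(data) == "simple":
        # simple system: next-moment control strategy from the Webster algorithm
        plan = webster_plan(data)
    else:
        # S3: complex system: optimal action from the evaluation width learning system
        state = junction.get_state_vector()
        plan = eval_bls.best_action(state)
    junction.apply(plan)

    # S4: store (previous state, action, reward, current state) in the buffer
    replay_buffer.add(*junction.last_transition())

    # S5: train the evaluation width learning system on sampled transitions
    trainer.train_round(eval_bls, replay_buffer)
```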
The step S2 specifically comprises the following steps:
according to the data information acquired in step S1, a classification width learning system is adopted to carry out complexity judgment:
if the system is judged to be a simple system, calculating a control strategy of the traffic signal lamp at the next moment by adopting the Webster algorithm according to the acquired data information; the traffic signal lamp control process at the current moment is ended, and the process jumps to step S4;
if the complex system is determined, the following steps are continued.
The step S3 specifically comprises the following steps:
according to the current state information, a current evaluation width learning system is adopted, and based on the state of the current moment, an optimal action value of the next moment is calculated and obtained, wherein the optimal action value corresponds to a control strategy of a traffic signal lamp;
after the calculation is completed, the traffic signal lamp control process at the current moment is ended, and the process jumps to step S4.
The step S3 specifically comprises the following steps:
based on the current state information, a current evaluation width learning system is adoptedThe optimal action value +_for the next moment is calculated by the following formula>In->For action->A corresponding maximum value;status +.>Lower corresponding action->Q value of (2); />Is the current state; />To evaluate network parameters in a breadth-learning system; optimal action value->Corresponding to the control strategy of the traffic light.
The step S4 specifically comprises the following steps:
at each moment t, the state information s_{t-1} of the previous moment, the action information a_{t-1}, the reward information r_{t-1} and the state information s_t of the current moment are combined into a tuple (s_{t-1}, a_{t-1}, r_{t-1}, s_t) and stored in the storage buffer;
when the memory buffer is full, the earliest stored state information is replaced with the latest stored state information.
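A minimal sketch of such a storage buffer is given below; the capacity value and the use of Python's deque are assumptions, not part of the invention. The deque drops the earliest transition automatically once it is full, and uniform sampling matches the way the training pool is filled in step S5.

```python
from collections import deque
import random

class ReplayBuffer:
    """Fixed-size buffer of (s_prev, a_prev, r_prev, s_curr) tuples; when full,
    the earliest stored transition is overwritten by the newest one."""

    def __init__(self, capacity=10_000):  # capacity is an illustrative choice
        self.storage = deque(maxlen=capacity)

    def add(self, s_prev, a_prev, r_prev, s_curr):
        self.storage.append((s_prev, a_prev, r_prev, s_curr))

    def sample(self, batch_size):
        # uniform sampling, as used to fill the training pool before each round
        return random.sample(list(self.storage), min(batch_size, len(self.storage)))
```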
The reward information is specifically obtained by the following steps:
the reward information r_t is obtained by the following calculation formula: r_t = ω1·D_t + ω2·W_t, in which ω1 and ω2 are weight values, and ω1 + ω2 = 1; D_t is the vehicle average waiting time variable, computed from N_t, the total number of waiting vehicles on the road at time t, d_i, the corresponding waiting time of each vehicle, T, the duration of the traffic light in one phase, v_i, the current speed of the vehicle, and v_min, the prescribed minimum speed of the vehicle; W_t is the variable of the longest and the shortest waiting time of the vehicles, computed from w_t, the waiting time of the shortest- or longest-waiting vehicle.
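Because the per-term formulas above are expressed in terms of N_t, d_i, T, v_i, v_min and w_t, the sketch below only assumes a plausible form for each term (negative mean waiting time of slow-moving vehicles and negative waiting time of the longest-waiting vehicle); it illustrates the weighted structure r = ω1·D + ω2·W rather than the exact patented formula.

```python
def reward(waiting_times, speeds, w1=0.5, w2=0.5, v_min=0.1):
    """Hedged sketch of the weighted reward r = w1*D + w2*W with w1 + w2 = 1.

    Assumed term forms (the patent's exact expressions are not reproduced here):
    D is the negative mean waiting time of vehicles slower than v_min, and W is
    the negative waiting time of the longest-waiting vehicle, so shorter waits
    yield a larger reward.
    """
    waiting = [t for t, v in zip(waiting_times, speeds) if v < v_min]
    if not waiting:
        return 0.0
    d_term = -sum(waiting) / len(waiting)   # average-waiting-time component
    w_term = -max(waiting)                  # longest-waiting-vehicle component
    return w1 * d_term + w2 * w_term
```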
The step S5 specifically comprises the following steps:
before each round of training, a batch of data with the size of a set value P is extracted from a storage buffer area in a uniform sampling mode and is put into a training pool; during each round of training, training data is obtained from a training pool to perform training; after each round of training, taking the evaluation width learning system after the current round of training as a current evaluation width learning system;
first, training is performed for an evaluation width learning system:
in the training data of the current round, (s_{t-1}, a_{t-1}) taken from the training pool is used as the input X of the system, and the target value y is used as the output Y of the system; s_{t-1} is the state information data at the previous moment, a_{t-1} is the action information data at the previous moment, and y is the training target value information data;
the input X is mapped to the feature space by the following algorithm: Z_i = φ(X·W_{e_i} + β_{e_i}), i = 1, 2, ..., n, in which Z_i is the i-th group of feature nodes; W_{e_i} is a randomly generated random weight matrix with a set dimension; β_{e_i} is a randomly generated bias term; φ is the first nonlinear mapping function;
the feature nodes are mapped to obtain the enhancement nodes by the following formula: H_j = ξ(Z^n·W_{h_j} + β_{h_j}), j = 1, 2, ..., m, in which H_j is the j-th group of enhancement nodes; Z^n = [Z_1, Z_2, ..., Z_n] is the set of the n groups of mapped features; W_{h_j} is a randomly generated random weight matrix; β_{h_j} is a randomly generated bias term; ξ is the second nonlinear mapping function; n is the number of groups of feature nodes;
the feature nodes and the enhancement nodes are connected and imported into the output layer of the system to obtain the output Y of the system: Y = [Z^n | H^m]·W_m, in which H^m = [H_1, H_2, ..., H_m] is the set of the m groups of enhancement nodes; m is the number of groups of enhancement nodes; W_m is the connection weight of the network in the system;
if the training of the evaluation width learning system does not meet the set requirement, incremental learning is performed; the incremental learning comprises adding feature nodes and adding enhancement nodes;
after training for a set number of times, the weight values of the evaluation width learning system are copied to the target width learning system to complete the update of the target width learning system.
The target value y is calculated by the following formula: y = r + γ·Q(s_t, a*; θ′), in which r is the reward information at the current moment; γ is a discount factor; Q(s_t, a*; θ′) is the Q value corresponding to the optimal action a*, evaluated in the target width learning system; θ′ denotes the parameters of the target width learning system.
The connection weight of the network in the system is calculated by the following steps:
the nodes of the evaluation width learning system are expressed as A_m = [Z^n | H^m];
the connection weight W_m of the network in the system is calculated by the following formula: W_m = (A_mᵀ·A_m + λ·I)⁻¹·A_mᵀ·Y, in which A_mᵀ is the transpose of A_m; λ is a regularization parameter; I is the identity matrix; Y is the output of the system.
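The following NumPy sketch puts the feature-node mapping, enhancement-node mapping and ridge-regression weight solution together; the node counts, the use of tanh for both nonlinear mappings φ and ξ, and the regularization value are assumptions, since the invention leaves these choices open.

```python
import numpy as np

def train_bls(X, Y, n_feature_groups=10, nodes_per_group=20,
              n_enhance_nodes=100, lam=1e-3, seed=None):
    """Sketch of width-learning-system training: random feature nodes
    Z_i = phi(X We_i + be_i), random enhancement nodes H = xi(Z Wh + bh),
    and the ridge solution W = (A^T A + lam*I)^-1 A^T Y with A = [Z | H]."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]

    # feature nodes
    Zs, feature_params = [], []
    for _ in range(n_feature_groups):
        We = rng.standard_normal((d, nodes_per_group))
        be = rng.standard_normal(nodes_per_group)
        Zs.append(np.tanh(X @ We + be))
        feature_params.append((We, be))
    Z = np.hstack(Zs)

    # enhancement nodes
    Wh = rng.standard_normal((Z.shape[1], n_enhance_nodes))
    bh = rng.standard_normal(n_enhance_nodes)
    H = np.tanh(Z @ Wh + bh)

    # output weights by ridge regression over the joined node matrix A = [Z | H]
    A = np.hstack([Z, H])
    W = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ Y)
    return W, feature_params, (Wh, bh)
```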
The incremental learning specifically comprises the following steps:
incremental learning is achieved by adding enhancement nodes:
the newly added group of enhancement nodes H_{m+1} is denoted as H_{m+1} = ξ(Z^n·W_{h_{m+1}} + β_{h_{m+1}});
the nodes of the new evaluation width learning system are expressed as A_{m+1} = [A_m | H_{m+1}];
according to the pseudo-inverse theory of the block matrix, the pseudo-inverse of A_{m+1} is calculated as A_{m+1}⁺ = [A_m⁺ - D·Bᵀ; Bᵀ], in which D = A_m⁺·H_{m+1}; Bᵀ is the pseudo-inverse of C when C ≠ 0, and Bᵀ = (1 + Dᵀ·D)⁻¹·Dᵀ·A_m⁺ when C = 0; C is the calculated value C = H_{m+1} - A_m·D; Bᵀ is the transpose of the matrix B;
the new connection weight of the network of the evaluation width learning system is W_{m+1} = [W_m - D·Bᵀ·Y; Bᵀ·Y];
incremental learning is realized by adding feature nodes:
the newly added group of feature nodes Z_{n+1} is denoted as Z_{n+1} = φ(X·W_{e_{n+1}} + β_{e_{n+1}});
the corresponding added enhancement nodes are randomly generated as H_ex = ξ(Z_{n+1}·W_ex + β_ex), in which W_ex is a randomly generated random weight matrix with a suitable dimension and β_ex is a randomly generated bias term; the nodes of the new evaluation width learning system are then expressed as A_{n+1} = [A_n | Z_{n+1} | H_ex];
incremental learning is performed on A_{n+1} based on the pseudo-inverse A_n⁺;
according to the pseudo-inverse theory of the block matrix, the pseudo-inverse of A_{n+1} is calculated as A_{n+1}⁺ = [A_n⁺ - D·Bᵀ; Bᵀ], in which D = A_n⁺·[Z_{n+1} | H_ex]; Bᵀ is the pseudo-inverse of C when C ≠ 0, and Bᵀ = (1 + Dᵀ·D)⁻¹·Dᵀ·A_n⁺ when C = 0; C is the calculated value C = [Z_{n+1} | H_ex] - A_n·D; Bᵀ is the transpose of the matrix B;
the new connection weight of the network of the evaluation width learning system is W_{n+1} = [W_n - D·Bᵀ·Y; Bᵀ·Y].
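As a sketch of the enhancement-node branch of this incremental update, the function below applies the block pseudo-inverse formulas above; it assumes the node matrix A, its pseudo-inverse, the current weights W, the targets Y and the new node group H_new are available as NumPy arrays, and the numerical tolerance used to decide the C = 0 case is an assumption.

```python
import numpy as np

def add_enhancement_nodes(A, A_pinv, W, Y, H_new, tol=1e-8):
    """Incremental update when a group of enhancement nodes H_new is appended to
    the node matrix A = [Z^n | H^m], following the block pseudo-inverse formulas."""
    D = A_pinv @ H_new
    C = H_new - A @ D
    if np.linalg.norm(C) > tol:
        # C != 0: B^T is the pseudo-inverse of C
        B_t = np.linalg.pinv(C)
    else:
        # C == 0: B^T = (1 + D^T D)^-1 D^T A^+
        B_t = np.linalg.solve(np.eye(D.shape[1]) + D.T @ D, D.T @ A_pinv)
    A_new = np.hstack([A, H_new])
    A_new_pinv = np.vstack([A_pinv - D @ B_t, B_t])
    W_new = np.vstack([W - D @ (B_t @ Y), B_t @ Y])
    return A_new, A_new_pinv, W_new
```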
By combining a width learning system, the traffic signal lamp control method based on multi-strategy reinforcement learning provided by the invention not only realizes the control of traffic signal lamps at urban intersections, but also has higher reliability, better real-time performance and better accuracy.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention.
FIG. 2 is a schematic diagram of a simulation target of an embodiment of the method of the present invention.
Fig. 3 is a schematic state diagram of an embodiment of the method of the present invention.
FIG. 4 is a schematic diagram illustrating the operation of an embodiment of the method of the present invention.
FIG. 5 is a diagram showing the convergence of the method of the present invention compared with the prior art.
FIG. 6 is a schematic diagram showing the comparison of training time of the method of the present invention with that of the prior art.
Detailed Description
A schematic process flow diagram of the method of the present invention is shown in fig. 1: the invention discloses a traffic signal lamp control method based on multi-strategy reinforcement learning, which comprises the following steps:
s1, acquiring traffic data information at a target traffic signal lamp at the current moment;
s2, according to the data information obtained in the step S1, complexity judgment is carried out by adopting a classification width learning system:
if the system is judged to be a simple system, calculating a control strategy of the traffic signal lamp at the next moment according to the acquired data information, ending the control process of the traffic signal lamp at the current moment, and jumping to the step S4;
if the complex system is determined, continuing the subsequent steps;
the method specifically comprises the following steps:
and (3) according to the data information acquired in the step (S1), adopting a classification width learning system to carry out complexity judgment:
if the system is judged to be a simple system, calculating a control strategy of the traffic signal lamp at the next moment by adopting the Webster algorithm according to the acquired data information; the traffic signal lamp control process at the current moment is ended, and the process jumps to step S4;
if the complex system is determined, continuing the subsequent steps;
s3, calculating an optimal action value at the next moment by adopting the current evaluation width learning system according to the current state information; the traffic signal lamp control process at the current moment is ended, and the process jumps to step S4; the method specifically comprises the following steps:
according to the current state information, a current evaluation width learning system is adopted, and based on the state of the current moment, an optimal action value of the next moment is calculated and obtained, wherein the optimal action value corresponds to a control strategy of a traffic signal lamp;
after the calculation is completed, the traffic signal lamp control process at the current moment is ended, and the process jumps to step S4;
the specific implementation comprises the following contents:
based on the current state information, the current evaluation width learning system Q is adopted, and the optimal action value a* at the next moment is calculated by the following formula: a* = argmax_a Q(s_t, a; θ), in which argmax_a denotes taking the action a with the maximum corresponding Q value; Q(s_t, a; θ) is the Q value of the corresponding action a in the state s_t; s_t is the current state; θ denotes the network parameters of the evaluation width learning system; the optimal action value a* corresponds to the control strategy of the traffic signal lamp;
s4, acquiring state information, control strategies and reward information of the current moment and the historical moments; the method specifically comprises the following steps:
at each moment t, the state information s_{t-1} of the previous moment, the action information a_{t-1}, the reward information r_{t-1} and the state information s_t of the current moment are combined into a tuple (s_{t-1}, a_{t-1}, r_{t-1}, s_t) and stored in the storage buffer;
when the storage buffer area is full, the latest stored state information is used for replacing the earliest stored state information;
in specific implementation, the reward information is calculated by the following steps:
the reward information r_t is obtained by the following calculation formula: r_t = ω1·D_t + ω2·W_t, in which ω1 and ω2 are weight values, and ω1 + ω2 = 1; D_t is the vehicle average waiting time variable, computed from N_t, the total number of waiting vehicles on the road at time t, d_i, the corresponding waiting time of each vehicle, T, the duration of the traffic light in one phase, v_i, the current speed of the vehicle, and v_min, the prescribed minimum speed of the vehicle; W_t is the variable of the longest and the shortest waiting time of the vehicles, computed from w_t, the waiting time of the shortest- or longest-waiting vehicle;
s5, extracting a plurality of pieces of information from the data information obtained in the step S4 to train the evaluation width learning system, and taking the trained evaluation width learning system as the current evaluation width learning system; the method specifically comprises the following steps:
before each round of training, a batch of data with the size of a set value P is extracted from a storage buffer area in a uniform sampling mode and is put into a training pool; during each round of training, training data is obtained from a training pool to perform training; after each round of training, taking the evaluation width learning system after the current round of training as a current evaluation width learning system;
first, training is performed for an evaluation width learning system:
in the training data of the current round, (s_{t-1}, a_{t-1}) taken from the training pool is used as the input X of the system, and the target value y is used as the output Y of the system; s_{t-1} is the state information data at the previous moment, a_{t-1} is the action information data at the previous moment, and y is the training target value information data;
the input X is mapped to the feature space by the following algorithm: Z_i = φ(X·W_{e_i} + β_{e_i}), i = 1, 2, ..., n, in which Z_i is the i-th group of feature nodes; W_{e_i} is a randomly generated random weight matrix with a set dimension; β_{e_i} is a randomly generated bias term; φ is the first nonlinear mapping function;
the feature nodes are mapped to obtain the enhancement nodes by the following formula: H_j = ξ(Z^n·W_{h_j} + β_{h_j}), j = 1, 2, ..., m, in which H_j is the j-th group of enhancement nodes; Z^n = [Z_1, Z_2, ..., Z_n] is the set of the n groups of mapped features; W_{h_j} is a randomly generated random weight matrix; β_{h_j} is a randomly generated bias term; ξ is the second nonlinear mapping function; n is the number of groups of feature nodes;
the feature nodes and the enhancement nodes are connected and imported into the output layer of the system to obtain the output Y of the system: Y = [Z^n | H^m]·W_m, in which H^m = [H_1, H_2, ..., H_m] is the set of the m groups of enhancement nodes; m is the number of groups of enhancement nodes; W_m is the connection weight of the network in the system;
if the training of the evaluation width learning system does not meet the set requirement, incremental learning is performed; the incremental learning comprises adding feature nodes and adding enhancement nodes;
after training for a set number of times, the weight values of the evaluation width learning system are copied to the target width learning system to complete the update of the target width learning system;
the target value y is calculated by the following formula: y = r + γ·Q(s_t, a*; θ′), in which r is the reward information at the current moment; γ is a discount factor; Q(s_t, a*; θ′) is the Q value corresponding to the optimal action a*, evaluated in the target width learning system; θ′ denotes the parameters of the target width learning system.
The optimal action a* is generated in the evaluation network, and the best action in the latest state is calculated by the following calculation formula: a* = argmax_a Q(s_t, a; θ), in which argmax_a denotes taking the action a with the maximum corresponding Q value; Q(s_t, a; θ) is the Q value of the corresponding action a in the state s_t; s_t is the current state; θ denotes the network parameters of the evaluation width learning system; the optimal action a* is then used to calculate the corresponding target Q value in the target width learning system;
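A small sketch of this target-value computation is given below; eval_q and target_q are assumed to be callables mapping a state to the vector of Q values over the action set (one per traffic-signal action), and the discount value is illustrative.

```python
import numpy as np

def td_target(reward, state, eval_q, target_q, gamma=0.9):
    """Best action a* chosen with the evaluation width learning system, its Q value
    read from the target width learning system, then y = r + gamma * Q_target."""
    a_star = int(np.argmax(eval_q(state)))
    return reward + gamma * float(target_q(state)[a_star])
```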
the connection weight of the network in the system is calculated by the following steps:
the nodes of the evaluation width learning system are expressed as A_m = [Z^n | H^m];
the connection weight W_m of the network in the system is calculated by the following formula: W_m = (A_mᵀ·A_m + λ·I)⁻¹·A_mᵀ·Y, in which A_mᵀ is the transpose of A_m; λ is a regularization parameter; I is the identity matrix; Y is the output of the system;
the incremental learning specifically comprises the following steps:
incremental learning is achieved by adding enhancement nodes:
the newly added group of enhancement nodes H_{m+1} is denoted as H_{m+1} = ξ(Z^n·W_{h_{m+1}} + β_{h_{m+1}});
the nodes of the new evaluation width learning system are expressed as A_{m+1} = [A_m | H_{m+1}];
according to the pseudo-inverse theory of the block matrix, the pseudo-inverse of A_{m+1} is calculated as A_{m+1}⁺ = [A_m⁺ - D·Bᵀ; Bᵀ], in which D = A_m⁺·H_{m+1}; Bᵀ is the pseudo-inverse of C when C ≠ 0, and Bᵀ = (1 + Dᵀ·D)⁻¹·Dᵀ·A_m⁺ when C = 0; C is the calculated value C = H_{m+1} - A_m·D; Bᵀ is the transpose of the matrix B;
the new connection weight of the network of the evaluation width learning system is W_{m+1} = [W_m - D·Bᵀ·Y; Bᵀ·Y];
incremental learning is realized by adding feature nodes:
the newly added group of feature nodes Z_{n+1} is denoted as Z_{n+1} = φ(X·W_{e_{n+1}} + β_{e_{n+1}});
the corresponding added enhancement nodes are randomly generated as H_ex = ξ(Z_{n+1}·W_ex + β_ex), in which W_ex is a randomly generated random weight matrix with a suitable dimension and β_ex is a randomly generated bias term; the nodes of the new evaluation width learning system are then expressed as A_{n+1} = [A_n | Z_{n+1} | H_ex];
incremental learning is performed on A_{n+1} based on the pseudo-inverse A_n⁺;
according to the pseudo-inverse theory of the block matrix, the pseudo-inverse of A_{n+1} is calculated as A_{n+1}⁺ = [A_n⁺ - D·Bᵀ; Bᵀ], in which D = A_n⁺·[Z_{n+1} | H_ex]; Bᵀ is the pseudo-inverse of C when C ≠ 0, and Bᵀ = (1 + Dᵀ·D)⁻¹·Dᵀ·A_n⁺ when C = 0; C is the calculated value C = [Z_{n+1} | H_ex] - A_n·D; Bᵀ is the transpose of the matrix B;
the new connection weight of the network of the evaluation width learning system is W_{n+1} = [W_n - D·Bᵀ·Y; Bᵀ·Y];
And S6, repeating the steps S1-S5 in real time to complete the traffic signal lamp control based on multi-strategy reinforcement learning at the target traffic signal lamp.
The process according to the invention is further illustrated by the following examples:
simulating traffic signal lamp control on the SUMO platform, wherein the set environment and traffic rules are very close to the real scene, as shown in FIG. 2; the environment is a traffic environment of a four-intersection, including roads, vehicles and traffic lights; the intersection is provided with 4 lanes, the distance from each lane to the zebra crossing is 75 meters, and the leftmost lane is specially used for left turning; the rightmost side is dedicated to right turn and straight lanes; two middle lanes are dedicated to straight travel. The traffic light system is laid out as follows: the leftmost lane is provided with a special traffic light, and the other three lanes share the traffic light. Vehicles at opposite intersections are simultaneously controlled by a set of traffic lights, for example, the left turn signal lights in the south and north directions are simultaneously green lights, and the other direction signal lights are red lights.
The state information describes the environment information at each moment and is represented as shown in FIG. 3: the first 70 meters of the left-turn lane and of the straight-through lanes (including the rightmost lane shared by right turns and straight travel) are each divided into 10 equal-length cells; if there is a vehicle in a cell, the cell is marked 1, otherwise 0. With 20 cells per approach, the state information of the whole environment covering the 4 approaches is an 80-dimensional vector.
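A sketch of this cell-based encoding is shown below; the input format (a mapping from lane id to vehicle distances from the stop line, in metres) and the use of NumPy are assumptions, while the 70-metre monitored length, the 10 equal cells and the 0/1 occupancy marking follow the description above.

```python
import numpy as np

def encode_state(vehicle_positions, lane_length=70.0, n_cells=10):
    """Cell-based occupancy encoding: each monitored lane section is split into
    n_cells equal-length cells; a cell is 1 if it contains at least one vehicle,
    otherwise 0.  vehicle_positions maps a lane id to a list of distances (m)
    from the stop line; concatenating per-approach vectors yields the 80-dim state."""
    cell_len = lane_length / n_cells
    lanes = []
    for lane_id in sorted(vehicle_positions):
        cells = np.zeros(n_cells, dtype=np.int8)
        for pos in vehicle_positions[lane_id]:
            if 0.0 <= pos < lane_length:
                cells[int(pos // cell_len)] = 1  # occupied cell
        lanes.append(cells)
    return np.concatenate(lanes) if lanes else np.zeros(0, dtype=np.int8)
```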
In this embodiment, the predefined actions are releasing straight-through traffic or left-turn traffic for a pair of opposite directions, as shown in FIG. 4;
during the simulation experiments, the input data set comprises real-world traffic flow data and randomly generated traffic flow data; each experiment is set to 100 rounds, the initial strategy is set to a completely random strategy, and the probability of random selection decreases by 0.01 as each round is added; by the last round, the probability of random selection has decreased to 0 and action selection relies entirely on the learned strategy;
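One plausible reading of this exploration schedule, expressed as a small helper, is shown below; the exact rounding at the final rounds is an assumption.

```python
def exploration_probability(round_index, step=0.01):
    """Probability of choosing a random action at a given training round: starts
    at 1.0 (completely random strategy) and decreases by `step` per round,
    reaching 0 by the last of the 100 rounds (per the embodiment's description)."""
    return max(0.0, 1.0 - step * round_index)
```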
when comparison experiments are carried out, the method of the invention is compared with the existing DQN (Deep Q-Network), BQN (Broad Q-Network) and DDQN (Double Deep Q-Network) methods; the DQN method was proposed by Volodymyr Mnih, Koray Kavukcuoglu et al. in the paper "Human-level control through deep reinforcement learning" published in Nature in 2015, and realizes reinforcement learning trained with a deep network by combining deep networks and reinforcement learning; the BQN method was proposed by Xin Wei, Jialin Zhao et al. in the paper "Broad Reinforcement Learning for Supporting Fast Autonomous IoT" published in IEEE Internet of Things Journal in 2020, and replaces the deep network with a BLS (Broad Learning System, width learning system), thereby alleviating the problem of long training time; the DDQN method is based on the DQN method and, in order to prevent overestimation of the target Q value during training, performs action selection and Q value estimation separately in two networks;
in the comparison, algorithm performance is evaluated in terms of reward and training time; the "reward" index is an evaluation index calculated mainly from the waiting time of vehicles and reflects the effectiveness of the algorithm; the higher the index value, the better the performance of the algorithm; the "training time" index is a key index reflecting the learning efficiency of the algorithm and records the time required for the algorithm to reach the predetermined effect from the beginning of learning; the shorter the training time, the higher the efficiency of the algorithm.
The method of the invention is compared with the DQN method, the BQN method and the Double DQN (DDQN) method under the same environment, and the comparison data for the reward index are shown in FIG. 5; as can be seen from FIG. 5, as the number of training rounds increases, the reward index of the method of the invention is consistently better than that of the existing methods, demonstrating that the method of the invention has better convergence performance.
The method of the invention is compared with the DQN method and the Double DQN method under the same environment, and the comparison data for the training time index are shown in FIG. 6; as can be seen from FIG. 6, as the number of training rounds increases, the training time of the method of the invention is consistently lower than that of the existing methods, so the method of the invention has higher efficiency, a faster learning speed and a faster system response speed.

Claims (10)

1. A traffic signal lamp control method based on multi-strategy reinforcement learning is characterized by comprising the following steps:
s1, acquiring traffic data information at a target traffic signal lamp at the current moment;
s2, according to the data information obtained in the step S1, complexity judgment is carried out by adopting a classification width learning system:
if the system is judged to be a simple system, calculating a control strategy of the traffic signal lamp at the next moment according to the acquired data information, ending the control process of the traffic signal lamp at the current moment, and jumping to the step S4;
if the complex system is determined, continuing the subsequent steps;
s3, calculating an optimal action value at the next moment by adopting the current evaluation width learning system according to the current state information; the traffic signal lamp control process at the current moment is ended, and the process jumps to step S4;
s4, acquiring state information, control strategies and reward information of the current moment and the historical moments;
s5, extracting a plurality of pieces of information from the data information obtained in the step S4 to train the evaluation width learning system, and taking the trained evaluation width learning system as the current evaluation width learning system;
and S6, repeating the steps S1-S5 in real time to complete the traffic signal lamp control based on multi-strategy reinforcement learning at the target traffic signal lamp.
2. The traffic light control method based on multi-strategy reinforcement learning according to claim 1, wherein the step S2 comprises the following steps:
according to the data information acquired in step S1, a classification width learning system is adopted to carry out complexity judgment:
if the system is judged to be a simple system, calculating a control strategy of the traffic signal lamp at the next moment by adopting the Webster algorithm according to the acquired data information; the traffic signal lamp control process at the current moment is ended, and the process jumps to step S4;
if the complex system is determined, the following steps are continued.
3. The traffic light control method based on multi-strategy reinforcement learning according to claim 2, wherein the step S3 comprises the following steps:
according to the current state information, a current evaluation width learning system is adopted, and based on the state of the current moment, an optimal action value of the next moment is calculated and obtained, wherein the optimal action value corresponds to a control strategy of a traffic signal lamp;
after the calculation is completed, the traffic signal lamp control process at the current moment is ended, and the process jumps to step S4.
4. The traffic light control method based on multi-strategy reinforcement learning according to claim 3, wherein the step S3 specifically comprises the following steps:
based on the current state information, a current evaluation width learning system is adoptedThe optimal action value +_for the next moment is calculated by the following formula>: />In->For action->A corresponding maximum value; />Status +.>Lower corresponding action->Q value of (2); />Is the current state; />To evaluate network parameters in a breadth-learning system; optimal action value->Corresponding to the control strategy of the traffic light.
5. The traffic light control method based on multi-strategy reinforcement learning according to claim 4, wherein the step S4 comprises the following steps:
at each moment t, the state information s_{t-1} of the previous moment, the action information a_{t-1}, the reward information r_{t-1} and the state information s_t of the current moment are combined into a tuple (s_{t-1}, a_{t-1}, r_{t-1}, s_t) and stored in the storage buffer;
when the memory buffer is full, the earliest stored state information is replaced with the latest stored state information.
6. The traffic light control method based on multi-strategy reinforcement learning according to claim 5, wherein the reward information is calculated by the following steps:
the reward information r_t is obtained by the following calculation formula: r_t = ω1·D_t + ω2·W_t, in which ω1 and ω2 are weight values, and ω1 + ω2 = 1; D_t is the vehicle average waiting time variable, computed from N_t, the total number of waiting vehicles on the road at time t, d_i, the corresponding waiting time of each vehicle, T, the duration of the traffic light in one phase, v_i, the current speed of the vehicle, and v_min, the prescribed minimum speed of the vehicle; W_t is the variable of the longest and the shortest waiting time of the vehicles, computed from w_t, the waiting time of the shortest- or longest-waiting vehicle.
7. The traffic light control method based on multi-strategy reinforcement learning according to claim 6, wherein the step S5 comprises the following steps:
before each round of training, a batch of data with the size of a set value P is extracted from a storage buffer area in a uniform sampling mode and is put into a training pool; during each round of training, training data is obtained from a training pool to perform training; after each round of training, taking the evaluation width learning system after the current round of training as a current evaluation width learning system;
first, training is performed for an evaluation width learning system:
in the training data of the current round, (s_{t-1}, a_{t-1}) taken from the training pool is used as the input X of the system, and the target value y is used as the output Y of the system; s_{t-1} is the state information data at the previous moment, a_{t-1} is the action information data at the previous moment, and y is the training target value information data;
the input X is mapped to the feature space by the following algorithm: Z_i = φ(X·W_{e_i} + β_{e_i}), i = 1, 2, ..., n, in which Z_i is the i-th group of feature nodes; W_{e_i} is a randomly generated random weight matrix with a set dimension; β_{e_i} is a randomly generated bias term; φ is the first nonlinear mapping function;
the feature nodes are mapped to obtain the enhancement nodes by the following formula: H_j = ξ(Z^n·W_{h_j} + β_{h_j}), j = 1, 2, ..., m, in which H_j is the j-th group of enhancement nodes; Z^n = [Z_1, Z_2, ..., Z_n] is the set of the n groups of mapped features; W_{h_j} is a randomly generated random weight matrix; β_{h_j} is a randomly generated bias term; ξ is the second nonlinear mapping function; n is the number of groups of feature nodes;
the feature nodes and the enhancement nodes are connected and imported into the output layer of the system to obtain the output Y of the system: Y = [Z^n | H^m]·W_m, in which H^m = [H_1, H_2, ..., H_m] is the set of the m groups of enhancement nodes; m is the number of groups of enhancement nodes; W_m is the connection weight of the network in the system;
if the training of the evaluation width learning system does not meet the set requirement, incremental learning is performed; the incremental learning comprises adding feature nodes and adding enhancement nodes;
after training for a set number of times, the weight values of the evaluation width learning system are copied to the target width learning system to complete the update of the target width learning system.
8. The traffic light control method based on multi-strategy reinforcement learning according to claim 7, wherein the target value y is calculated by the following formula: y = r + γ·Q(s_t, a*; θ′), in which r is the reward information at the current moment; γ is a discount factor; Q(s_t, a*; θ′) is the Q value corresponding to the optimal action a*, evaluated in the target width learning system; θ′ denotes the parameters of the target width learning system.
9. The traffic light control method based on multi-strategy reinforcement learning according to claim 8, wherein the connection weight of the network in the system is calculated by the following steps:
the nodes of the evaluation width learning system are expressed as A_m = [Z^n | H^m];
the connection weight W_m of the network in the system is calculated by the following formula: W_m = (A_mᵀ·A_m + λ·I)⁻¹·A_mᵀ·Y, in which A_mᵀ is the transpose of A_m; λ is a regularization parameter; I is the identity matrix; Y is the output of the system.
10. The traffic light control method based on multi-strategy reinforcement learning according to claim 9, wherein the incremental learning comprises the steps of:
incremental learning is achieved by adding enhancement nodes:
the newly added group of enhancement nodes H_{m+1} is denoted as H_{m+1} = ξ(Z^n·W_{h_{m+1}} + β_{h_{m+1}});
the nodes of the new evaluation width learning system are expressed as A_{m+1} = [A_m | H_{m+1}];
according to the pseudo-inverse theory of the block matrix, the pseudo-inverse of A_{m+1} is calculated as A_{m+1}⁺ = [A_m⁺ - D·Bᵀ; Bᵀ], in which D = A_m⁺·H_{m+1}; Bᵀ is the pseudo-inverse of C when C ≠ 0, and Bᵀ = (1 + Dᵀ·D)⁻¹·Dᵀ·A_m⁺ when C = 0; C is the calculated value C = H_{m+1} - A_m·D; Bᵀ is the transpose of the matrix B;
the new connection weight of the network of the evaluation width learning system is W_{m+1} = [W_m - D·Bᵀ·Y; Bᵀ·Y];
incremental learning is realized by adding feature nodes:
the newly added group of feature nodes Z_{n+1} is denoted as Z_{n+1} = φ(X·W_{e_{n+1}} + β_{e_{n+1}});
the corresponding added enhancement nodes are randomly generated as H_ex = ξ(Z_{n+1}·W_ex + β_ex), in which W_ex is a randomly generated random weight matrix with a suitable dimension and β_ex is a randomly generated bias term; the nodes of the new evaluation width learning system are then expressed as A_{n+1} = [A_n | Z_{n+1} | H_ex];
incremental learning is performed on A_{n+1} based on the pseudo-inverse A_n⁺;
according to the pseudo-inverse theory of the block matrix, the pseudo-inverse of A_{n+1} is calculated as A_{n+1}⁺ = [A_n⁺ - D·Bᵀ; Bᵀ], in which D = A_n⁺·[Z_{n+1} | H_ex]; Bᵀ is the pseudo-inverse of C when C ≠ 0, and Bᵀ = (1 + Dᵀ·D)⁻¹·Dᵀ·A_n⁺ when C = 0; C is the calculated value C = [Z_{n+1} | H_ex] - A_n·D; Bᵀ is the transpose of the matrix B;
the new connection weight of the network of the evaluation width learning system is W_{n+1} = [W_n - D·Bᵀ·Y; Bᵀ·Y].
CN202311050477.6A 2023-08-21 2023-08-21 Traffic signal lamp control method based on multi-strategy reinforcement learning Active CN116758767B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311050477.6A CN116758767B (en) 2023-08-21 2023-08-21 Traffic signal lamp control method based on multi-strategy reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311050477.6A CN116758767B (en) 2023-08-21 2023-08-21 Traffic signal lamp control method based on multi-strategy reinforcement learning

Publications (2)

Publication Number Publication Date
CN116758767A CN116758767A (en) 2023-09-15
CN116758767B true CN116758767B (en) 2023-10-20

Family

ID=87953777

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311050477.6A Active CN116758767B (en) 2023-08-21 2023-08-21 Traffic signal lamp control method based on multi-strategy reinforcement learning

Country Status (1)

Country Link
CN (1) CN116758767B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117933499A (en) * 2024-03-22 2024-04-26 中国铁建电气化局集团有限公司 Invasion risk prediction method, device and storage medium for high-speed railway catenary

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103559795A (en) * 2013-11-07 2014-02-05 青岛海信网络科技股份有限公司 Multi-strategy and multi-object self-adaptation traffic control method
CN106778853A (en) * 2016-12-07 2017-05-31 中南大学 Unbalanced data sorting technique based on weight cluster and sub- sampling
CN111243271A (en) * 2020-01-11 2020-06-05 多伦科技股份有限公司 Single-point intersection signal control method based on deep cycle Q learning
CN111696345A (en) * 2020-05-08 2020-09-22 东南大学 Intelligent coupled large-scale data flow width learning rapid prediction algorithm based on network community detection and GCN
CN111696370A (en) * 2020-06-16 2020-09-22 西安电子科技大学 Traffic light control method based on heuristic deep Q network

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2583747B (en) * 2019-05-08 2023-12-06 Vivacity Labs Ltd Traffic control system
CA3162665A1 (en) * 2021-06-14 2022-12-14 The Governing Council Of The University Of Toronto Method and system for traffic signal control with a learned model

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103559795A (en) * 2013-11-07 2014-02-05 青岛海信网络科技股份有限公司 Multi-strategy and multi-object self-adaptation traffic control method
CN106778853A (en) * 2016-12-07 2017-05-31 中南大学 Unbalanced data sorting technique based on weight cluster and sub- sampling
CN111243271A (en) * 2020-01-11 2020-06-05 多伦科技股份有限公司 Single-point intersection signal control method based on deep cycle Q learning
CN111696345A (en) * 2020-05-08 2020-09-22 东南大学 Intelligent coupled large-scale data flow width learning rapid prediction algorithm based on network community detection and GCN
CN111696370A (en) * 2020-06-16 2020-09-22 西安电子科技大学 Traffic light control method based on heuristic deep Q network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Ruijie Zhu et al. Multi-agent broad reinforcement learning for intelligent traffic light control. Information Sciences. 2022, 509-525. *
Song Jiong et al. Analysis of the application of the Q-learning algorithm in multi-intersection traffic signal control models. Value Engineering. 2012, 136-137. *

Also Published As

Publication number Publication date
CN116758767A (en) 2023-09-15

Similar Documents

Publication Publication Date Title
CN111260937B (en) Cross traffic signal lamp control method based on reinforcement learning
CN111696370B (en) Traffic light control method based on heuristic deep Q network
CN109215355A (en) A kind of single-point intersection signal timing optimization method based on deeply study
CN111260118B (en) Vehicle networking traffic flow prediction method based on quantum particle swarm optimization strategy
CN110570672B (en) Regional traffic signal lamp control method based on graph neural network
CN114038212B (en) Signal lamp control method based on two-stage attention mechanism and deep reinforcement learning
CN116758767B (en) Traffic signal lamp control method based on multi-strategy reinforcement learning
CN113963555B (en) Depth combined with state prediction control method for reinforcement learning traffic signal
CN113538910A (en) Self-adaptive full-chain urban area network signal control optimization method
CN113223305A (en) Multi-intersection traffic light control method and system based on reinforcement learning and storage medium
CN104766485A (en) Traffic light optimization time distribution method based on improved fuzzy control
CN115578870B (en) Traffic signal control method based on near-end policy optimization
CN115691167A (en) Single-point traffic signal control method based on intersection holographic data
CN114120670B (en) Method and system for traffic signal control
CN113409576B (en) Bayesian network-based traffic network dynamic prediction method and system
CN110021168B (en) Grading decision method for realizing real-time intelligent traffic management under Internet of vehicles
CN115331460B (en) Large-scale traffic signal control method and device based on deep reinforcement learning
CN116824848A (en) Traffic signal optimization control method based on Bayesian deep Q network
CN115472023A (en) Intelligent traffic light control method and device based on deep reinforcement learning
CN113077642B (en) Traffic signal lamp control method and device and computer readable storage medium
CN115762128A (en) Deep reinforcement learning traffic signal control method based on self-attention mechanism
CN115512558A (en) Traffic light signal control method based on multi-agent reinforcement learning
CN108597239B (en) Traffic light control system and method based on Markov decision
CN113487870A (en) Method for generating anti-disturbance to intelligent single intersection based on CW (continuous wave) attack
CN112766533A (en) Shared bicycle demand prediction method based on multi-strategy improved GWO _ BP neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant