CN116758767B - Traffic signal lamp control method based on multi-strategy reinforcement learning - Google Patents

Traffic signal lamp control method based on multi-strategy reinforcement learning

Info

Publication number
CN116758767B
CN116758767B (Application No. CN202311050477.6A)
Authority
CN
China
Prior art keywords
information
training
signal lamp
learning system
current
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311050477.6A
Other languages
Chinese (zh)
Other versions
CN116758767A (en)
Inventor
邓晓衡
尹顺梦
桂劲松
万少华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University filed Critical Central South University
Priority to CN202311050477.6A priority Critical patent/CN116758767B/en
Publication of CN116758767A publication Critical patent/CN116758767A/en
Application granted granted Critical
Publication of CN116758767B publication Critical patent/CN116758767B/en
Legal status: Active

Classifications

    • G08G1/07 Traffic control systems for road vehicles; controlling traffic signals
    • G06N20/00 Computing arrangements based on specific computational models; machine learning
    • G08G1/0125 Measuring and analyzing of parameters relative to traffic conditions; traffic data processing
    • G08G1/0137 Measuring and analyzing of parameters relative to traffic conditions for specific applications
    • G08G1/0145 Measuring and analyzing of parameters relative to traffic conditions for specific applications; for active traffic flow control


Abstract

The invention discloses a traffic signal lamp control method based on multi-strategy reinforcement learning, which comprises the steps of: obtaining traffic data information of a target traffic signal lamp at the current moment; performing complexity judgment with a classification width learning system; calculating the optimal action value at the next moment with the current evaluation width learning system; acquiring state information and control strategies of the current moment and historical moments; training the evaluation width learning system; and repeating the above traffic signal lamp control based on multi-strategy reinforcement learning at the target traffic signal lamp in real time. By combining a width learning system, the invention provides a novel traffic signal lamp control method that not only realizes the control of traffic signal lamps at urban intersections, but also has high reliability, good real-time performance and good accuracy.

Description

Traffic signal lamp control method based on multi-strategy reinforcement learning
Technical Field
The invention belongs to the technical field of traffic control systems, and particularly relates to a traffic signal lamp control method based on multi-strategy reinforcement learning.
Background
With economic and technological development and the continuous improvement of living standards, traffic congestion has become an increasingly serious problem. Methods for alleviating traffic congestion are therefore of great significance.
At present there are two main approaches to relieving traffic congestion: first, building new roads and improving infrastructure; second, controlling traffic signal lamps with artificial intelligence schemes.
Researchers have proposed a large number of traffic signal control schemes based on artificial intelligence to optimize traffic signal control strategies. Schemes based on Model Predictive Control (MPC) monitor road traffic flow in real time, feed the data into a predictive model, and adjust the signal period and phase according to the prediction result; although such schemes improve traffic efficiency to a certain extent, the dynamic and stochastic nature of urban road traffic leaves them with poor accuracy and reliability. There are also methods that control traffic lights from real-time traffic information, such as dynamic control based on deep reinforcement learning; an agent interacts with the environment and is trained with a deep neural network so that an optimized control strategy is gradually learned; however, the learning process of this type of scheme is slow and cannot meet real-time control requirements.
Disclosure of Invention
The invention aims to provide a traffic signal lamp control method based on multi-strategy reinforcement learning, which has high reliability, good real-time performance and good accuracy.
The traffic signal lamp control method based on multi-strategy reinforcement learning provided by the invention comprises the following steps:
s1, acquiring traffic data information at a target traffic signal lamp at the current moment;
s2, according to the data information obtained in the step S1, complexity judgment is carried out by adopting a classification width learning system:
if the system is judged to be a simple system, calculating a control strategy of the traffic signal lamp at the next moment according to the acquired data information, ending the control process of the traffic signal lamp at the current moment, and jumping to the step S4;
if the complex system is determined, continuing the subsequent steps;
s3, calculating an optimal action value at the next moment by adopting the current evaluation width learning system according to the current state information; the traffic signal lamp control process at the current moment is ended, and the process jumps to step S4;
s4, acquiring state information, control strategies and reward information of the current moment and the historical moments;
s5, extracting a plurality of pieces of information from the data information obtained in the step S4 to train the evaluation width learning system, and taking the trained evaluation width learning system as the current evaluation width learning system;
and S6, repeating the steps S1-S5 in real time to complete the traffic signal lamp control based on multi-strategy reinforcement learning at the target traffic signal lamp.
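For illustration only, the following Python sketch shows one pass through steps S1 to S5 (step S6 repeats it in real time); every object and function name in it (control_step, get_traffic_data, webster_plan, best_action and so on) is a hypothetical placeholder and not part of the patented implementation.

```python
# Illustrative sketch of one control cycle (steps S1-S5); S6 repeats it in real time.
# All object and function names are hypothetical placeholders.

def control_step(junction, classifier_bls, eval_bls, webster_plan,
                 replay_buffer, trainer):
    # S1: obtain traffic data at the target traffic signal lamp
    data = junction.get_traffic_data()

    # S2: complexity judgment with the classification width learning system
    if classifier_bls.predict(data) == "simple":
        # simple system: next-moment control strategy from the Webster algorithm
        plan = webster_plan(data)
    else:
        # S3: complex system: optimal action from the evaluation width learning system
        state = junction.get_state_vector()
        plan = eval_bls.best_action(state)
    junction.apply(plan)

    # S4: store (previous state, action, reward, current state) in the buffer
    replay_buffer.add(*junction.last_transition())

    # S5: train the evaluation width learning system on sampled transitions
    trainer.train_round(eval_bls, replay_buffer)
```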
The step S2 specifically comprises the following steps:
according to the data information acquired in step S1, a classification width learning system is adopted to carry out complexity judgment:
if the system is judged to be a simple system, calculating a control strategy of the traffic signal lamp at the next moment by adopting the Webster algorithm according to the acquired data information; the traffic signal lamp control process at the current moment is ended, and the process jumps to step S4;
if the complex system is determined, the following steps are continued.
The step S3 specifically comprises the following steps:
according to the current state information, a current evaluation width learning system is adopted, and based on the state of the current moment, an optimal action value of the next moment is calculated and obtained, wherein the optimal action value corresponds to a control strategy of a traffic signal lamp;
after the calculation is completed, the traffic signal lamp control process at the current moment is ended, and the process jumps to step S4.
The step S3 specifically comprises the following steps:
based on the current state information, a current evaluation width learning system is adoptedThe optimal action value +_for the next moment is calculated by the following formula>In->For action->A corresponding maximum value;status +.>Lower corresponding action->Q value of (2); />Is the current state; />To evaluate network parameters in a breadth-learning system; optimal action value->Corresponding to the control strategy of the traffic light.
The step S4 specifically comprises the following steps:
at each moment t, the state information s_{t-1} of the previous moment, the action information a_{t-1}, the reward information r_{t-1} and the state information s_t of the current moment are combined into a tuple (s_{t-1}, a_{t-1}, r_{t-1}, s_t) and stored in the storage buffer;
when the memory buffer is full, the earliest stored state information is replaced with the latest stored state information.
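A minimal sketch of such a storage buffer is given below; the capacity value and the use of Python's deque are assumptions, not part of the invention. The deque drops the earliest transition automatically once it is full, and uniform sampling matches the way the training pool is filled in step S5.

```python
from collections import deque
import random

class ReplayBuffer:
    """Fixed-size buffer of (s_prev, a_prev, r_prev, s_curr) tuples; when full,
    the earliest stored transition is overwritten by the newest one."""

    def __init__(self, capacity=10_000):  # capacity is an illustrative choice
        self.storage = deque(maxlen=capacity)

    def add(self, s_prev, a_prev, r_prev, s_curr):
        self.storage.append((s_prev, a_prev, r_prev, s_curr))

    def sample(self, batch_size):
        # uniform sampling, as used to fill the training pool before each round
        return random.sample(list(self.storage), min(batch_size, len(self.storage)))
```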
The reward information is specifically obtained by the following steps:
the reward information r_t is obtained by the following calculation formula: r_t = ω1·D_t + ω2·W_t, in which ω1 and ω2 are weight values, and ω1 + ω2 = 1; D_t is the vehicle average waiting time variable, computed from N_t, the total number of waiting vehicles on the road at time t, d_i, the corresponding waiting time of each vehicle, T, the duration of the traffic light in one phase, v_i, the current speed of the vehicle, and v_min, the prescribed minimum speed of the vehicle; W_t is the variable of the longest and the shortest waiting time of the vehicles, computed from w_t, the waiting time of the shortest- or longest-waiting vehicle.
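Because the per-term formulas above are expressed in terms of N_t, d_i, T, v_i, v_min and w_t, the sketch below only assumes a plausible form for each term (negative mean waiting time of slow-moving vehicles and negative waiting time of the longest-waiting vehicle); it illustrates the weighted structure r = ω1·D + ω2·W rather than the exact patented formula.

```python
def reward(waiting_times, speeds, w1=0.5, w2=0.5, v_min=0.1):
    """Hedged sketch of the weighted reward r = w1*D + w2*W with w1 + w2 = 1.

    Assumed term forms (the patent's exact expressions are not reproduced here):
    D is the negative mean waiting time of vehicles slower than v_min, and W is
    the negative waiting time of the longest-waiting vehicle, so shorter waits
    yield a larger reward.
    """
    waiting = [t for t, v in zip(waiting_times, speeds) if v < v_min]
    if not waiting:
        return 0.0
    d_term = -sum(waiting) / len(waiting)   # average-waiting-time component
    w_term = -max(waiting)                  # longest-waiting-vehicle component
    return w1 * d_term + w2 * w_term
```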
The step S5 specifically comprises the following steps:
before each round of training, a batch of data with the size of a set value P is extracted from a storage buffer area in a uniform sampling mode and is put into a training pool; during each round of training, training data is obtained from a training pool to perform training; after each round of training, taking the evaluation width learning system after the current round of training as a current evaluation width learning system;
first, training is performed for an evaluation width learning system:
in the training data of the current round, (s_{t-1}, a_{t-1}) taken from the training pool is used as the input X of the system, and the target value y is used as the output Y of the system; s_{t-1} is the state information data at the previous moment, a_{t-1} is the action information data at the previous moment, and y is the training target value information data;
the input X is mapped to the feature space by the following algorithm: Z_i = φ(X·W_{e_i} + β_{e_i}), i = 1, 2, ..., n, in which Z_i is the i-th group of feature nodes; W_{e_i} is a randomly generated random weight matrix with a set dimension; β_{e_i} is a randomly generated bias term; φ is the first nonlinear mapping function;
the feature nodes are mapped to obtain the enhancement nodes by the following formula: H_j = ξ(Z^n·W_{h_j} + β_{h_j}), j = 1, 2, ..., m, in which H_j is the j-th group of enhancement nodes; Z^n = [Z_1, Z_2, ..., Z_n] is the set of the n groups of mapped features; W_{h_j} is a randomly generated random weight matrix; β_{h_j} is a randomly generated bias term; ξ is the second nonlinear mapping function; n is the number of groups of feature nodes;
the feature nodes and the enhancement nodes are connected and imported into the output layer of the system to obtain the output Y of the system: Y = [Z^n | H^m]·W_m, in which H^m = [H_1, H_2, ..., H_m] is the set of the m groups of enhancement nodes; m is the number of groups of enhancement nodes; W_m is the connection weight of the network in the system;
if the training of the evaluation width learning system does not meet the set requirement, incremental learning is performed; the incremental learning comprises adding feature nodes and adding enhancement nodes;
after training for a set number of times, the weight values of the evaluation width learning system are copied to the target width learning system to complete the update of the target width learning system.
The target value y is calculated by the following formula: y = r + γ·Q(s_t, a*; θ′), in which r is the reward information at the current moment; γ is a discount factor; Q(s_t, a*; θ′) is the Q value corresponding to the optimal action a*, evaluated in the target width learning system; θ′ denotes the parameters of the target width learning system.
The connection weight of the network in the system is calculated by the following steps:
the nodes of the evaluation width learning system are expressed as A_m = [Z^n | H^m];
the connection weight W_m of the network in the system is calculated by the following formula: W_m = (A_mᵀ·A_m + λ·I)⁻¹·A_mᵀ·Y, in which A_mᵀ is the transpose of A_m; λ is a regularization parameter; I is the identity matrix; Y is the output of the system.
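The following NumPy sketch puts the feature-node mapping, enhancement-node mapping and ridge-regression weight solution together; the node counts, the use of tanh for both nonlinear mappings φ and ξ, and the regularization value are assumptions, since the invention leaves these choices open.

```python
import numpy as np

def train_bls(X, Y, n_feature_groups=10, nodes_per_group=20,
              n_enhance_nodes=100, lam=1e-3, seed=None):
    """Sketch of width-learning-system training: random feature nodes
    Z_i = phi(X We_i + be_i), random enhancement nodes H = xi(Z Wh + bh),
    and the ridge solution W = (A^T A + lam*I)^-1 A^T Y with A = [Z | H]."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]

    # feature nodes
    Zs, feature_params = [], []
    for _ in range(n_feature_groups):
        We = rng.standard_normal((d, nodes_per_group))
        be = rng.standard_normal(nodes_per_group)
        Zs.append(np.tanh(X @ We + be))
        feature_params.append((We, be))
    Z = np.hstack(Zs)

    # enhancement nodes
    Wh = rng.standard_normal((Z.shape[1], n_enhance_nodes))
    bh = rng.standard_normal(n_enhance_nodes)
    H = np.tanh(Z @ Wh + bh)

    # output weights by ridge regression over the joined node matrix A = [Z | H]
    A = np.hstack([Z, H])
    W = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ Y)
    return W, feature_params, (Wh, bh)
```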
The incremental learning specifically comprises the following steps:
incremental learning is achieved by adding enhancement nodes:
the newly added group of enhancement nodes H_{m+1} is denoted as H_{m+1} = ξ(Z^n·W_{h_{m+1}} + β_{h_{m+1}});
the nodes of the new evaluation width learning system are expressed as A_{m+1} = [A_m | H_{m+1}];
according to the pseudo-inverse theory of the block matrix, the pseudo-inverse of A_{m+1} is calculated as A_{m+1}⁺ = [A_m⁺ - D·Bᵀ; Bᵀ], in which D = A_m⁺·H_{m+1}; Bᵀ is the pseudo-inverse of C when C ≠ 0, and Bᵀ = (1 + Dᵀ·D)⁻¹·Dᵀ·A_m⁺ when C = 0; C is the calculated value C = H_{m+1} - A_m·D; Bᵀ is the transpose of the matrix B;
the new connection weight of the network of the evaluation width learning system is W_{m+1} = [W_m - D·Bᵀ·Y; Bᵀ·Y];
incremental learning is realized by adding feature nodes:
the newly added group of feature nodes Z_{n+1} is denoted as Z_{n+1} = φ(X·W_{e_{n+1}} + β_{e_{n+1}});
the corresponding added enhancement nodes are randomly generated as H_ex = ξ(Z_{n+1}·W_ex + β_ex), in which W_ex is a randomly generated random weight matrix with a suitable dimension and β_ex is a randomly generated bias term; the nodes of the new evaluation width learning system are then expressed as A_{n+1} = [A_n | Z_{n+1} | H_ex];
incremental learning is performed on A_{n+1} based on the pseudo-inverse A_n⁺;
according to the pseudo-inverse theory of the block matrix, the pseudo-inverse of A_{n+1} is calculated as A_{n+1}⁺ = [A_n⁺ - D·Bᵀ; Bᵀ], in which D = A_n⁺·[Z_{n+1} | H_ex]; Bᵀ is the pseudo-inverse of C when C ≠ 0, and Bᵀ = (1 + Dᵀ·D)⁻¹·Dᵀ·A_n⁺ when C = 0; C is the calculated value C = [Z_{n+1} | H_ex] - A_n·D; Bᵀ is the transpose of the matrix B;
the new connection weight of the network of the evaluation width learning system is W_{n+1} = [W_n - D·Bᵀ·Y; Bᵀ·Y].
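As a sketch of the enhancement-node branch of this incremental update, the function below applies the block pseudo-inverse formulas above; it assumes the node matrix A, its pseudo-inverse, the current weights W, the targets Y and the new node group H_new are available as NumPy arrays, and the numerical tolerance used to decide the C = 0 case is an assumption.

```python
import numpy as np

def add_enhancement_nodes(A, A_pinv, W, Y, H_new, tol=1e-8):
    """Incremental update when a group of enhancement nodes H_new is appended to
    the node matrix A = [Z^n | H^m], following the block pseudo-inverse formulas."""
    D = A_pinv @ H_new
    C = H_new - A @ D
    if np.linalg.norm(C) > tol:
        # C != 0: B^T is the pseudo-inverse of C
        B_t = np.linalg.pinv(C)
    else:
        # C == 0: B^T = (1 + D^T D)^-1 D^T A^+
        B_t = np.linalg.solve(np.eye(D.shape[1]) + D.T @ D, D.T @ A_pinv)
    A_new = np.hstack([A, H_new])
    A_new_pinv = np.vstack([A_pinv - D @ B_t, B_t])
    W_new = np.vstack([W - D @ (B_t @ Y), B_t @ Y])
    return A_new, A_new_pinv, W_new
```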
By combining a width learning system, the traffic signal lamp control method based on multi-strategy reinforcement learning provided by the invention not only realizes the control of traffic signal lamps at urban intersections, but also has higher reliability, better real-time performance and better accuracy.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention.
FIG. 2 is a schematic diagram of a simulation target of an embodiment of the method of the present invention.
Fig. 3 is a schematic state diagram of an embodiment of the method of the present invention.
FIG. 4 is a schematic diagram illustrating the operation of an embodiment of the method of the present invention.
FIG. 5 is a diagram showing the convergence of the method of the present invention compared with the prior art.
FIG. 6 is a schematic diagram showing the comparison of training time of the method of the present invention with that of the prior art.
Detailed Description
A schematic process flow diagram of the method of the present invention is shown in fig. 1: the invention discloses a traffic signal lamp control method based on multi-strategy reinforcement learning, which comprises the following steps:
s1, acquiring traffic data information at a target traffic signal lamp at the current moment;
s2, according to the data information obtained in the step S1, complexity judgment is carried out by adopting a classification width learning system:
if the system is judged to be a simple system, calculating a control strategy of the traffic signal lamp at the next moment according to the acquired data information, ending the control process of the traffic signal lamp at the current moment, and jumping to the step S4;
if the complex system is determined, continuing the subsequent steps;
the method specifically comprises the following steps:
and (3) according to the data information acquired in the step (S1), adopting a classification width learning system to carry out complexity judgment:
if the system is judged to be a simple system, calculating a control strategy of the traffic signal lamp at the next moment by adopting the Webster algorithm according to the acquired data information; the traffic signal lamp control process at the current moment is ended, and the process jumps to step S4;
if the complex system is determined, continuing the subsequent steps;
s3, calculating an optimal action value at the next moment by adopting the current evaluation width learning system according to the current state information; the traffic signal lamp control process at the current moment is ended, and the process jumps to step S4; the method specifically comprises the following steps:
according to the current state information, a current evaluation width learning system is adopted, and based on the state of the current moment, an optimal action value of the next moment is calculated and obtained, wherein the optimal action value corresponds to a control strategy of a traffic signal lamp;
after the calculation is completed, the traffic signal lamp control process at the current moment is ended, and the process jumps to step S4;
the specific implementation comprises the following contents:
based on the current state information, the current evaluation width learning system Q is adopted, and the optimal action value a* at the next moment is calculated by the following formula: a* = argmax_a Q(s_t, a; θ), in which argmax_a denotes taking the action a with the maximum corresponding Q value; Q(s_t, a; θ) is the Q value of the corresponding action a in the state s_t; s_t is the current state; θ denotes the network parameters of the evaluation width learning system; the optimal action value a* corresponds to the control strategy of the traffic signal lamp;
s4, acquiring state information, control strategies and reward information of the current moment and the historical moments; the method specifically comprises the following steps:
at each moment t, the state information s_{t-1} of the previous moment, the action information a_{t-1}, the reward information r_{t-1} and the state information s_t of the current moment are combined into a tuple (s_{t-1}, a_{t-1}, r_{t-1}, s_t) and stored in the storage buffer;
when the storage buffer area is full, the latest stored state information is used for replacing the earliest stored state information;
in specific implementation, the reward information is calculated by the following steps:
the reward information r_t is obtained by the following calculation formula: r_t = ω1·D_t + ω2·W_t, in which ω1 and ω2 are weight values, and ω1 + ω2 = 1; D_t is the vehicle average waiting time variable, computed from N_t, the total number of waiting vehicles on the road at time t, d_i, the corresponding waiting time of each vehicle, T, the duration of the traffic light in one phase, v_i, the current speed of the vehicle, and v_min, the prescribed minimum speed of the vehicle; W_t is the variable of the longest and the shortest waiting time of the vehicles, computed from w_t, the waiting time of the shortest- or longest-waiting vehicle;
s5, extracting a plurality of pieces of information from the data information obtained in the step S4 to train the evaluation width learning system, and taking the trained evaluation width learning system as the current evaluation width learning system; the method specifically comprises the following steps:
before each round of training, a batch of data with the size of a set value P is extracted from a storage buffer area in a uniform sampling mode and is put into a training pool; during each round of training, training data is obtained from a training pool to perform training; after each round of training, taking the evaluation width learning system after the current round of training as a current evaluation width learning system;
first, training is performed for an evaluation width learning system:
in the training data of the current round, (s_{t-1}, a_{t-1}) taken from the training pool is used as the input X of the system, and the target value y is used as the output Y of the system; s_{t-1} is the state information data at the previous moment, a_{t-1} is the action information data at the previous moment, and y is the training target value information data;
the input X is mapped to the feature space by the following algorithm: Z_i = φ(X·W_{e_i} + β_{e_i}), i = 1, 2, ..., n, in which Z_i is the i-th group of feature nodes; W_{e_i} is a randomly generated random weight matrix with a set dimension; β_{e_i} is a randomly generated bias term; φ is the first nonlinear mapping function;
the feature nodes are mapped to obtain the enhancement nodes by the following formula: H_j = ξ(Z^n·W_{h_j} + β_{h_j}), j = 1, 2, ..., m, in which H_j is the j-th group of enhancement nodes; Z^n = [Z_1, Z_2, ..., Z_n] is the set of the n groups of mapped features; W_{h_j} is a randomly generated random weight matrix; β_{h_j} is a randomly generated bias term; ξ is the second nonlinear mapping function; n is the number of groups of feature nodes;
the feature nodes and the enhancement nodes are connected and imported into the output layer of the system to obtain the output Y of the system: Y = [Z^n | H^m]·W_m, in which H^m = [H_1, H_2, ..., H_m] is the set of the m groups of enhancement nodes; m is the number of groups of enhancement nodes; W_m is the connection weight of the network in the system;
if the training of the evaluation width learning system does not meet the set requirement, incremental learning is performed; the incremental learning comprises adding feature nodes and adding enhancement nodes;
after training for a set number of times, the weight values of the evaluation width learning system are copied to the target width learning system to complete the update of the target width learning system;
the target value y is calculated by the following formula: y = r + γ·Q(s_t, a*; θ′), in which r is the reward information at the current moment; γ is a discount factor; Q(s_t, a*; θ′) is the Q value corresponding to the optimal action a*, evaluated in the target width learning system; θ′ denotes the parameters of the target width learning system.
The optimal action a* is generated in the evaluation network, and the best action in the latest state is calculated by the following calculation formula: a* = argmax_a Q(s_t, a; θ), in which argmax_a denotes taking the action a with the maximum corresponding Q value; Q(s_t, a; θ) is the Q value of the corresponding action a in the state s_t; s_t is the current state; θ denotes the network parameters of the evaluation width learning system; the optimal action a* is then used to calculate the corresponding target Q value in the target width learning system;
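A small sketch of this target-value computation is given below; eval_q and target_q are assumed to be callables mapping a state to the vector of Q values over the action set (one per traffic-signal action), and the discount value is illustrative.

```python
import numpy as np

def td_target(reward, state, eval_q, target_q, gamma=0.9):
    """Best action a* chosen with the evaluation width learning system, its Q value
    read from the target width learning system, then y = r + gamma * Q_target."""
    a_star = int(np.argmax(eval_q(state)))
    return reward + gamma * float(target_q(state)[a_star])
```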
the connection weight of the network in the system is calculated by the following steps:
the nodes of the evaluation width learning system are expressed as A_m = [Z^n | H^m];
the connection weight W_m of the network in the system is calculated by the following formula: W_m = (A_mᵀ·A_m + λ·I)⁻¹·A_mᵀ·Y, in which A_mᵀ is the transpose of A_m; λ is a regularization parameter; I is the identity matrix; Y is the output of the system;
the incremental learning specifically comprises the following steps:
incremental learning is achieved by adding enhancement nodes:
the newly added group of enhancement nodes H_{m+1} is denoted as H_{m+1} = ξ(Z^n·W_{h_{m+1}} + β_{h_{m+1}});
the nodes of the new evaluation width learning system are expressed as A_{m+1} = [A_m | H_{m+1}];
according to the pseudo-inverse theory of the block matrix, the pseudo-inverse of A_{m+1} is calculated as A_{m+1}⁺ = [A_m⁺ - D·Bᵀ; Bᵀ], in which D = A_m⁺·H_{m+1}; Bᵀ is the pseudo-inverse of C when C ≠ 0, and Bᵀ = (1 + Dᵀ·D)⁻¹·Dᵀ·A_m⁺ when C = 0; C is the calculated value C = H_{m+1} - A_m·D; Bᵀ is the transpose of the matrix B;
the new connection weight of the network of the evaluation width learning system is W_{m+1} = [W_m - D·Bᵀ·Y; Bᵀ·Y];
incremental learning is realized by adding feature nodes:
the newly added group of feature nodes Z_{n+1} is denoted as Z_{n+1} = φ(X·W_{e_{n+1}} + β_{e_{n+1}});
the corresponding added enhancement nodes are randomly generated as H_ex = ξ(Z_{n+1}·W_ex + β_ex), in which W_ex is a randomly generated random weight matrix with a suitable dimension and β_ex is a randomly generated bias term; the nodes of the new evaluation width learning system are then expressed as A_{n+1} = [A_n | Z_{n+1} | H_ex];
incremental learning is performed on A_{n+1} based on the pseudo-inverse A_n⁺;
according to the pseudo-inverse theory of the block matrix, the pseudo-inverse of A_{n+1} is calculated as A_{n+1}⁺ = [A_n⁺ - D·Bᵀ; Bᵀ], in which D = A_n⁺·[Z_{n+1} | H_ex]; Bᵀ is the pseudo-inverse of C when C ≠ 0, and Bᵀ = (1 + Dᵀ·D)⁻¹·Dᵀ·A_n⁺ when C = 0; C is the calculated value C = [Z_{n+1} | H_ex] - A_n·D; Bᵀ is the transpose of the matrix B;
the new connection weight of the network of the evaluation width learning system is W_{n+1} = [W_n - D·Bᵀ·Y; Bᵀ·Y];
And S6, repeating the steps S1-S5 in real time to complete the traffic signal lamp control based on multi-strategy reinforcement learning at the target traffic signal lamp.
The process according to the invention is further illustrated by the following examples:
simulating traffic signal lamp control on the SUMO platform, wherein the set environment and traffic rules are very close to the real scene, as shown in FIG. 2; the environment is a traffic environment of a four-intersection, including roads, vehicles and traffic lights; the intersection is provided with 4 lanes, the distance from each lane to the zebra crossing is 75 meters, and the leftmost lane is specially used for left turning; the rightmost side is dedicated to right turn and straight lanes; two middle lanes are dedicated to straight travel. The traffic light system is laid out as follows: the leftmost lane is provided with a special traffic light, and the other three lanes share the traffic light. Vehicles at opposite intersections are simultaneously controlled by a set of traffic lights, for example, the left turn signal lights in the south and north directions are simultaneously green lights, and the other direction signal lights are red lights.
The state information describes the environment information at each moment and is represented as shown in FIG. 3: the first 70 meters of the left-turn lane and of the straight-through lanes (including the rightmost lane shared by right turns and straight travel) are each divided into 10 equal-length cells; if there is a vehicle in a cell, the cell is marked 1, otherwise 0. With 20 cells per approach, the state information of the whole environment covering the 4 approaches is an 80-dimensional vector.
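A sketch of this cell-based encoding is shown below; the input format (a mapping from lane id to vehicle distances from the stop line, in metres) and the use of NumPy are assumptions, while the 70-metre monitored length, the 10 equal cells and the 0/1 occupancy marking follow the description above.

```python
import numpy as np

def encode_state(vehicle_positions, lane_length=70.0, n_cells=10):
    """Cell-based occupancy encoding: each monitored lane section is split into
    n_cells equal-length cells; a cell is 1 if it contains at least one vehicle,
    otherwise 0.  vehicle_positions maps a lane id to a list of distances (m)
    from the stop line; concatenating per-approach vectors yields the 80-dim state."""
    cell_len = lane_length / n_cells
    lanes = []
    for lane_id in sorted(vehicle_positions):
        cells = np.zeros(n_cells, dtype=np.int8)
        for pos in vehicle_positions[lane_id]:
            if 0.0 <= pos < lane_length:
                cells[int(pos // cell_len)] = 1  # occupied cell
        lanes.append(cells)
    return np.concatenate(lanes) if lanes else np.zeros(0, dtype=np.int8)
```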
In this embodiment, the predefined actions are releasing straight-through traffic or left-turn traffic for a pair of opposite directions, as shown in FIG. 4;
during the simulation experiments, the input data set comprises real-world traffic flow data and randomly generated traffic flow data; each experiment is set to 100 rounds, the initial strategy is set to a completely random strategy, and the probability of random selection decreases by 0.01 as each round is added; by the last round, the probability of random selection has decreased to 0 and action selection relies entirely on the learned strategy;
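One plausible reading of this exploration schedule, expressed as a small helper, is shown below; the exact rounding at the final rounds is an assumption.

```python
def exploration_probability(round_index, step=0.01):
    """Probability of choosing a random action at a given training round: starts
    at 1.0 (completely random strategy) and decreases by `step` per round,
    reaching 0 by the last of the 100 rounds (per the embodiment's description)."""
    return max(0.0, 1.0 - step * round_index)
```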
when comparison experiments are carried out, the method of the invention is compared with the existing DQN (Deep Q-Network), BQN (Broad Q-Network) and DDQN (Double Deep Q-Network) methods; the DQN method was proposed by Volodymyr Mnih, Koray Kavukcuoglu et al. in the paper "Human-level control through deep reinforcement learning" published in Nature in 2015, and realizes reinforcement learning trained with a deep network by combining deep networks and reinforcement learning; the BQN method was proposed by Xin Wei, Jialin Zhao et al. in the paper "Broad Reinforcement Learning for Supporting Fast Autonomous IoT" published in IEEE Internet of Things Journal in 2020, and replaces the deep network with a BLS (Broad Learning System, width learning system), thereby alleviating the problem of long training time; the DDQN method is based on the DQN method and, in order to prevent overestimation of the target Q value during training, performs action selection and Q value estimation separately in two networks;
in the comparison, algorithm performance is evaluated in terms of reward and training time; the "reward" index is an evaluation index calculated mainly from the waiting time of vehicles and reflects the effectiveness of the algorithm; the higher the index value, the better the performance of the algorithm; the "training time" index is a key index reflecting the learning efficiency of the algorithm and records the time required for the algorithm to reach the predetermined effect from the beginning of learning; the shorter the training time, the higher the efficiency of the algorithm.
The method of the invention is compared with the DQN method, the BQN method and the Double DQN (DDQN) method under the same environment, and the comparison data for the reward index are shown in FIG. 5; as can be seen from FIG. 5, as the number of training rounds increases, the reward index of the method of the invention is consistently better than that of the existing methods, demonstrating that the method of the invention has better convergence performance.
The method of the invention is compared with the DQN method and the Double DQN method under the same environment, and the comparison data for the training time index are shown in FIG. 6; as can be seen from FIG. 6, as the number of training rounds increases, the training time of the method of the invention is consistently lower than that of the existing methods, so the method of the invention has higher efficiency, a faster learning speed and a faster system response speed.

Claims (10)

1. A traffic signal lamp control method based on multi-strategy reinforcement learning is characterized by comprising the following steps:
s1, acquiring traffic data information at a target traffic signal lamp at the current moment;
s2, according to the data information obtained in the step S1, complexity judgment is carried out by adopting a classification width learning system:
if the system is judged to be a simple system, calculating a control strategy of the traffic signal lamp at the next moment according to the acquired data information, ending the control process of the traffic signal lamp at the current moment, and jumping to the step S4;
if the complex system is determined, continuing the subsequent steps;
s3, calculating an optimal action value at the next moment by adopting the current evaluation width learning system according to the current state information; the traffic signal lamp control process at the current moment is ended, and the process jumps to step S4;
s4, acquiring state information, control strategies and reward information of the current moment and the historical moments;
s5, extracting a plurality of pieces of information from the data information obtained in the step S4 to train the evaluation width learning system, and taking the trained evaluation width learning system as the current evaluation width learning system;
and S6, repeating the steps S1-S5 in real time to complete the traffic signal lamp control based on multi-strategy reinforcement learning at the target traffic signal lamp.
2. The traffic light control method based on multi-strategy reinforcement learning according to claim 1, wherein the step S2 comprises the following steps:
according to the data information acquired in step S1, a classification width learning system is adopted to carry out complexity judgment:
if the system is judged to be a simple system, calculating a control strategy of the traffic signal lamp at the next moment by adopting the Webster algorithm according to the acquired data information; the traffic signal lamp control process at the current moment is ended, and the process jumps to step S4;
if the complex system is determined, the following steps are continued.
3. The traffic light control method based on multi-strategy reinforcement learning according to claim 2, wherein the step S3 comprises the following steps:
according to the current state information, a current evaluation width learning system is adopted, and based on the state of the current moment, an optimal action value of the next moment is calculated and obtained, wherein the optimal action value corresponds to a control strategy of a traffic signal lamp;
after the calculation is completed, the traffic signal lamp control process at the current moment is ended, and the process jumps to step S4.
4. The traffic light control method based on multi-strategy reinforcement learning according to claim 3, wherein the step S3 specifically comprises the following steps:
based on the current state information, a current evaluation width learning system is adoptedThe optimal action value +_for the next moment is calculated by the following formula>: />In->For action->A corresponding maximum value; />Status +.>Lower corresponding action->Q value of (2); />Is the current state; />To evaluate network parameters in a breadth-learning system; optimal action value->Corresponding to the control strategy of the traffic light.
5. The traffic light control method based on multi-strategy reinforcement learning according to claim 4, wherein the step S4 comprises the following steps:
at each moment t, the state information s_{t-1} of the previous moment, the action information a_{t-1}, the reward information r_{t-1} and the state information s_t of the current moment are combined into a tuple (s_{t-1}, a_{t-1}, r_{t-1}, s_t) and stored in the storage buffer;
when the memory buffer is full, the earliest stored state information is replaced with the latest stored state information.
6. The traffic light control method based on multi-strategy reinforcement learning according to claim 5, wherein the reward information is calculated by the following steps:
the reward information r_t is obtained by the following calculation formula: r_t = ω1·D_t + ω2·W_t, in which ω1 and ω2 are weight values, and ω1 + ω2 = 1; D_t is the vehicle average waiting time variable, computed from N_t, the total number of waiting vehicles on the road at time t, d_i, the corresponding waiting time of each vehicle, T, the duration of the traffic light in one phase, v_i, the current speed of the vehicle, and v_min, the prescribed minimum speed of the vehicle; W_t is the variable of the longest and the shortest waiting time of the vehicles, computed from w_t, the waiting time of the shortest- or longest-waiting vehicle.
7. The traffic light control method based on multi-strategy reinforcement learning according to claim 6, wherein the step S5 comprises the following steps:
before each round of training, a batch of data with the size of a set value P is extracted from a storage buffer area in a uniform sampling mode and is put into a training pool; during each round of training, training data is obtained from a training pool to perform training; after each round of training, taking the evaluation width learning system after the current round of training as a current evaluation width learning system;
first, training is performed for an evaluation width learning system:
in the training data of the current round, (s_{t-1}, a_{t-1}) taken from the training pool is used as the input X of the system, and the target value y is used as the output Y of the system; s_{t-1} is the state information data at the previous moment, a_{t-1} is the action information data at the previous moment, and y is the training target value information data;
the input X is mapped to the feature space by the following algorithm: Z_i = φ(X·W_{e_i} + β_{e_i}), i = 1, 2, ..., n, in which Z_i is the i-th group of feature nodes; W_{e_i} is a randomly generated random weight matrix with a set dimension; β_{e_i} is a randomly generated bias term; φ is the first nonlinear mapping function;
the feature nodes are mapped to obtain the enhancement nodes by the following formula: H_j = ξ(Z^n·W_{h_j} + β_{h_j}), j = 1, 2, ..., m, in which H_j is the j-th group of enhancement nodes; Z^n = [Z_1, Z_2, ..., Z_n] is the set of the n groups of mapped features; W_{h_j} is a randomly generated random weight matrix; β_{h_j} is a randomly generated bias term; ξ is the second nonlinear mapping function; n is the number of groups of feature nodes;
the feature nodes and the enhancement nodes are connected and imported into the output layer of the system to obtain the output Y of the system: Y = [Z^n | H^m]·W_m, in which H^m = [H_1, H_2, ..., H_m] is the set of the m groups of enhancement nodes; m is the number of groups of enhancement nodes; W_m is the connection weight of the network in the system;
if the training of the evaluation width learning system does not meet the set requirement, incremental learning is performed; the incremental learning comprises adding feature nodes and adding enhancement nodes;
after training for a set number of times, the weight values of the evaluation width learning system are copied to the target width learning system to complete the update of the target width learning system.
8. The traffic light control method based on multi-strategy reinforcement learning according to claim 7, wherein the target value y is calculated by the following formula: y = r + γ·Q(s_t, a*; θ′), in which r is the reward information at the current moment; γ is a discount factor; Q(s_t, a*; θ′) is the Q value corresponding to the optimal action a*, evaluated in the target width learning system; θ′ denotes the parameters of the target width learning system.
9. The traffic light control method based on multi-strategy reinforcement learning according to claim 8, wherein the connection weight of the network in the system is calculated by the following steps:
the nodes of the evaluation width learning system are expressed as A_m = [Z^n | H^m];
the connection weight W_m of the network in the system is calculated by the following formula: W_m = (A_mᵀ·A_m + λ·I)⁻¹·A_mᵀ·Y, in which A_mᵀ is the transpose of A_m; λ is a regularization parameter; I is the identity matrix; Y is the output of the system.
10. The traffic light control method based on multi-strategy reinforcement learning according to claim 9, wherein the incremental learning comprises the steps of:
incremental learning is achieved by adding enhancement nodes:
the newly added group of enhancement nodes H_{m+1} is denoted as H_{m+1} = ξ(Z^n·W_{h_{m+1}} + β_{h_{m+1}});
the nodes of the new evaluation width learning system are expressed as A_{m+1} = [A_m | H_{m+1}];
according to the pseudo-inverse theory of the block matrix, the pseudo-inverse of A_{m+1} is calculated as A_{m+1}⁺ = [A_m⁺ - D·Bᵀ; Bᵀ], in which D = A_m⁺·H_{m+1}; Bᵀ is the pseudo-inverse of C when C ≠ 0, and Bᵀ = (1 + Dᵀ·D)⁻¹·Dᵀ·A_m⁺ when C = 0; C is the calculated value C = H_{m+1} - A_m·D; Bᵀ is the transpose of the matrix B;
the new connection weight of the network of the evaluation width learning system is W_{m+1} = [W_m - D·Bᵀ·Y; Bᵀ·Y];
incremental learning is realized by adding feature nodes:
the newly added group of feature nodes Z_{n+1} is denoted as Z_{n+1} = φ(X·W_{e_{n+1}} + β_{e_{n+1}});
the corresponding added enhancement nodes are randomly generated as H_ex = ξ(Z_{n+1}·W_ex + β_ex), in which W_ex is a randomly generated random weight matrix with a suitable dimension and β_ex is a randomly generated bias term; the nodes of the new evaluation width learning system are then expressed as A_{n+1} = [A_n | Z_{n+1} | H_ex];
incremental learning is performed on A_{n+1} based on the pseudo-inverse A_n⁺;
according to the pseudo-inverse theory of the block matrix, the pseudo-inverse of A_{n+1} is calculated as A_{n+1}⁺ = [A_n⁺ - D·Bᵀ; Bᵀ], in which D = A_n⁺·[Z_{n+1} | H_ex]; Bᵀ is the pseudo-inverse of C when C ≠ 0, and Bᵀ = (1 + Dᵀ·D)⁻¹·Dᵀ·A_n⁺ when C = 0; C is the calculated value C = [Z_{n+1} | H_ex] - A_n·D; Bᵀ is the transpose of the matrix B;
the new connection weight of the network of the evaluation width learning system is W_{n+1} = [W_n - D·Bᵀ·Y; Bᵀ·Y].
CN202311050477.6A 2023-08-21 2023-08-21 Traffic signal lamp control method based on multi-strategy reinforcement learning Active CN116758767B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311050477.6A CN116758767B (en) 2023-08-21 2023-08-21 Traffic signal lamp control method based on multi-strategy reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311050477.6A CN116758767B (en) 2023-08-21 2023-08-21 Traffic signal lamp control method based on multi-strategy reinforcement learning

Publications (2)

Publication Number Publication Date
CN116758767A CN116758767A (en) 2023-09-15
CN116758767B true CN116758767B (en) 2023-10-20

Family

ID=87953777

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311050477.6A Active CN116758767B (en) 2023-08-21 2023-08-21 Traffic signal lamp control method based on multi-strategy reinforcement learning

Country Status (1)

Country Link
CN (1) CN116758767B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117933499A (en) * 2024-03-22 2024-04-26 中国铁建电气化局集团有限公司 Invasion risk prediction method, device and storage medium for high-speed railway catenary

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103559795A (en) * 2013-11-07 2014-02-05 青岛海信网络科技股份有限公司 Multi-strategy and multi-object self-adaptation traffic control method
CN106778853A (en) * 2016-12-07 2017-05-31 中南大学 Unbalanced data sorting technique based on weight cluster and sub- sampling
CN111243271A (en) * 2020-01-11 2020-06-05 多伦科技股份有限公司 Single-point intersection signal control method based on deep cycle Q learning
CN111696345A (en) * 2020-05-08 2020-09-22 东南大学 Intelligent coupled large-scale data flow width learning rapid prediction algorithm based on network community detection and GCN
CN111696370A (en) * 2020-06-16 2020-09-22 西安电子科技大学 Traffic light control method based on heuristic deep Q network

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2583747B (en) * 2019-05-08 2023-12-06 Vivacity Labs Ltd Traffic control system
CA3162665A1 (en) * 2021-06-14 2022-12-14 The Governing Council Of The University Of Toronto Method and system for traffic signal control with a learned model

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103559795A (en) * 2013-11-07 2014-02-05 青岛海信网络科技股份有限公司 Multi-strategy and multi-object self-adaptation traffic control method
CN106778853A (en) * 2016-12-07 2017-05-31 中南大学 Unbalanced data sorting technique based on weight cluster and sub- sampling
CN111243271A (en) * 2020-01-11 2020-06-05 多伦科技股份有限公司 Single-point intersection signal control method based on deep cycle Q learning
CN111696345A (en) * 2020-05-08 2020-09-22 东南大学 Intelligent coupled large-scale data flow width learning rapid prediction algorithm based on network community detection and GCN
CN111696370A (en) * 2020-06-16 2020-09-22 西安电子科技大学 Traffic light control method based on heuristic deep Q network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Ruijie Zhu et al. Multi-agent broad reinforcement learning for intelligent traffic light control. Information Sciences. 2022, 509-525. *
Song Jiong et al. Analysis of the application of the Q-learning algorithm in multi-intersection traffic signal control models. Value Engineering. 2012, 136-137. *

Also Published As

Publication number Publication date
CN116758767A (en) 2023-09-15

Similar Documents

Publication Publication Date Title
CN111260937B (en) Cross traffic signal lamp control method based on reinforcement learning
CN111696370B (en) Traffic light control method based on heuristic deep Q network
CN109215355A (en) A kind of single-point intersection signal timing optimization method based on deeply study
CN111260118B (en) Vehicle networking traffic flow prediction method based on quantum particle swarm optimization strategy
CN110570672B (en) Regional traffic signal lamp control method based on graph neural network
CN114038212B (en) Signal lamp control method based on two-stage attention mechanism and deep reinforcement learning
CN116758767B (en) Traffic signal lamp control method based on multi-strategy reinforcement learning
CN113963555B (en) Depth combined with state prediction control method for reinforcement learning traffic signal
CN113538910A (en) Self-adaptive full-chain urban area network signal control optimization method
CN113223305A (en) Multi-intersection traffic light control method and system based on reinforcement learning and storage medium
CN104766485A (en) Traffic light optimization time distribution method based on improved fuzzy control
CN115578870B (en) Traffic signal control method based on near-end policy optimization
CN115691167A (en) Single-point traffic signal control method based on intersection holographic data
CN114120670B (en) Method and system for traffic signal control
CN113409576B (en) Bayesian network-based traffic network dynamic prediction method and system
CN110021168B (en) Grading decision method for realizing real-time intelligent traffic management under Internet of vehicles
CN115331460B (en) Large-scale traffic signal control method and device based on deep reinforcement learning
CN116824848A (en) Traffic signal optimization control method based on Bayesian deep Q network
CN115472023A (en) Intelligent traffic light control method and device based on deep reinforcement learning
CN113077642B (en) Traffic signal lamp control method and device and computer readable storage medium
CN115762128A (en) Deep reinforcement learning traffic signal control method based on self-attention mechanism
CN115512558A (en) Traffic light signal control method based on multi-agent reinforcement learning
CN108597239B (en) Traffic light control system and method based on Markov decision
CN113487870A (en) Method for generating anti-disturbance to intelligent single intersection based on CW (continuous wave) attack
CN112766533A (en) Shared bicycle demand prediction method based on multi-strategy improved GWO _ BP neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant