CN115547050A

CN115547050A - Intelligent traffic signal control optimization method and software based on Markov decision process

Info

Publication number: CN115547050A
Application number: CN202211244345.2A
Authority: CN
Inventors: 曹锡玉; 宦涣; 袁月明
Original assignee: Yunkong Zhihang Shanghai Automotive Technology Co ltd
Current assignee: Yunkong Zhihang Shanghai Automotive Technology Co ltd
Priority date: 2022-10-11
Filing date: 2022-10-11
Publication date: 2022-12-30

Abstract

The application provides an intelligent traffic signal control optimization method and software based on a Markov decision process, wherein the method comprises the following steps: acquiring real-time traffic flow data of a target road intersection; predicting the traffic flow condition of the target road intersection at the next moment through a traffic signal control model constructed based on a Markov decision process according to the real-time traffic flow data to obtain a prediction result; executing a control strategy on the traffic signal according to the prediction result; the construction factors of the traffic signal control model comprise a state space and an action space; the state space is used for representing the state of vehicle flow of the target road intersection at each time interval; the action space is used for representing signal control strategies of the target road intersection in different states at different time intervals. The method can at least solve the technical problems that the existing traffic signal control method has poor timeliness and cannot adapt to the situation that the current traffic flow is complicated and changeable.

Description

Intelligent traffic signal control optimization method and software based on Markov decision process

Technical Field

The application relates to the technical field of intelligent traffic, in particular to an intelligent traffic signal control optimization method and software based on a Markov decision process.

Background

Traffic signal control allocates the right of way in turn to traffic flows in different directions, on the basis of the channelization of the road intersection. Traffic flows are separated in time, and traffic signal control parameters such as the green signal ratio are automatically optimized and adjusted through advanced traffic models and algorithms, so that the traffic signals of a group of intersections, or of all intersections in an area, are optimally coordinated and traffic is ultimately organized to pass through the intersections safely and efficiently. Traffic signal control technology has generally gone through four major stages of development:

the first stage is mechanical traffic signal control technology;

the second stage is fixed-timing traffic signal control technology, in which the signal cycle and the green signal ratio of a single signal controller are determined mainly from experience and historical traffic data, and automatic cyclic control and multi-period control are realized by a computer;

the third stage is inductive (actuated) traffic signal control technology, a control mode that adjusts the signal display time of a single signal controller mainly according to traffic flow data measured by vehicle detectors; it is divided into semi-induction control (only some phases of the intersection have induction requests) and full induction control (all phases of the intersection have induction requests);

the fourth stage is line control technology (coordinated traffic signal control of several adjacent intersections along a road) and area (surface) control technology (coordinated control of all traffic signals in an area), which includes fixed-timing coordinated control systems, real-time plan-selection coordinated control systems and real-time adaptive coordinated control systems.

The currently mature line control systems are mainly PASSER-II, MAXBAND and the like from the United States. PASSER-II is line control coordination software that combines Brooks' interference method with Little's unequal-bandwidth optimization model and can handle multiphase timing. It determines the optimal ratio of traffic demand to capacity for each road, sets the green signal ratio of each signal accordingly, and then varies the cycle length, phase and offset in each trial calculation to find the signal timing scheme with the widest through band. MAXBAND optimizes signal offsets using Little's mixed-integer programming model, given the cycle length, green signal ratio, intersection spacing and progression speed, so that different optimal bandwidths are determined for different traffic conditions.

The earliest area (surface) control system was TRANSYT, followed by more representative systems such as SCOOT, SCATS, ACTRA and UTCS. Most area control schemes periodically collect and analyze traffic information through detectors, and use a traffic model together with an optimization program to generate an optimal timing scheme, which is finally sent to the intersection signal controllers for implementation. The optimization program uses a small-step incremental optimization method that continuously adjusts three parameters in real time (green signal ratio, cycle and offset), which reduces the amount of computation and makes real-time traffic trends easier to track.

However, the inventors found that there are at least the following technical problems in the related art:

in the traffic signal control methods provided by traditional line control and area (surface) control technologies, the control rules are fixed and timeliness is poor, so the methods cannot adapt to today's complex and changeable traffic flow; in addition, although a number of representative traffic signal control systems such as NATS, HiCon and SMOOTH have appeared in China after the HT-UTCS urban traffic signal control system, and these systems offer some adaptability and real-time response to the traffic flow characteristics of different cities, regions and time periods, provide schemes for the collaborative optimization of traffic efficiency, safety and order, and meet real-time requirements to a certain extent, their effect is still not ideal.

Disclosure of Invention

An object of the present application is to provide an intelligent traffic signal control optimization method and software based on a Markov decision process, which at least solve the technical problems that existing traffic signal control methods have poor timeliness and cannot adapt to today's complex and changeable traffic flow.

To achieve the above object, some embodiments of the present application provide a method of controlling a traffic signal, the method including: acquiring real-time traffic flow data of a target road intersection; predicting the traffic flow condition of the target road intersection at the next moment through a traffic signal control model constructed based on a Markov decision process according to the real-time traffic flow data to obtain a prediction result; executing a control strategy on the traffic signal according to the prediction result; the construction factors of the traffic signal control model comprise a state space and an action space; the state space is used for representing the state of the vehicle flow at each time period of the target road intersection; the action space is used for representing signal control strategies of the target road intersection in different states at different time intervals.

Some embodiments of the present application also provide a control device of a traffic signal, the device including: one or more processors; and a memory storing computer program instructions that, when executed, cause the processor to perform the method as described above.

Some embodiments of the present application also provide a computer readable medium having stored thereon computer program instructions executable by a processor to implement the method of controlling a traffic signal.

Compared with the prior art, in the traffic signal control scheme provided by the embodiment of the application, real-time traffic flow data of the target road intersection is acquired; the traffic flow condition of the target road intersection at the next moment is then predicted from the real-time traffic flow data by a traffic signal control model constructed based on a Markov decision process, to obtain a prediction result; finally, a control strategy is executed on the traffic signal according to the prediction result. The construction factors of the traffic signal control model comprise a state space and an action space; the state space is used for representing the state of vehicle flow of the target road intersection at each time period; the action space is used for representing the signal control strategies of the target road intersection in different states at different time periods. Definitions of a state space and an action space are thus added to the traffic signal control model constructed based on the Markov decision process. On one hand, the states of the vehicle flow at each time period of the target road intersection can be enumerated exhaustively, and, in describing the traffic flow state, a description of the correlation between the selected feature information and other feature information can be added on top of the selected feature information, so that the data dimensionality is higher and the description of the state is finer; on the other hand, because an action space is added, the traffic signal control model can adjust the control strategy dynamically while predicting traffic flow conditions. Therefore, the scheme provided by the embodiment of the application adapts better to actual traffic flow conditions, is more timely, and helps provide a more refined signal control strategy.

Drawings

Fig. 1 is a flowchart of a method for controlling a traffic signal according to an embodiment of the present application;

Fig. 2 is a flowchart of another traffic signal control method according to an embodiment of the present application;

Fig. 3 is a flowchart of another traffic signal control method according to an embodiment of the present application;

Fig. 4 is a schematic diagram illustrating an example of a traffic signal control method according to an embodiment of the present application;

Fig. 5 is a schematic diagram of a traffic signal control model constructed based on a Markov decision process according to an embodiment of the present application;

Fig. 6 is a schematic diagram illustrating the training of a traffic signal control model constructed based on a Markov decision process according to an embodiment of the present application;

Fig. 7 is a schematic structural diagram of an apparatus according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without making any creative effort belong to the protection scope of the present application.

The following terms are used herein.

Markov decision process: Markov Decision Process in full, MDP for short; it is used to model the stochastic policies and returns achievable by an agent in an environment whose system state has the Markov property.

Deep RL: deep reinforcement learning, an artificial intelligence method that combines the perception capability of deep learning with the decision-making capability of reinforcement learning; it can perform control directly from input images and is closer to the human way of thinking.

Green signal ratio: the proportion of one traffic light cycle that is available for vehicles to pass, i.e. the ratio of the effective green time of a phase to the cycle duration.

Road network: a road system, distributed in the form of a net, formed by the interconnection and interweaving of various roads in a given area.
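As a small illustration of the green signal ratio defined above, the following Python sketch computes it for one phase. The function name and the sample timings are hypothetical and chosen only for illustration.

```python
def green_ratio(effective_green_s: float, cycle_s: float) -> float:
    """Green signal ratio of one phase: effective green time / cycle duration."""
    if cycle_s <= 0:
        raise ValueError("cycle duration must be positive")
    if not 0 <= effective_green_s <= cycle_s:
        raise ValueError("effective green time must lie within the cycle")
    return effective_green_s / cycle_s


# Hypothetical phase: 45 s of effective green in a 120 s cycle.
ratio = green_ratio(45.0, 120.0)
print(ratio)
```

With these sample values the phase receives 37.5% of the cycle.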

In the related art, the control rules defined by the signal control methods related to timing control and vehicle driving in traditional line control and area (surface) control models are fixed, so they are not suited to today's complex and changeable traffic flow (for example, abnormal traffic flow caused at complex intersections by various sudden traffic events, and abnormal changes in traffic flow at different times such as holidays, morning and evening peaks, and major events). The more complex the intersection, the worse the fixed signal control rules fit the traffic flow of the actual physical world. In a high-dimensional state space, traditional RL algorithms cannot effectively compute a value function and a policy function for every state. Although some linear function approximation methods have been proposed in RL to address the state space problem, their capability is limited, and in high-dimensional, complex systems traditional RL methods cannot learn the feature information of the environment well enough to perform efficient function approximation. Actual traffic flow conditions are complex and change rapidly, and traffic flow has many types of feature information, so traditional algorithms are limited in describing the traffic flow state space.

The timing schemes produced by traditional intelligent traffic signal control systems are mostly based on assumed states and on relatively fixed configurations that depend on historical experience. For example, for a given state, the phase sequence or the cycle duration is assumed to be unchanged and only the green signal ratio is adjusted; such a scheme is limited and cannot be adjusted flexibly according to real-time traffic conditions. In addition, long-term data monitoring and feedback on the effect of executing the scheme are lacking, and the systems have no capability for autonomous learning or dynamic optimization.

The embodiment of the application provides a control method of a traffic signal, which comprises the steps of obtaining real-time traffic flow data of a target road intersection, and then predicting the traffic flow condition of the target road intersection at the next moment through a traffic signal control model constructed based on a Markov decision process according to the real-time traffic flow data to obtain a prediction result; finally, according to the prediction result, executing a control strategy on the traffic signal; the construction factors of the traffic signal control model comprise a state space and an action space; the state space is used for representing the state of the vehicle flow at each time period of the target road intersection; the action space is used for representing signal control strategies of the target road intersection in different states at different time intervals.

In the embodiment of the application, definitions of a state space and an action space are added to the traffic signal control model constructed based on the Markov decision process. On one hand, the states of the vehicle flow at each time period of the target road intersection can be enumerated exhaustively, and, in describing the traffic flow state, a description of the correlation between the selected feature information and other feature information can be added on top of the selected feature information, so that the data dimensionality is higher and the description of the state is finer; on the other hand, because an action space is added, the traffic signal control model can adjust the control strategy dynamically while predicting the traffic flow condition. In conclusion, the scheme provided by the embodiment of the application adapts better to actual traffic flow conditions, is more timely, and helps provide a more refined signal control strategy.

As shown in fig. 1, a method for controlling a traffic signal according to an embodiment of the present application may include the following steps:

Step S101: acquiring real-time traffic flow data of the target road intersection.

Step S102: predicting the traffic flow condition of the target road intersection at the next moment through a traffic signal control model constructed based on a Markov decision process according to the real-time traffic flow data, to obtain a prediction result.

Step S103: executing a control strategy on the traffic signal according to the prediction result.

The construction factors of the traffic signal control model comprise a state space and an action space; the state space is used for representing the state of vehicle flow of the target road intersection at each time interval; the action space is used for representing signal control strategies of the target road intersection in different states at different time intervals.
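The three steps S101 to S103 can be read as one pass of a control loop. The following Python sketch shows that shape only; all function names, dictionary keys, thresholds and the stub logic are hypothetical placeholders and do not reproduce the model of the application.

```python
from typing import Callable, Dict

# Hypothetical data shapes, assumed for illustration only.
TrafficData = Dict[str, float]   # real-time traffic flow data of the intersection
Prediction = Dict[str, float]    # predicted traffic flow condition at the next moment


def control_step(
    acquire: Callable[[], TrafficData],            # S101: data source
    predict: Callable[[TrafficData], Prediction],  # S102: MDP-based model
    execute: Callable[[Prediction], str],          # S103: applies a control strategy
) -> str:
    """Run one acquire -> predict -> execute pass and return the chosen strategy."""
    data = acquire()
    prediction = predict(data)
    return execute(prediction)


# Stub implementations (assumptions) so the sketch runs end to end.
def demo_acquire() -> TrafficData:
    return {"speed_km_per_h": 30.0, "density_veh_per_km": 40.0}


def demo_predict(data: TrafficData) -> Prediction:
    # Placeholder "prediction": flow as the product of density and speed.
    return {"flow_veh_per_h": data["density_veh_per_km"] * data["speed_km_per_h"]}


def demo_execute(prediction: Prediction) -> str:
    # Placeholder rule with an arbitrary threshold, not a rule from the application.
    return "extend_green" if prediction["flow_veh_per_h"] > 1000.0 else "keep_plan"


print(control_step(demo_acquire, demo_predict, demo_execute))
```

Passing the three stages as callables keeps the loop independent of how the data source, the model and the signal controller are actually implemented.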

For step S101, specifically, real-time traffic flow data of the target road intersection in a city may be acquired through the cloud-controlled infrastructure. The real-time traffic flow data of the target road intersection may be lane-level real-time traffic flow data.

For step S102, specifically, when the traffic signal control model constructed based on the markov decision process predicts a traffic flow condition, the construction factors of the traffic signal control model used may include a state space and an action space; the state space is used for representing the state of vehicle flow of the target road intersection at each time interval; the action space is used for representing signal control strategies of the target road intersection in different states at different time intervals.

It should be understood that a state here can be understood as information reflecting how the traffic state of vehicles changes at each time period of the target road intersection, such as the speed of vehicles at the target road intersection and the density of vehicles at the target road intersection; of course, the information used to represent the state may also be selected and set flexibly according to actual requirements, and is not specifically limited here. The actions may be all the possible signal control timing schemes corresponding to a given state of the vehicles at the target road intersection.

In some examples, feature information for characterizing traffic flow data of a road intersection may be extracted, and then the state of vehicle traffic at each time period of the target road intersection may be described based on the feature information.

In step S103, specifically, a control strategy is executed on the traffic signal according to a prediction result output by the traffic signal control model constructed based on the markov decision process.

Compared with the related art, definitions of a state space and an action space are added to the traffic signal control model constructed based on the Markov decision process. On one hand, the states of the vehicle flow at each time period of the target road intersection can be enumerated exhaustively, and, in describing the traffic flow state, a description of the correlation between the selected feature information and other feature information can be added on top of the selected feature information, so that the data dimensionality is higher and the description of the state is finer; on the other hand, because an action space is added, the traffic signal control model can adjust the control strategy dynamically while predicting traffic flow conditions. In conclusion, the scheme provided by the embodiment of the application adapts better to actual traffic flow conditions, is more timely, and helps provide a more refined signal control strategy.

In some embodiments of the present application, the state space is determined based on a speed characteristic of a vehicle and a density characteristic of the vehicle at the target road intersection; and the action space is determined according to the phase sequence of the traffic signals of the target road intersection, and the cycle duration and the green signal ratio of the corresponding signal lamps under different phase sequences.

Specifically, the correlation of the speed feature and the density feature can be constructed through a Markov chain to calibrate the vehicle flow at the target road intersection, and the correlation is used as a state space.

The phase sequence of the traffic signals, i.e. the phase sequence of the traffic lights, refers here to the order in which traffic flows in different directions are released. Thus, in some examples, the signal control timing scheme output by the traffic signal control model constructed based on the Markov decision process may include a three-dimensional array consisting of a variable phase sequence, a cycle duration and a green signal ratio.
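One way to picture such an action space is as the set of all combinations of phase sequence, cycle duration and green signal ratio. The Python sketch below enumerates a small discrete action space under that reading; the class name, the phase labels and the candidate values are assumptions made for illustration, not values from the application.

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class TimingAction:
    """One candidate timing scheme: (phase sequence, cycle duration, green ratios)."""
    phase_sequence: Tuple[str, ...]
    cycle_s: int
    green_ratios: Tuple[float, ...]  # one ratio per phase, in phase order


def build_action_space(
    phase_sequences: Sequence[Tuple[str, ...]],
    cycles_s: Sequence[int],
    green_ratio_sets: Sequence[Tuple[float, ...]],
) -> List[TimingAction]:
    """Enumerate every combination whose ratio count matches its phase count."""
    actions = []
    for seq, cycle, ratios in product(phase_sequences, cycles_s, green_ratio_sets):
        if len(ratios) == len(seq):
            actions.append(TimingAction(seq, cycle, ratios))
    return actions


# Hypothetical two-phase intersection ("NS" and "EW" are illustrative labels).
ACTIONS = build_action_space(
    phase_sequences=[("NS", "EW"), ("EW", "NS")],
    cycles_s=[90, 120],
    green_ratio_sets=[(0.5, 0.4), (0.6, 0.3)],
)
print(len(ACTIONS))
```

A frozen dataclass makes each action hashable, so actions can be used directly as dictionary keys in a value table.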

Compared with the related art, the traffic signal control method provided by the application calibrates the vehicle flow at the target road intersection by constructing the correlation between the speed characteristic and the density characteristic, so that the states in the state space are described more finely in this embodiment; and the action space is determined according to the phase sequence of the traffic signals of the target road intersection and the corresponding cycle duration and green signal ratio of the signal lamps under different phase sequences, so that the timing scheme of the intersection signals can be adjusted dynamically, and the resulting optimized timing scheme is more flexible and fits traffic conditions better.

In some embodiments of the present application, the method for determining the state space may include determining the speed characteristic and the density characteristic according to the real-time traffic flow data; determining a flow characteristic of the vehicle according to the speed characteristic and the density characteristic; and determining the state space according to the flow characteristics of the vehicle.

Specifically, referring to fig. 2, the method of the embodiment of the present application may include the steps of:

step S201, determining the speed characteristic and the density characteristic according to the real-time traffic flow data;

step S202, determining the flow characteristic of the vehicle according to the speed characteristic and the density characteristic;

step S203, determining the state space according to the flow characteristics of the vehicle.

For step S201, the speed feature and the density feature of the vehicle may be extracted by analyzing the real-time traffic flow data acquired by the cloud-controlled infrastructure.

For step S202, a correlation among the speed characteristic of the vehicle, the density characteristic of the vehicle, and the flow characteristic of the vehicle may be established through the acquired large amount of real-time traffic flow data.

In some examples, the correlation may be established by the following equation:

Q = KV

where K represents a density characteristic of the vehicle, V represents a speed characteristic of the vehicle, and Q represents a flow characteristic of the vehicle.
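A worked instance of this relation, as a minimal Python sketch. The units (vehicles per kilometre, kilometres per hour, vehicles per hour) and the sample values are assumptions; the text above does not fix units.

```python
def flow_from_density_speed(density_veh_per_km: float, speed_km_per_h: float) -> float:
    """Flow characteristic Q as the product of density K and speed V (Q = KV)."""
    if density_veh_per_km < 0 or speed_km_per_h < 0:
        raise ValueError("density and speed must be non-negative")
    return density_veh_per_km * speed_km_per_h


# Hypothetical lane: K = 40 veh/km moving at V = 30 km/h.
q = flow_from_density_speed(40.0, 30.0)
print(q)
```

With K = 40 veh/km and V = 30 km/h, the sketch gives Q = 1200 veh/h.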

For step S203, the flow characteristics determined in step S202 may be recorded, so as to obtain the state space.

Compared with the related art, in the process of determining the state space, the scheme provided by the embodiment of the application adds, on top of the flow characteristic information of the vehicle, a description of the correlation between the flow characteristic information and the density characteristic information of the vehicle, so that the data dimensionality is higher and the state can be described in more detail. That is, in describing the traffic flow state, a description of the correlation between the selected feature information and other feature information is added on top of the selected feature information, which raises the data dimensionality and achieves a finer description of the state.

In some embodiments of the present application, the determining the state space according to the flow characteristics of the vehicle may include: dividing the traffic characteristics of the vehicle according to the traffic state of the target road intersection at each time interval; and obtaining the state space according to the flow characteristics of each divided vehicle.

Referring to fig. 3, a method of an embodiment of the present application may include the steps of:

step S301, dividing the traffic characteristics of the vehicle according to the traffic state of the target road intersection at each time period.

Step S302, obtaining the state space according to the flow characteristics of each divided vehicle.

Specifically, the traffic characteristics of the vehicle can be divided into a plurality of small intervals, and each flow interval represents one of the states of the traffic of the vehicle at each time interval of the target road intersection.

In some examples, the following sequence may be established:

where k is the intersection number and t is continuous time. Each sequence can be divided into n states, forming an n-order multivariate Markov chain. For example, if the sequence includes 10 states, the flow range from 0 to X is divided into 10 equal parts and n is 10; here n is an integer greater than or equal to 1, and its value may be determined according to the actual data.

The sequence established in the embodiment of the present application may be an expression form of the state space.
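The division of the flow range from 0 to X into n equal intervals can be sketched as follows in Python. The function name, the value chosen for X (2000) and the sample flows are assumptions for illustration; clipping out-of-range values into the first or last interval is also a design choice of this sketch rather than something stated above.

```python
from typing import List


def flow_to_state(flow: float, max_flow: float, n_states: int = 10) -> int:
    """Map a flow value in [0, max_flow] to one of n_states equal-width intervals.

    Returns an index in 0 .. n_states - 1. Values outside the range are clipped.
    """
    if n_states < 1:
        raise ValueError("n_states must be an integer >= 1")
    if max_flow <= 0:
        raise ValueError("max_flow (X) must be positive")
    clipped = min(max(flow, 0.0), max_flow)
    index = int(clipped * n_states / max_flow)
    return min(index, n_states - 1)  # flow == max_flow falls in the last interval


# Hypothetical flow observations for one intersection over successive periods.
observed_flows = [150.0, 900.0, 1200.0, 1950.0]
state_sequence: List[int] = [flow_to_state(q, max_flow=2000.0) for q in observed_flows]
print(state_sequence)
```

Each observed flow becomes a small integer, so the sequence of states over time is an ordered array that can index a transition or value table.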

It is not difficult to find that, compared with the related art, the embodiment of the present application provides a specific implementation manner for determining the state space, and the processing efficiency of data is favorably improved by changing the complex unordered state values into a relatively ordered array in a sequential manner.

In some embodiments of the present application, the executing a control strategy on the traffic signal according to the prediction result may include: after the control strategy is executed on the traffic signal according to the prediction result, feedback information sent by the traffic signal control model is received; and adjusting the control strategy executed on the traffic signal according to the feedback information.

Specifically, in the embodiment of the application, the traffic signal control model constructed based on the Markov decision process is tested, and it can be put into use in the intelligent traffic signal control system once the test results reach the accuracy required for actual use. After the traffic signal control model is put into use, operating data from the actual dynamic environment can be received, relevant calculations are performed on the feedback information sent by the traffic signal control model, and training and testing continue, so that the traffic signal control model can execute an optimal control strategy under traffic conditions that change in real time.

Compared with the related art, in the scheme provided by the embodiment of the application, the feedback information sent by the traffic signal control model is received after the control strategy is executed on the traffic signal according to the prediction result; and adjusting a control strategy executed on the traffic signal according to the feedback information, so as to be beneficial to better adapting to complicated and changeable traffic conditions.

In some embodiments of the present application, the construction factor of the traffic signal control model may further include: mapping relation between crossing service level and average delay time of vehicle; the average delay time is used for representing the time lost by the vehicle waiting for the red light at the intersection; the receiving of the feedback information sent by the traffic signal control model may include: and after the traffic signal control model determines the average delay time of the vehicle according to the mapping relation between the intersection service level and the average delay time of the vehicle, receiving feedback information sent by the traffic signal control model according to the average delay time of the vehicle.

In some examples, a mapping relationship between the average delay time and the intersection service level may be established, as shown in table 1:

TABLE 1

In this example, the intersection service level is divided into 6 levels.

Compared with the related art, the method provided by the embodiment of the application evaluates the control result of the control strategy after the traffic signal executes the control strategy by taking the average delay time into account, so that the control strategy is convenient to optimize.
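A mapping from average delay time to one of six service levels could be implemented as a threshold lookup, as in the Python sketch below. The threshold values and the level labels A to F are assumptions: they are stand-in values patterned on the signalized-intersection level-of-service bands of the Highway Capacity Manual and are not the values of Table 1. Using the negative delay as feedback is likewise only an illustrative choice.

```python
from bisect import bisect_left

# ASSUMPTION: upper delay bounds in seconds per vehicle for the first five levels.
# Stand-in values, not the values of Table 1.
DELAY_UPPER_BOUNDS_S = [10.0, 20.0, 35.0, 55.0, 80.0]
LEVELS = ["A", "B", "C", "D", "E", "F"]  # six levels; the labels are hypothetical


def service_level(avg_delay_s: float) -> str:
    """Return the service level whose delay band contains avg_delay_s."""
    if avg_delay_s < 0:
        raise ValueError("average delay must be non-negative")
    return LEVELS[bisect_left(DELAY_UPPER_BOUNDS_S, avg_delay_s)]


def delay_feedback(avg_delay_s: float) -> float:
    """Illustrative feedback signal: a shorter average delay gives a larger value."""
    return -avg_delay_s


print(service_level(42.0), delay_feedback(42.0))
```

Because `bisect_left` places a delay equal to a bound in the lower band, a delay of exactly 10 s maps to the first level in this sketch.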

In some embodiments of the present application, the adjusting the control strategy performed on the traffic signal according to the feedback information may include: and adjusting a control strategy executed on the traffic signal according to the feedback information through a deep reinforcement learning algorithm.

Specifically, a deep reinforcement learning neural network model may be used as a nonlinear function approximator to obtain an optimal solution for the traffic signal control method provided by this embodiment. In addition, by combining the deep reinforcement learning algorithm with the feedback information to perform actual sampling, a data set and a test set can be obtained, which facilitates autonomous optimization of the data set and the test set and iterative optimization of the control strategy.

Deep RL is currently one of the most successful artificial intelligence models and the machine learning paradigm closest to the way humans learn. It combines deep neural networks with reinforcement learning, which makes function approximation more effective and stable, particularly for problems with high-dimensional or infinite state spaces. Specifically: for high-dimensional state spaces, deep RL methods outperform traditional RL methods, since the optimal policy or value function can be learned by training a deep neural network, so that the value function and the policy function can be computed effectively for each state; in terms of the action space, policy-based deep RL methods are better suited to continuous action spaces than value-based deep RL methods; for discrete action spaces, controllers typically use DQN and its variants, because their structure is simpler than that of policy-based methods; in large state spaces, different neural network structures, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), can be used to train the reinforcement learning algorithm.

Compared with the related art, the embodiment of the present application provides a specific implementation for adjusting the control strategy executed on the traffic signal according to the feedback information, which makes the way of adjusting the control strategy more flexible.

In some embodiments of the present application, the adjusting, by a deep reinforcement learning algorithm, a control strategy performed on the traffic signal according to the feedback information may include: evaluating the traffic state at the future moment through the deep reinforcement learning algorithm to obtain an evaluation result; and adjusting the control strategy executed on the traffic signal by combining the evaluation result and the feedback information.

Specifically, the feedback information may include an instant reward: the feedback output after an action is executed under the control strategy output by the traffic signal control model is the instant reward. In addition, the effect of that output on future vehicle flow is assessed, that is, the traffic state at a future moment is evaluated to obtain an evaluation result, which can be understood as an additional reward. The control strategy executed on the traffic signal is then adjusted by combining the evaluation result and the feedback information.

The additional reward may also be referred to as a future reward.

Compared with the related art, the method provided by the embodiment of the application considers the current instant reward and the additional reward influencing the future before the control strategy executed on the traffic signal is adjusted, so that the adjusted control strategy is further suitable for complex and variable traffic conditions.

In summary, the control method of the traffic signal provided in the embodiment of the present application predicts the traffic flow condition of the target road intersection at the next moment through the traffic signal control model constructed based on the markov decision process according to the real-time traffic flow data, and obtains the prediction result; finally, according to the prediction result, executing a control strategy on the traffic signal; the construction factors of the traffic signal control model comprise a state space and an action space; the state space is used for representing the state of vehicle flow of the target road intersection at each time interval; the action space is used for representing signal control strategies of the target road intersection in different states at different time intervals. The traffic signal control model constructed based on the Markov decision process is added with the definition of a state space and an action space. Therefore, on one hand, each state in the states of the vehicle flow at each time interval of the target road intersection can be exhausted, and meanwhile, in the description of the traffic flow state, the correlation description of the selected characteristic information and other characteristic information can be increased on the basis of the selected characteristic information, so that the data dimensionality is higher, and the description of the state is more precise; on the other hand, because the action space is increased, the traffic signal control model constructed based on the Markov decision process can realize dynamic adjustment of the control strategy while predicting the traffic flow condition. Therefore, the scheme provided by the embodiment of the application has higher adaptability and better timeliness with the actual traffic flow condition, and is favorable for providing a more refined signal control strategy.

Briefly, referring to fig. 4, in the embodiment of the present application, intelligent traffic signal control for urban road intersections is performed based on a Markov decision process ((1)). Traffic flow characteristic information is extracted through the cloud control basic platform; flow, speed and density are defined as the state space of the model; variable phase sequence, cycle length and green split are defined as the action space; and the transition relationship between states is defined based on traffic flow big data to predict the traffic condition at the intersection at the next moment ((2)). A mapping between the intersection service level and the average delay duration is established to obtain the instant reward fed back by the actual environment after the strategy is executed, the additional reward brought by the strategy is predicted and evaluated based on the traffic condition at a future moment, and a reward function is established accordingly ((3)). Using a deep reinforcement learning neural network model, namely the traffic signal control model constructed based on the Markov decision process ((4)), strategy optimization ((5)) is performed based on the return value during continuous interaction between the system and the environment; the optimal solution of the model is tested through long-term data set collection, and the optimization effect of the scheme is verified.

In addition, for convenience of understanding the present solution, an example of a traffic signal control model constructed based on a markov decision process is also provided herein, and the traffic signal control model is described in detail below.

In some examples, the traffic signal control model constructed based on the markov decision process may be represented as a five-tuple, such as:

$$\mathrm{MDP} = \langle S, A, R, P, \gamma \rangle$$

where S represents the state space, a non-empty finite set of all possible states of vehicles at the target road intersection;

A represents the action space, a non-empty finite set of actions that can be executed in a state $s \in S$ at a certain time t;

R represents the reward function, characterizing the reward obtained when the state transitions from $S_t$ to $S_{t+1}$ after action $a_t$ is executed in state $S_t$;

P represents the transfer function, characterizing the probability of transitioning from state $S_i$ to state $S_j$ when action a is executed; the distribution matrix $P(s_{t+1} \mid s_t, a_t)$, which maps a state-action pair at a certain time to the next state $S_{t+1}$, is defined as the state transfer function;

γ represents the discount factor, characterizing the relative importance of the instant reward and the future reward.
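As an illustration only, the five-tuple can be held in a small container. The sketch below is not taken from the application: the class name `TrafficMDP`, the field names, and the choice of integers for states and strings for actions are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class TrafficMDP:
    """Hypothetical container for the five-tuple <S, A, R, P, gamma>."""
    states: Sequence[int]                              # S: finite, non-empty
    actions: Sequence[str]                             # A: finite, non-empty
    reward: Callable[[int, str, int], float]           # R(s_t, a_t, s_{t+1})
    transition: Callable[[int, str], Sequence[float]]  # P(. | s_t, a_t)
    gamma: float                                       # discount factor

    def __post_init__(self):
        # The text requires S and A to be non-empty finite sets.
        if not self.states or not self.actions:
            raise ValueError("S and A must be non-empty")
        # The text later states that gamma lies in (0, 1).
        if not 0.0 < self.gamma < 1.0:
            raise ValueError("gamma is expected in (0, 1)")
```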

The five aspects related to the five-tuple are described below respectively:

1. State space S

The state space S represents a non-empty finite set of all possible states that vehicles may produce at the target road intersection. Feature analysis can be performed on the traffic flow data of the cloud control basic platform to extract the speed and density characteristics and to establish the correlation between vehicle speed, vehicle density and vehicle flow. The flow characteristics of the intersection are recorded as time series data, and these traffic characteristics are defined as the state space of the model.

Specifically, the correlation between the speed, the density and the flow can be established based on a large amount of traffic flow data of the cloud control base platform.

Q＝KV

Where K is density characteristic information of the vehicle, V is speed characteristic information of the vehicle, and Q is flow characteristic information of the vehicle.

Further, the traffic may be divided into small intervals, each traffic interval representing a state, and the following sequence is established:

where k is the intersection number in the road network and t is continuous time. Each sequence can be divided into n states, forming an n-order multivariate Markov chain. For example, if the flow range from 0 to X is divided into 10 equal intervals, the sequence includes 10 states and n is 10. Here n is an integer greater than or equal to 1, and its value may be determined according to the actual data.
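The relation Q = KV and the equal-interval division of flow into states can be sketched as follows. This is a minimal illustration, assuming density in veh/km, speed in km/h and a known upper flow bound `q_max` (the "X" of the text); the function names are hypothetical, and clamping values at or above `q_max` into the last interval is an assumption of the sketch.

```python
def flow_from_density_speed(k, v):
    """Q = K * V (assumed units: veh/km * km/h = veh/h)."""
    return k * v


def flow_to_state(q, q_max, n=10):
    """Map a flow value in [0, q_max] to one of n equal-width states 0..n-1."""
    if n < 1 or q_max <= 0:
        raise ValueError("n must be >= 1 and q_max must be positive")
    if q < 0:
        raise ValueError("flow cannot be negative")
    width = q_max / n
    # Flows at or above q_max are clamped into the last interval.
    return min(int(q // width), n - 1)
```

For instance, with `q_max = 1000` and `n = 10`, a flow of 250 falls into state 2.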

2. Action space A

The action space A represents a non-empty finite set of actions that can be executed in a state $s \in S$ at a certain time t. That is, based on the Markov model, the control strategies, namely the signal control timing schemes, adopted by the user for the multiple intersections of the road network are defined as the action set. Specifically:

Let k denote the intersection number in the road network, and let $agent_k$ denote the traffic signal controller of the k-th intersection. In a road network with n intersections, the set of intersection signal controllers is:

$$K = \{agent_0, agent_1, \ldots, agent_k, \ldots, agent_n\}$$

Generally, the traffic signal controller at each intersection can execute only one signal control timing scheme at a time. Let $A_k$ be the set of signal control timing schemes of the signal controller $agent_k$ at the k-th intersection, and let $a_k$ be the action executed by $agent_k$; then $a_k \in A_k$.

In the traffic signal control model constructed based on the Markov decision process in the embodiment of the present application, the phase sequence of the traffic signal and the green-phase increase or decrease duration corresponding to each phase sequence are defined as the action, i.e., a high-dimensional array containing the green-light increase or decrease durations under the different phase-sequence strategies, which can be expressed as:

$$A = \{\, b_k (T_m + c_k),\ m = 1, 2, 3, 4;\ k = 1, 2, \ldots \,\}$$

where m is the phase sequence of the traffic signal, T is the cycle duration of the signal lamp, $b_k$ is the green split (green ratio), and $c_k$ is the increase/decrease duration parameter.

In this example, the traffic signal control model constructed based on the markov decision process considers mainly the following four green light phases:

North-South Green(NSG)；

East-West Green(EWG)；

North-South Advance Left Green(NSLG)；

East-West Advance Left Green(EWLG)。

The specific values of the green split and the increase/decrease duration parameter can be set according to empirical traffic signal control values.

In practical application, A should satisfy the following conditions:

1) The cycle duration of the signal lamp and the green-phase increase or decrease durations under the different phase sequences are all integers. In general, the cycle duration and green duration of a signal lamp have no fractional part.

2) The cycle duration of the signal lamp lies within a preset range. For example, depending on the empirical setting, the cycle duration may be between 60 s and 180 s.

3) The increase or decrease duration must not exceed a preset value. For example, depending on the empirical setting, the increase or decrease duration should be less than or equal to 120 s.
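The three conditions above can be checked mechanically. The sketch below is illustrative only: the `SignalAction` type and its field names are hypothetical, the default bounds are the example values from the text, and applying the 120 s limit to the magnitude of the adjustment (so that both increases and decreases are bounded) is an assumption.

```python
from dataclasses import dataclass

# The four green phases named in the text.
PHASES = ("NSG", "EWG", "NSLG", "EWLG")


@dataclass(frozen=True)
class SignalAction:
    phase: str          # one of PHASES
    cycle_s: int        # cycle duration in seconds
    green_delta_s: int  # green increase (+) or decrease (-) in seconds


def is_valid_action(a, cycle_range=(60, 180), max_delta_s=120):
    """Return True if the action satisfies conditions 1) to 3)."""
    if a.phase not in PHASES:
        return False
    # Condition 1: durations are integers.
    if not isinstance(a.cycle_s, int) or not isinstance(a.green_delta_s, int):
        return False
    # Condition 2: cycle duration within the preset range.
    if not cycle_range[0] <= a.cycle_s <= cycle_range[1]:
        return False
    # Condition 3: adjustment magnitude within the preset limit (assumed).
    return abs(a.green_delta_s) <= max_delta_s
```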

3. Transfer function P

The transfer function P characterizes the probability of transitioning from state $S_i$ to state $S_j$ when action a is executed. The distribution matrix $P(s_{t+1} \mid s_t, a_t)$, which maps a state-action pair at a certain time to the next state $S_{t+1}$, is defined as the state transfer function.

Specifically, the traffic flows of different intersections of a road network at different times are correlated in time and space. The traffic flow of an intersection at the current time is influenced by several factors in the previous time step, including the traffic condition of the upstream intersection and the traffic signal control applied at the intersection at the previous time. A high-order multivariate Markov chain is applied in the model to determine the state transition relation and construct a state transition matrix.
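To make the notion of a state transition matrix concrete, the sketch below estimates one from a single observed state sequence by counting transitions. It is a deliberately simplified first-order, single-sequence stand-in and not the high-order multivariate model of the application; the function name and the uniform fallback for states never observed are assumptions.

```python
def estimate_transition_matrix(states, n):
    """Row-stochastic matrix of empirical first-order transition frequencies.

    states: observed state indices in 0..n-1, in time order.
    """
    counts = [[0] * n for _ in range(n)]
    for s, s_next in zip(states, states[1:]):
        counts[s][s_next] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        if total == 0:
            # No outgoing transition was observed: fall back to uniform.
            matrix.append([1.0 / n] * n)
        else:
            matrix.append([c / total for c in row])
    return matrix
```

For the sequence `[0, 1, 0, 1, 1]` with `n = 2`, state 0 was always followed by state 1, and state 1 was followed by each state once.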

The flow sequence at the intersection can be expressed as

Here, regarding the traffic change at each intersection in each time interval as a state, the probability distribution of the state of the j-th sequence at time r+1 depends on the probability distributions of all the sequences at times r, r−1, …, r−n+1, and can be expressed as:

wherein,

and

is an h-step transition probability matrix from the state of the jth sequence at time r-h +1 to the state of the r-th sequence at time r + 1.

Let

Wherein,

then the higher order multivariate Markov chain can be represented by the following matrix:

wherein,

next, parameters are estimated

Q should satisfy X = XQ, so the parameters are solved by minimizing the difference X − XQ.

Considering the optimization problem:

due to the fact that

Wherein,

the vector is first predicted for each k-orientation quantity

Then state at time t +1

The predicted value of (c):

wherein,

Then, for the parameter $P_{ij}$, the state transition matrix is defined:

in the formula,

the traffic of the intersection k in a certain period of time in the n-step division.

Based on the directed road connection graph of the real road network, a weighted transition probability $K_{ij}$ is defined. It means that, when a policy is executed, the probability of the state transitioning from s to s' equals the sum, over all actions available in that state, of the probability of executing the action multiplied by the probability that the action moves the state from s to s':

$$K_{ij} = P_{ij} \cdot C_{ij} \cdot \alpha_{ij}$$

where $\alpha_{ij}$ indicates the weight of the flow change between intersection i and intersection j.

Those skilled in the art will appreciate that, in a Markov model, the closer a historical state is to the present time, the greater its influence on the decision for the state at the next time. That is, a node closer to the traffic prediction time has a higher weight, and a node farther away has a lower weight. Thus, in some examples, let $\alpha_i$ be the weight of intersection i and $\alpha_j$ the weight of intersection j; when i < j, $\alpha_i > \alpha_j$. Hence, for $\alpha_i > 0$, $1 \le i \le k$, $\alpha_i$ is a non-strictly decreasing function.

Data analysis and statistics can be carried out on the cloud control basic platform, and empirical values are selected for the flow time series between intersection i and intersection j. Then, within a road network containing m intersections, the transfer matrix $X_{m \times n}$ is:

Through the state transition matrix, the state $S_{t+1}$ at the next moment, after action a is executed according to the decision in state $S_t$, can be obtained. That is, from the intersection state $S_{kt}$ at time t and the decision $a_{kt}$, the sequence matrix of intersection states in the road network at time t+1 is obtained.

4. Reward function R

Specifically, the reward function $R(S_t, a_t, S_{t+1})$ represents the reward obtained when the system transitions to state $S_{t+1}$ after taking action $a_t$ in state $S_t$.

For example, the intersection service level may be divided into 6 grades, as shown in Table 1.

Let the average delay duration be d, with $d = d_1 + d_2$, where:

$d_1$: uniform delay, i.e., the delay produced by vehicles arriving uniformly, in s/pcu;

$d_2$: random additional delay, i.e., the additional delay produced by vehicles arriving randomly and causing oversaturated periods, in s/pcu;

C: cycle duration;

λ: green split of the calculated lane;

x: saturation of the calculated lane;

CAP: capacity of the calculated lane (pcu/h);

T: duration of the analysis period; in some examples, 0.25 h may be taken;

e: correction coefficient for the signal control type of a single intersection. In some examples, e may be taken as 0.5 for fixed-time control; for actuated control, e varies with the saturation and the green extension time, and preferably ranges from 0.04 to 0.5.

All data in the formula can be provided by the cloud platform in real time.
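The expressions for $d_1$ and $d_2$ are not reproduced in the text above. As a hedged illustration, the sketch below uses the widely used uniform-delay and incremental-delay forms of the Highway Capacity Manual 2000, which employ the same set of quantities (C, λ, x, CAP, T, e); treating these as the intended formulas is an assumption, and the function name is hypothetical.

```python
import math


def average_delay(C, lam, x, cap, T=0.25, e=0.5):
    """Average delay d = d1 + d2 in s/pcu (assumed HCM-2000-style forms).

    C: cycle duration (s); lam: green split, 0 < lam < 1; x: saturation;
    cap: lane capacity (pcu/h); T: analysis period (h); e: control-type factor.
    """
    if not 0.0 < lam < 1.0 or cap <= 0 or T <= 0:
        raise ValueError("expected 0 < lam < 1, cap > 0 and T > 0")
    # Uniform delay: vehicles arriving at a constant rate.
    d1 = 0.5 * C * (1 - lam) ** 2 / (1 - min(1.0, x) * lam)
    # Random additional delay: random arrivals and oversaturated periods.
    d2 = 900 * T * ((x - 1) + math.sqrt((x - 1) ** 2 + 8 * e * x / (cap * T)))
    return d1 + d2
```

With zero saturation the second term vanishes and only the uniform delay $0.5\,C\,(1-\lambda)^2$ remains, which is a quick sanity check on the expression.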

In some examples, the reward after an action is executed may include an instant reward and an additional reward that affects the future. For the k intersections in the road network, the instant reward $R_t$ obtained at time t after an action is executed according to the strategy is expressed as:

$$R_t = R(S_t, \pi(S_t))$$

1) If the service level grade rises after regulation, the return value is $R_t = 1$.

2) If the service level grade falls after regulation, the return value is $R_t = -1$.

3) If the service level grade remains basically unchanged after regulation, the return value is $R_t = 0$.

4) Otherwise, if the regulation effect is not significant enough to be judged good or bad, the return value is $R_t = 0$.
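The reward rules above can be sketched as a comparison of service level grades before and after regulation. Table 1 is not reproduced in the text, so the delay thresholds below are placeholder values in the style of the Highway Capacity Manual and are an assumption, as are the function names and the convention that grade 1 is the best and grade 6 the worst.

```python
from bisect import bisect_left

# ASSUMED upper delay bounds (s/pcu) for grades 1..5; grade 6 is anything above.
ASSUMED_THRESHOLDS_S = (10, 20, 35, 55, 80)


def service_level(delay_s, thresholds=ASSUMED_THRESHOLDS_S):
    """Grade 1 (best) to len(thresholds) + 1 (worst) for an average delay."""
    return bisect_left(thresholds, delay_s) + 1


def instant_reward(delay_before_s, delay_after_s):
    """+1 if the grade improves, -1 if it worsens, 0 otherwise (rules 1 to 4)."""
    before = service_level(delay_before_s)
    after = service_level(delay_after_s)
    if after < before:
        return 1
    if after > before:
        return -1
    return 0
```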

Based on the above rules, the cumulative return value $R_t$ can be represented by the following formula:

The goal of the MDP is to find the optimal strategy π that maximizes the expected cumulative reward $E(R_t \mid s, \pi)$, where the cumulative reward $R_t$ is:

5. Discount factor γ

The discount factor γ controls the relative importance of the instant reward and the future reward and takes a value between 0 and 1, i.e., $\gamma \in (0, 1)$. Choosing a small γ means that the agent's actions focus more on the instant reward.
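The effect of γ can be seen in a small calculation. The sketch below assumes the standard discounted sum $r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots$ as the cumulative reward; this standard form is an assumption here, and the function name is hypothetical.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k], accumulated from the last reward backward."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

For the reward sequence `[1, 0, -1]`, a smaller γ shrinks the contribution of the later penalty, so the return is dominated by the instant reward.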

And then, the neural network model of deep reinforcement learning can be used as a nonlinear function approximator to solve the optimal solution of intelligent traffic signal control.

Specifically, in reinforcement learning, the goal of the agent is to learn an action selection policy π that guides its action selection so as to maximize the expected return, i.e., to select the sequence of actions that obtains the greatest average reward. Here this means that the set of actions executed by the system for all intersection agents of the road network at each time is assembled into an optimal decision sequence. Starting from time t, the system makes a decision in state $S_t$ and executes $a_t$, obtains the state $S_{t+1}$ at the next moment and the next decision action $a_{t+1}$, and finally traverses the actions at all decision times of each intersection of the road network to obtain a decision sequence. Each decision sequence can be regarded as one episode of the MDP.

The probability π(s, a) that the system executes action a in state s according to the policy is:

$$\pi(s, a) = P[A_t = a \mid S_t = s]$$

The action-state value function $Q_\pi(s, a)$ is defined to evaluate the expected reward of the policy. It represents the mathematical expectation of the reward obtained when the system, starting from state s, makes sequential decisions according to the policy function π, namely:

According to the Bellman equation, the action-state value function of the t-th decision is related only to the action-state value function of the (t−1)-th decision, so the action-state value function can be simplified as:

The optimal solution is found by maximizing the action-state value function Q(s, a) through a greedy strategy, that is, the decision behavior of the system starting from an arbitrary state s makes the action-state value function $Q_\pi(s, a)$ take its maximum value:

In some examples, a reinforcement learning algorithm based on cooperative Q-learning can be used to obtain the optimal strategy π. By integrating a Q-value transfer strategy into deep learning, an MLP evaluation network is constructed that takes the influence of adjacent intersections into account, automatically extracts features from the raw state, and approximates the optimal Q value.
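Greedy maximization of Q(s, a) and a Bellman-style update can be illustrated in tabular form. The sketch below is the textbook single-agent Q-learning rule, not the cooperative, network-based variant of the application; the learning rate `alpha`, the function names and the dictionary-based Q table are assumptions.

```python
from collections import defaultdict


def greedy_action(q, state, actions):
    """Pick the action with the largest Q value in the given state."""
    return max(actions, key=lambda a: q[(state, a)])


def q_update(q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_b Q(s',b) - Q(s,a))."""
    best_next = max(q[(s_next, b)] for b in actions)
    q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])


# Usage: a Q table that defaults to 0 for unseen state-action pairs.
q_table = defaultdict(float)
phases = ["NSG", "EWG", "NSLG", "EWLG"]
q_update(q_table, 0, "NSG", 1, 1, phases)
```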

In some examples, a target network may also be introduced into the model to assist the intersection evaluation network, with the calculation performed according to the following formula:

the target Q value may be defined as:

wherein the action of agent depends not only on its own Q value but also on the Q values of the adjoining intersections.

After the Q value of the adjacent intersection agents is transferred, the Q value of each intersection i can be updated according to the following formula:

where $\theta_i$ and

are respectively the parameters of the evaluation network and the target network, N is the number of intersections adjacent to intersection i, and $\omega_{i,j}$ is the weight of the Q value from intersection j. In practical application, different weights can be set according to the influence of the adjacent intersection j on intersection i.

In some examples, the value of $\omega_{i,j}$ can be determined by the following formula:

where $c_1$ and $c_2$ are proportionality coefficients, $d_{ij}$ represents the distance from the i-th intersection to the j-th intersection, and $T_{ij}$ represents the traffic flow from the i-th intersection to the j-th intersection. Specifically, the closer the adjacent intersection and the greater the traffic flow, the greater the influence.

In addition, the loss function for each agent can be determined according to the following formula:

wherein m is the size of the batch,

is in a state

The optimal target Q value for all of the next actions,

to evaluate the output of the network.

Referring to fig. 5, which is a schematic diagram of the above process:

Specifically, at each time step t, the state s observed by the agent (i.e., the traffic flow sequence at the current time) is input into the evaluation network. According to the Q values output by the evaluation network, the agent uses a greedy strategy to select an action a to execute (that is, the phase sequence, cycle duration and green duration are selected according to the intersection flow condition at the current moment, which determines the timing scheme of the intersection signal control), obtains a reward r, and enters the next state s';

The information {s, a, r, s'} obtained at each time step from the interaction between the current intersection agent and the environment is stored in an experience pool M. During training, a batch of samples of a given size is randomly drawn from M each time and trained through a double Q network, in which the two networks have the same structure but two different sets of parameters, so that action selection and strategy evaluation are separated;

When the value functions for the current Q value and the target Q value are calculated, the corresponding experience is sampled from the experience pool of the upstream intersection agent, the optimal Q value of the downstream intersection agent is calculated using the evaluation network of the upstream intersection agent, the Q value is transferred to the current network to calculate the loss function, and each parameter of the timing scheme is updated using a gradient descent algorithm.
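The separation of action selection and strategy evaluation described above can be sketched with the usual Double DQN target, in which the evaluation network chooses the next action and the target network scores it. The sketch omits the Q-value transfer from adjacent intersections; the mean-squared-error loss over a batch of size m is an assumption, since the loss formula is not reproduced in the text, and both function names are hypothetical.

```python
def double_q_target(reward, gamma, q_eval_next, q_target_next, done=False):
    """y = r + gamma * Q_target(s', argmax_a Q_eval(s', a)).

    q_eval_next / q_target_next: Q values of the next state s' for every
    action, as produced by the evaluation and target networks respectively.
    """
    if done:
        return reward
    # Action selection uses the evaluation network...
    best = max(range(len(q_eval_next)), key=q_eval_next.__getitem__)
    # ...while its value is read from the target network.
    return reward + gamma * q_target_next[best]


def mse_loss(targets, predictions):
    """Mean squared error over a batch of size m (assumed loss form)."""
    m = len(targets)
    return sum((y - q) ** 2 for y, q in zip(targets, predictions)) / m
```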

Then, the evaluation model can be trained by using data of a certain period collected by the cloud platform side, a training set, a verification set and a test set are constructed by historical traffic data collected in one month, and iteration is performed month by month.

Similarly, for the convenience of understanding the present solution, a process example for training a traffic signal control model constructed based on a markov decision process is also provided herein.

Specifically, as shown in fig. 6, for any intersection, assume that four intersections are adjacent to the current intersection agent, numbered $n_1$, $n_2$, $n_3$, $n_4$. Then, at time t, the current intersection agent holds all of its own historical data, which can be expressed as:

(s1,a1；s2,a2；…st,at)

the state action data sets of the four adjacent intersections required to be acquired at the time t are as follows:

(sn1,an1；sn2,an2；…snt,ant)

the observation S of the current agent at time t is represented as a collection of the two sets:

S＝(s1,a1；s2,a2…st,at；sn1,an1；sn2,an2；sn3,an3；sn4,an4；)

for any multi-intersection traffic network, the training steps of the deep reinforcement learning algorithm based on the Markov decision process are as follows:

step 1: initializing road network crossing agent _i State matrix of (1), evaluation network parameter theta _i And target network parameters

Discount factor gamma, experience pool max _ size and min _ size, target network update step size C, instant reward value r of each agent is initialized, and iteration number upper limit Iter _max 。

Step 2: inputting the observed real-time data into an evaluation network, agent _i Selecting a phase action using a greedy strategy based on an output value of an evaluation network

Obtaining the grade change condition of the service level based on the calculation of the average delay time of the intersection in the real-time data and obtaining the reward

And enters the next state

Thereafter, an assignment is made according to t = t + 1.

And 3, step 3: will experience the experience

And storing the experience data in an experience pool Mi, deleting old experience data if the experience pool overflows, starting training when the number of the experience pools is more than min _ size, and entering the step 4, otherwise, turning to the step 2.

And 4, step 4: taking data sampled in the current experience pool Mi as input data of a current evaluation network and a target network, and calculating a current value function and a target value function; corresponding historical traffic data are sampled from an experience pool of the adjacent intersection, the historical traffic data are input into an evaluation network of the adjacent intersection, a transfer Q value of the adjacent intersection is obtained, and a loss function is calculated according to a formula.

And 5: updating the network weight theta of the current intersection _i And

and repeat the above calculations for each intersection agent.

Step 6: if t is<Iter _max And s _t Not terminal status), go to step 2.
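The experience pool used in steps 1 and 3 can be sketched as a bounded buffer. This is a minimal illustration, assuming that "deleting old experience data" means dropping the oldest entry first; the class and method names are hypothetical.

```python
import random
from collections import deque


class ExperiencePool:
    """Bounded pool of (s, a, r, s_next) tuples with random batch sampling."""

    def __init__(self, max_size, min_size):
        # A deque with maxlen silently drops the oldest item on overflow.
        self.buffer = deque(maxlen=max_size)
        self.min_size = min_size

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def ready(self):
        """Training starts once the pool holds more than min_size items."""
        return len(self.buffer) > self.min_size

    def sample(self, batch_size):
        """Draw batch_size distinct experiences uniformly at random."""
        return random.sample(list(self.buffer), batch_size)
```

A bounded `deque` keeps the overflow handling of step 3 implicit, at the cost of copying the buffer to a list on each sampling call.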

And then, a test set can be constructed according to the parameters of the trained model, and a test effect is obtained.

Further, if the training test effect reaches the accuracy required by actual use, the model algorithm is integrated into the intelligent traffic signal control system, and the intelligent traffic signal control system performs calculation and feedback training in an actual dynamic environment, so that the system can obtain the sequence decision of the maximum accumulated return under the traffic condition based on real-time change.

Compared with the related art, the embodiment of the present application combines deep reinforcement learning: state-action values are stored in a deep neural network indexed by s and a, and the neural network is updated by continuously interacting with the environment and obtaining feedback from the reward function, so that the state-action values stored in the network can correctly guide the agent to execute the sequence of decisions with the highest return in the environment. Meanwhile, the constructed intelligent traffic signal control model can dynamically adjust the phase sequence of the intersection signal timing and the duration of each phase according to real-time traffic flow information, and is highly adaptable.

In addition, the embodiment of the present application further provides an apparatus, which has a structure as shown in fig. 7 and includes a memory 11 for storing computer readable instructions and a processor 12 for executing the computer readable instructions, wherein when the computer readable instructions are executed by the processor, the processor is triggered to execute the control method of the traffic signal.

In some examples, the device may be an autonomous driving controller.

The methods and/or embodiments of the present application embodiments may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer-readable medium, the computer program comprising program code for performing the method illustrated by the flow chart. The computer program, when executed by a processing unit, performs the above-described functions defined in the method of the present application.

It should be noted that the computer readable medium described herein can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.

Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).

The flowchart or block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

As another aspect, the present application further provides a computer-readable medium, which may be included in the apparatus described in the foregoing embodiment, or may exist separately without being incorporated into the apparatus. The computer-readable medium carries one or more computer-readable instructions executable by a processor to perform the steps of the method and/or solution of the embodiments of the present application as described above.

In a typical configuration of the present application, the terminal and the devices serving the network each include one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media include both permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, Phase Change Memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, Compact Disc Read Only Memory (CD-ROM), Digital Versatile Disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device.

In addition, the embodiments of the present application further provide a computer program that is stored in a computer device and causes the computer device to execute the method carried out by the control code.

It should be noted that the present application may be implemented in software and/or a combination of software and hardware, for example, using Application Specific Integrated Circuits (ASICs), general purpose computers, or any other similar hardware devices. In some embodiments, the software programs of the present application may be executed by a processor to implement the above steps or functions. As such, the software programs (including associated data structures) of the present application can be stored in a computer-readable recording medium, such as RAM, a magnetic or optical drive, a diskette, or the like. Additionally, some of the steps or functions of the present application may be implemented in hardware, for example, as circuitry that cooperates with the processor to perform various steps or functions.

It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the apparatus claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.

Claims

1. A method of controlling a traffic signal, the method comprising:

acquiring real-time traffic flow data of a target road intersection;

predicting the traffic flow condition of the next moment of the target road intersection through a traffic signal control model constructed based on a Markov decision process according to the real-time traffic flow data to obtain a prediction result;

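executing a control strategy on the traffic signal according to the prediction result;

wherein the construction factors of the traffic signal control model comprise a state space and an action space; the state space is used for representing the state of vehicle flow of the target road intersection at each time interval; and the action space is used for representing signal control strategies of the target road intersection in different states at different time intervals.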

2. The method of claim 1, wherein:

the state space is determined according to the speed characteristics of the vehicles at the target road intersection and the density characteristics of the vehicles;

and the action space is determined according to the phase sequence of the traffic signals of the target road intersection, and the cycle duration and the green signal ratio of the corresponding signal lamps under different phase sequences.
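As an illustration only, the action space of claim 2 could be enumerated as the combinations of phase sequence, cycle duration, and green signal ratio. The sketch below assumes a single green ratio per action and uses hypothetical names (`SignalAction`, `build_action_space`) and hypothetical candidate values; none of these are specified in the application.

```python
from dataclasses import dataclass
from itertools import product
from typing import Tuple


@dataclass(frozen=True)
class SignalAction:
    """One candidate signal control strategy (hypothetical structure)."""
    phase_sequence: Tuple[str, ...]  # order in which the phases are served
    cycle_s: int                     # cycle duration in seconds
    green_ratio: float               # green time divided by cycle duration


def build_action_space(phase_sequences, cycle_durations_s, green_ratios):
    """Enumerate every combination of phase sequence, cycle and green ratio."""
    return [
        SignalAction(tuple(seq), cycle, ratio)
        for seq, cycle, ratio in product(
            phase_sequences, cycle_durations_s, green_ratios
        )
    ]


# Candidate values are assumptions chosen only for illustration.
actions = build_action_space(
    phase_sequences=[("NS", "EW"), ("EW", "NS")],
    cycle_durations_s=[60, 90, 120],
    green_ratios=[0.4, 0.5, 0.6],
)
print(len(actions))  # 2 * 3 * 3 = 18
```

A real intersection would normally carry one green ratio per phase; the single value here keeps the sketch short.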

3. The method of claim 2, wherein the method for determining the state space comprises:

determining the speed characteristic and the density characteristic according to the real-time traffic flow data;

determining a flow characteristic of the vehicle according to the speed characteristic and the density characteristic;

and determining the state space according to the flow characteristics of the vehicle.

4. The method of claim 3, wherein determining the state space based on the flow characteristics of the vehicle comprises:

dividing the flow characteristics of the vehicle according to the traffic state of the target road intersection at each time interval;

and obtaining the state space according to each of the divided flow characteristics of the vehicle.
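The steps of claims 3 and 4 can be sketched as follows, under stated assumptions. The flow is derived from speed and density with the fundamental relation of traffic flow (flow = speed × density), and each time interval's flow is then assigned to a discrete band. The thresholds in `FLOW_BOUNDS`, the band names, and the function names are hypothetical; the application does not give concrete values.

```python
from bisect import bisect_right


def flow_veh_per_h(speed_kmh: float, density_veh_per_km: float) -> float:
    """Fundamental relation of traffic flow: flow = speed * density."""
    return speed_kmh * density_veh_per_km


# Hypothetical flow thresholds (veh/h) separating the discrete states.
FLOW_BOUNDS = [600.0, 1200.0, 1800.0]
STATE_NAMES = ["low_flow", "medium_flow", "high_flow", "very_high_flow"]


def flow_to_state(flow: float) -> int:
    """Return the index of the flow band that `flow` falls into."""
    return bisect_right(FLOW_BOUNDS, flow)


def state_sequence(samples):
    """Map (speed, density) samples, one per time interval, to state indices."""
    return [flow_to_state(flow_veh_per_h(v, k)) for v, k in samples]


# One (speed km/h, density veh/km) sample per time interval, made up for the demo.
samples = [(50.0, 8.0), (40.0, 25.0), (20.0, 70.0), (10.0, 190.0)]
print([STATE_NAMES[s] for s in state_sequence(samples)])
```

Because flow alone does not distinguish free-flowing from congested traffic at the same flow value, a fuller state could also keep speed or density as a separate component.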

5. The method of claim 1, wherein performing a control strategy on the traffic signal based on the prediction result comprises:

after the control strategy is executed on the traffic signal according to the prediction result, receiving feedback information sent by the traffic signal control model;

and adjusting the control strategy executed on the traffic signal according to the feedback information.

6. The method of claim 5, wherein the construction factors of the traffic signal control model further comprise a mapping relation between the intersection service level and the average delay time of vehicles, the average delay time representing the time a vehicle loses waiting at a red light at the intersection;

and receiving the feedback information sent by the traffic signal control model comprises:

after the traffic signal control model determines the average delay time of the vehicle according to the mapping relation between the intersection service level and the average delay time of the vehicle, receiving feedback information sent by the traffic signal control model according to the average delay time of the vehicle.
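A minimal sketch of such a mapping is shown below. The delay bands are an assumption borrowed from the signalized-intersection level-of-service criteria of the Highway Capacity Manual (A up to 10 s, B up to 20 s, C up to 35 s, D up to 55 s, E up to 80 s, F above); the application does not state which table it uses. The `feedback` function, which simply returns the negative delay, is likewise hypothetical.

```python
from bisect import bisect_left

# Assumed upper delay bounds (s/vehicle) for levels A..E; anything above is F.
LOS_UPPER_BOUNDS_S = [10.0, 20.0, 35.0, 55.0, 80.0]
LOS_LEVELS = "ABCDEF"


def level_of_service(avg_delay_s: float) -> str:
    """Look up the service level whose delay band contains `avg_delay_s`."""
    return LOS_LEVELS[bisect_left(LOS_UPPER_BOUNDS_S, avg_delay_s)]


def feedback(avg_delay_s: float) -> float:
    """Hypothetical feedback value: the shorter the delay, the higher the value."""
    return -avg_delay_s


for delay in (8.0, 42.0, 95.0):
    print(delay, level_of_service(delay), feedback(delay))
```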

7. The method of claim 5, wherein adjusting the control strategy implemented on the traffic signal based on the feedback information comprises:

adjusting, through a deep reinforcement learning algorithm, the control strategy executed on the traffic signal according to the feedback information.

8. The method of claim 7, wherein the adjusting, by the deep reinforcement learning algorithm, the control strategy performed on the traffic signal according to the feedback information comprises:

evaluating the traffic state at the future moment through the deep reinforcement learning algorithm to obtain an evaluation result;

and adjusting the control strategy executed on the traffic signal by combining the evaluation result and the feedback information.
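To illustrate how an evaluation of the future state can be combined with the feedback, the sketch below uses tabular Q-learning. This is a deliberate simplification: the claims refer to a deep reinforcement learning algorithm, which would replace the table with a neural network, and the application does not name the specific algorithm. The class name, hyperparameters, and state and action encodings are all assumptions.

```python
import random
from collections import defaultdict


class TabularQController:
    """Tabular Q-learning stand-in for the deep RL step (illustrative only)."""

    def __init__(self, n_actions, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
        self.n_actions = n_actions
        self.alpha = alpha      # learning rate
        self.gamma = gamma      # discount applied to the future evaluation
        self.epsilon = epsilon  # probability of trying a random action
        self.q = defaultdict(lambda: [0.0] * n_actions)
        self.rng = random.Random(seed)

    def choose_action(self, state):
        """Epsilon-greedy choice among the signal control actions."""
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(self.n_actions)
        values = self.q[state]
        return values.index(max(values))

    def update(self, state, action, reward, next_state):
        """Combine the feedback (reward) with the evaluation of the next state."""
        future_value = max(self.q[next_state])       # evaluation result
        target = reward + self.gamma * future_value  # feedback + evaluation
        self.q[state][action] += self.alpha * (target - self.q[state][action])


controller = TabularQController(n_actions=3)
# State and action indices and the reward (negative delay) are made-up values.
controller.update(state=2, action=1, reward=-42.0, next_state=1)
print(controller.q[2])
```

Each call to `update` moves the stored value of the executed action toward the sum of the received feedback and the discounted best value of the next state, which is one common way to realise the "evaluate, then adjust" pattern described in claim 8.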

9. A control device for traffic signals, characterized in that the device comprises:

one or more processors; and

a memory storing computer program instructions that, when executed, cause the one or more processors to perform the method of any of claims 1 to 8.

10. A computer readable medium having stored thereon computer program instructions executable by a processor to implement the method of any one of claims 1 to 8.