WO2022258943A1 - Traffic control system - Google Patents

Traffic control system

Info

Publication number
WO2022258943A1
WO2022258943A1 (PCT/GB2022/051240; GB2022051240W)
Authority
WO
WIPO (PCT)
Prior art keywords
agent
junction
traffic control
action
machine learning
Prior art date
Application number
PCT/GB2022/051240
Other languages
English (en)
Inventor
Shaun HOWELL
Ahmed YASIN
Maksis KNUTINS
Krishna MOOROOGEN
Original Assignee
Vivacity Labs Ltd
Priority date
Filing date
Publication date
Application filed by Vivacity Labs Ltd filed Critical Vivacity Labs Ltd
Publication of WO2022258943A1

Classifications

    • G06N3/006 (Computing arrangements based on biological models): Artificial life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • G06N3/04 (Neural networks): Architecture, e.g. interconnection topology
    • G06N3/045 (Neural networks): Combinations of networks
    • G06N3/084 (Learning methods): Backpropagation, e.g. using gradient descent
    • G08G1/0116 (Traffic control systems for road vehicles): Measuring and analyzing of parameters relative to traffic conditions based on the source of data from roadside infrastructure, e.g. beacons
    • G08G1/0125 (Traffic control systems for road vehicles): Traffic data processing
    • G08G1/0145 (Traffic control systems for road vehicles): Measuring and analyzing of parameters relative to traffic conditions for specific applications for active traffic flow control
    • G08G1/081 (Controlling traffic signals): Plural intersections under common control

Definitions

  • the present invention relates to a traffic control system, in particular a system utilising an intelligent agent trained by reinforcement learning to control traffic by controlling the traffic signals at, for example, multiple junctions in a town or a city.
  • Traffic in a city is controlled primarily by traffic signals at junctions.
  • traffic signals keep junctions safe by ensuring that only vehicles coming from particular lanes are able to enter the junction at one time, reducing the risk of collision.
  • signal control provides a clear advantage over just setting out rules as to “rights of way” and relying on drivers to comply with them, since the signal control should ensure that all drivers are given access to the junction within a reasonable length of time, reducing frustration and managing fair and safe access to the shared road space.
  • Traffic signals at junctions are preferably configured as far as possible to keep traffic moving and ensure that the available road space is utilised in the most efficient way. Hence it is common to provide at least some sensors at junctions so that access to the junction is provided taking into account current demand from particular directions, i.e. queues of traffic approaching the junction from a particular lane. Traffic signals at junctions may also be controlled in an attempt to optimise according to certain other goals, for example ensuring that buses run on time by controlling traffic to keep bus routes clear as a priority.
  • WO2020225523 discloses a machine learning agent, primarily trained by reinforcement learning in a simulation.
  • the agent is optimised by its training to maximise performance against goals which can be set according to current policy objectives.
  • the agent will change its strategy if the goals change, and also continually adapts to changes in traffic patterns caused by various external factors.
  • these reinforcement-learning based agents provide a very flexible traffic control system which avoids the need for manual, expensive and often non-optimal calibration at regular intervals.
  • the agents of WO2020225523 each control a single junction.
  • each junction is essentially controlled by its own trained agent.
  • the extent to which traffic flow through an entire city-wide network can be optimised is therefore limited.
  • An agent controlling a single junction may make what appears to the agent, according to its training, to be a good decision, but which creates a state in the network as a whole which makes things difficult (i.e. reduces the expected reward value of available actions) for other agents.
  • a single neural network-based agent can be trained using reinforcement learning to control multiple junctions. Controlling two or more junctions is not conceptually very different from controlling one particularly large and complex junction. However, the time taken to train an agent to a point where it will perform well increases as the complexity of the junction increases. The complexity of a single neural network increases exponentially with the number of junctions. Even with the parallelised simulation-based training disclosed, an agent which controls even a few tens of junctions (perhaps the central area of a small town, certainly far short of a major city) takes too long to train. Since one of the key advantages of these machine-learning based systems over manual calibration is the ability to continually re-train and re-deploy agents according to changing circumstances and changing priorities, long training times are undesirable and very long training times make the system useless.
  • a traffic control system for use in controlling a road network comprising multiple junctions, the traffic control system comprising: a plurality of sensors for monitoring vehicles and/or other road users at and around each junction; a traffic control agent subsystem; and traffic signals including signal outputs for controlling the vehicles and/or other road users at each junction, the sensors providing inputs to the traffic control agent subsystem, and the traffic control agent subsystem controlling the traffic signals to optimise traffic flow in accordance with one or more goals, in which the traffic control agent subsystem includes a machine learning agent trained by reinforcement learning, the machine learning agent comprising a neural network including: a shared state embedding subnetwork comprising an input layer, one or more connected hidden layers and a shared state output layer; a global value subnetwork comprising an input layer connected to the shared state output layer, one or more connected hidden layers, and a global value output layer representing the overall value of a current state of the roads; for each junction, an advantage subnetwork comprising an input layer connected to the shared state output layer, a plurality of connected hidden layers, and a junction advantage output layer representing the advantage expected for each action which could be taken at the junction; and, for each junction, an aggregation layer which combines the junction advantage output layer with the global value output layer to give the expected value (Q-value) of each action which could be taken at the junction.
  • Each junction advantage subnetwork is independent of the other junction advantage subnetworks.
  • the neural network as a whole is “branched”, with a branch per junction being controlled. This means that the complexity scales about linearly with the number of junctions, and so networks can be realistically produced to control, for example, all junctions in a city.
  • the global value subnetwork ensures that the global (city-wide) context is provided when training the network, so that the effects of decisions made on the road network as a whole are taken into account.
  • the network uses what is known as a “duelling” architecture.
  • the global value output layer can be thought of as representing the overall value of a state of the roads.
  • the junction advantage output layer associated with a particular junction represents the advantage, or improvement, expected for each alternative action which could be taken at the junction.
  • a vector representing the expected value of the state of the roads following the taking of each possible action is calculated. This is the Q-value vector of the well-known Q-learning algorithm.
  • the network can be updated by substituting elements of the Q-value vector for known observed results, calculating a loss vector, and updating by backpropagation.
  • the aggregation layer performs a static aggregation to its inputs.
  • the aggregation layer is not a layer with learnable parameters. Hence the Q-value vector is always calculated in a consistent way from the junction advantage output layers and the global value output layer.
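By way of illustration only, the sketch below shows one way the branched duelling architecture described above could be implemented. It is written in PyTorch; the layer sizes, the two-hidden-layer depth, the subtraction of the mean advantage in the aggregation step and all names are assumptions made for the example rather than features prescribed by this disclosure.

```python
import torch
import torch.nn as nn

class BranchedDuellingQNet(nn.Module):
    """Shared state embedding, one advantage branch per junction, and a single
    global value branch, aggregated by a static (parameter-free) rule."""

    def __init__(self, state_dim, actions_per_junction, hidden=2048, branch_hidden=1024):
        super().__init__()
        # shared state embedding subnetwork
        self.shared = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # global value subnetwork: scalar value of the whole road network state
        self.value = nn.Sequential(
            nn.Linear(hidden, branch_hidden), nn.ReLU(),
            nn.Linear(branch_hidden, 1),
        )
        # one independent advantage subnetwork per junction
        self.advantages = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden, branch_hidden), nn.ReLU(),
                nn.Linear(branch_hidden, n_actions),
            )
            for n_actions in actions_per_junction
        ])

    def forward(self, state):
        z = self.shared(state)                      # shared embedding, copied to every branch
        v = self.value(z)                           # (batch, 1) global state value
        q_per_junction = []
        for adv_net in self.advantages:
            a = adv_net(z)                          # (batch, n_actions) advantages for this junction
            # static aggregation with no learnable parameters; subtracting the mean
            # advantage is the conventional duelling-network choice and is assumed here
            q_per_junction.append(v + a - a.mean(dim=1, keepdim=True))
        return q_per_junction                       # list of per-junction Q-value vectors

# example: three junctions offering 4, 3 and 5 candidate stages respectively
net = BranchedDuellingQNet(state_dim=128, actions_per_junction=[4, 3, 5])
q = net(torch.randn(2, 128))                        # batch of two states
print([tuple(t.shape) for t in q])                  # [(2, 4), (2, 3), (2, 5)]
```

Because each advantage branch sees only the shared embedding, the total parameter count in a model of this shape grows roughly linearly with the number of junctions.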
  • the network is primarily trained using a simulation model.
  • a road network simulation model may accept inputs of traffic scenarios and inputs of control decisions, and provide outputs of traffic patterns as a result of the control decisions made.
  • the road network simulation model simulates an entire road network, spanning for example at least a substantial area of a town or city.
  • the road network includes multiple junctions.
  • the neural network agent may be trained by applying control decisions made by the agent to the road network simulation model to collect “memories”. The control decisions are made as a result of the output of the junction value layer of the network when the network is applied to a particular input state.
  • the output of the junction value layer is an expected value associated with each possible action.
  • the agent may take actions both in an “exploitation” mode, where the action is taken which is expected to be the best action, i.e. the action with the best expected value according to current learning, and in an “exploration” mode, where the agent deviates from time to time from the “best” action in order to explore the policy space.
  • training may take place at high speed (i.e. anything faster than real-time, but potentially much faster). Also, training may be parallelised, whereby multiple copies of agents train in multiple copies of the simulation model, noting that due to the exploration which may take place, the same agent in the same simulation may make different choices and thus collect different memories.
  • the memories which are built up by operating the agents in simulations form the basis of updating the agents by reinforcement learning.
  • agents may be continually trained in an agent training system, while a “current best” agent is deployed in a live traffic control system to actually control the traffic in the road network.
  • the agent in the live traffic control system may be replaced as and when a better agent becomes available from the agent training system.
  • agents may learn from real memories collected by the agent controlling real traffic in the live traffic control system. Such memories may be used to update the agent in the live traffic control system in some embodiments, and/or may be shared with the agent training system to be used in updating models currently being trained, in addition to the use of memories from simulations.
  • WO2020225523 discusses in more detail the different options available in terms of training agents in simulations and/or in a live system, for deployment of a best agent at a particular time into a live system. The full description of WO2020225523 is incorporated herein by reference.
  • embodiments of the invention may use techniques to engineer the actions chosen, to maximise learning and therefore convergence to good control strategies, while ensuring the agent training system can be realistically implemented with available hardware and that training can be completed in a reasonable amount of time.
  • exploration, i.e. the agent making a choice other than the best choice according to current learning, may be restricted to particular junctions, as described below.
  • the agent is controlling multiple junctions in a simulation, and making control decisions in relation to all of them.
  • a single junction may be nominated for the duration of a training episode (i.e. the agent being run on the simulator on a particular scenario) in relation to which exploration is allowed.
  • a different junction may be nominated for exploration each time a decision is made.
  • the junctions are cycled through so that each junction gets an equal opportunity for exploration.
  • a junction could be selected at random for exploration every time a decision is made.
  • an “exploration temperature”, e may be used to quantify how likely the agent is to take a random exploration action.
  • the probability of the agent taking the best action according to its current learning is therefore (1 - e).
  • the value of e may be reduced as the training episode progresses, so that random exploration becomes less likely towards the end of the episode, and “exploitation”, i.e. using the best learned strategy, becomes more common.
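A minimal sketch of per-junction exploration with a decaying exploration temperature is given below. The function names, the linear decay schedule and the use of plain Python lists are illustrative assumptions only.

```python
import random

def choose_actions(q_per_junction, exploring_junction, epsilon):
    """Take the best action at every junction except the nominated one, where a
    random action is taken with probability epsilon (illustrative sketch)."""
    actions = []
    for j, q_values in enumerate(q_per_junction):
        if j == exploring_junction and random.random() < epsilon:
            actions.append(random.randrange(len(q_values)))                      # explore
        else:
            actions.append(max(range(len(q_values)), key=q_values.__getitem__))  # exploit
    return actions

def decayed_epsilon(step, total_steps, start=0.5, end=0.02):
    """Reduce the exploration temperature as the episode progresses
    (linear decay here; exponential decay is another option)."""
    frac = step / max(1, total_steps)
    return start + (end - start) * frac
```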
  • the object in all cases is to create, in the simulations, a set of transitions which can be usefully used to update the neural network and improve its predictive performance in a reasonable amount of time.
  • Each memory is a list of transitions.
  • a transition consists of a (state, action(s), reward, next state) tuple. Note that in some embodiments a transition could include plural actions in the sense that signal changes may have been made at multiple junctions. However in many embodiments the action space is considered junction-by-junction as described in more detail below.
  • the state, next state and reward are generally at the level of the whole road network. Each transition describes a situation an agent was faced with, the decision it took, the immediate reward it received, and the state that it then ended up in.
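One possible in-memory representation of such a transition is sketched below; the field names and types are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict

@dataclass
class Transition:
    """One stored memory: the situation the agent faced, what it did,
    the immediate reward it received, and the state it ended up in."""
    state: Any                 # whole-road-network observation
    actions: Dict[str, int]    # junction id -> action taken (may cover only a subset of junctions)
    reward: float              # immediate reward from the reward function
    next_state: Any            # whole-road-network observation after the action(s)
```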
  • the reward is calculated according to a reward function which may be defined according to objectives which the managers of the road network want to achieve.
  • the reward function may take into account different goals with different weights. For example, a higher reward will be given when waiting times are lower, but when configuring the reward function a choice may be made, for example, to apply more weight to the waiting time for buses than for cars, to try to encourage use of public transport.
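A toy reward function of this kind is sketched below; the particular terms and weights are illustrative assumptions, not values taken from this disclosure.

```python
def reward(mean_wait_cars, mean_wait_buses, stop_starts,
           w_cars=1.0, w_buses=3.0, w_stops=0.5):
    """Higher reward for lower waiting times and fewer stop-starts,
    with bus waiting time weighted more heavily than car waiting time."""
    return -(w_cars * mean_wait_cars + w_buses * mean_wait_buses + w_stops * stop_starts)
```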
  • the starting state is forward-propagated through the network. This generates, according to the current “knowledge” of the network, a vector of Q-values, or “expected maximum total future rewards” for each action which could be taken at each junction.
  • a “ground truth” is then calculated by substituting into the vector the actual immediate reward for the actual action(s) taken in the transition, plus a weighted estimate of total future rewards in the next state associated with that transition.
  • a loss function is then calculated and backpropagation can take place to update the network.
  • a loss vector is calculated for a plurality of transitions, and the plural loss vectors are aggregated into a loss matrix.
  • the loss matrix is then used to derive one or more scalar loss values and this value, or values, are used to update the network by backpropagation.
  • an estimate of future rewards has to be made for the next state associated with the transition.
  • a bootstrapping technique is used.
  • the “next state” is forward-propagated through (a copy of) the network to obtain an estimate of the value of the next state, i.e. the estimated future rewards. This estimate of future rewards is then used together with the real observation (from the simulation) of the immediate reward of the action, to form the basis of the “ground truth” used to calculate the loss function.
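The construction of the "ground truth" described above can be sketched as follows; the discount factor, the vector shapes and the assumption that the bootstrap estimate comes from a reference copy of the network are illustrative.

```python
import numpy as np

def build_target(q_pred, taken_action, reward, next_state_value, gamma=0.95):
    """Substitute the observed immediate reward plus a discounted bootstrap
    estimate into the predicted Q-vector for the action actually taken."""
    q_target = q_pred.copy()
    q_target[taken_action] = reward + gamma * next_state_value
    return q_target

q_pred = np.array([1.2, 0.4, -0.3])   # network output for one junction in the stored state
next_state_value = 0.9                # from forward-propagating the next state through a reference copy
target = build_target(q_pred, taken_action=0, reward=0.5, next_state_value=next_state_value)
loss_vector = target - q_pred         # only the taken action contributes a non-zero error
```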
  • junctions preferably take actions asynchronously, i.e. there is no requirement that the traffic signals at every junction change at the same time. For this reason, in some embodiments transitions can contain a change of state for multiple junctions, but normally only for a subset of the junctions. In some embodiments the action space is considered junction-by-junction and therefore a transition contains exactly one action.
  • the weights on advantage subnetworks are only updated for the subnetworks associated with junction(s) which changed state in the relevant transition(s). This avoids backpropagating zero error vectors.
  • the extent to which weights are updated in the shared state embedding subnetwork may be modulated according to the number of junction actions associated with the transitions used in the update. For example, a transition (or minibatch of transitions) in which the signals change at four junctions can be used to update the advantage subnetworks associated with those four junctions, and to update the shared state embedding subnetwork and global value subnetwork. The advantage subnetworks associated with junctions in which the traffic signals did not change in that transition (or minibatch) will not be updated at all.
  • Update processes may take place only for a subset of the transitions collected as a result of simulation-based training.
  • a key advantage of simulation-based training is the ability to collect large numbers of memories in a short period of time, including by parallelizing the exploration stage.
  • the update stage cannot be parallelized to the same extent and is computationally intensive. Therefore, preferably a subset of transitions is chosen for use in updating the network.
  • only a subset of the transitions associated with a particular run of an agent in a simulation may be used in the update stages.
  • parallelizing the exploration stage still has the advantage that the transitions which end up being used in updating come from wide exploration across a diversity of states, but the updating can complete in a reasonable amount of time.
  • training the described duelling Q-network can be unstable.
  • the agent's performance can improve gradually, then rapidly become worse, then start to improve gradually again.
  • multiple agents may be trained using different samples from the store of transitions stored from the simulations. In other words, the whole training process may be repeated multiple times (for example, about 10 times).
  • the multiple trained agents can then be evaluated by testing their performance in the simulation, in order to choose a “best” agent.
  • the best agent may, possibly subject to further tests for suitability, be deployed to control a junction.
  • the agent being trained may be tested at regular intervals.
  • the best of the intermediate agents may be chosen, rather than the agent produced by the final update. The effect is essentially to identify when “overtraining” starts to make the agent worse, and take a copy of the good agent before that happened.
  • the global value subnetwork can be discarded from the version of the agent which is deployed to control traffic.
  • the global value subnetwork is not required to make decisions as to the next action according to the best strategy currently learned, but is used during training of the neural network to ensure that the overall (e.g. city wide) state of the roads is taken into account when updating weights.
  • discarding the global value subnetwork does not prevent memories being saved while the agent runs in the live traffic control system. These memories can still be used to train networks (complete networks which include the global value subnetwork) in the agent training system.
  • the input data provided to the neural network, at the input layer of the shared state embedding subnetwork, is envisaged to be engineered input data - for example, data indicating queue lengths and types of vehicles waiting at different lanes, etc.
  • the input data may be generated by other neural networks, for example convolutional neural networks trained to recognise features in video feeds from cameras at junctions.
  • the applicant’s previous application WO2018051200 describes identification and tracking of objects in a video feed.
  • Examples of input data which may be provided include queue length, speed, time of day, blocked junction exits, rolling mean queue length, and time since a pedestrian push button was pushed.
  • Figure 1 shows an outline schematic of a neural network traffic control agent according to the invention.
  • In Figure 1, the structure of a neural network traffic control agent used in the invention is shown.
  • the agent is used to control traffic signals at junctions.
  • In a road network in an urban area, for example a town or a city, there will be a large number of junctions.
  • Each junction has traffic signals and a traffic signal controller controls the signals at each junction.
  • a “junction” is not defined exactly in terms of the underlying structure of a road network - there may be borderline cases where a particular set of traffic signals might be controlled as one single complex junction with one traffic signal controller, or alternatively might be controlled by more than one traffic signal controller as multiple, individually more straightforward, junctions. For these purposes therefore a “junction” means the part of the road network controlled by a single traffic signal controller.
  • “City” is used as a shorthand to describe the extent of the wider road network, which comprises multiple junctions and is controlled by the described traffic control agent.
  • the “city” may be a town, suburb, or any other area which has a road network comprising multiple junctions.
  • a traffic signal controller can control traffic signals at its junction independently and autonomously. Indeed, it is important that the traffic signal controllers remain able to do this, so that traffic signals at junctions continue to cycle through their stages in the event of a malfunction of, or loss of communication with, the traffic control agent. However, the traffic signal controllers accept external input from the traffic control agent. The input to a traffic signal controller is a requested stage of the traffic signals at the respective junction. A “stage” is defined by which green (go) signals are showing on which lanes coming into the junction. The traffic signal controller will apply rules in order to get to that requested stage, if it can.
  • a traffic signal controller will accept a request to move to a particular stage if its rules allow it to go to the stage directly, or to go to a stage (referred to as a via stage) from which the requested stage can then be reached directly.
  • Moving from one stage (defined by the green signals) to the next stage may take a period of time. For example, in the UK changing a signal from red to green involves showing red and amber for a few seconds. Also, in a particular junction it may be necessary for example to wait for a length of time after one signal has been changed to red, before another signal can be changed to green.
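The acceptance rule can be pictured as a reachability check over the controller's permitted stage-to-stage moves, as in the sketch below; the stage names and the permitted-move table are purely hypothetical.

```python
# permitted direct stage-to-stage moves for a hypothetical controller
DIRECT_MOVES = {
    "A": {"B"},
    "B": {"A", "C"},
    "C": {"B"},
}

def accepts_request(current_stage, requested_stage):
    """A request is accepted if the requested stage can be reached directly,
    or via a single intermediate ("via") stage."""
    direct = DIRECT_MOVES.get(current_stage, set())
    if requested_stage in direct:
        return True
    return any(requested_stage in DIRECT_MOVES.get(via, set()) for via in direct)

assert accepts_request("A", "C")   # accepted, via stage B
```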
  • Input data 10 to the traffic control agent is encoded into an input layer.
  • the input data 10 defines the current state of the road network, in as much detail as possible.
  • the input data includes information as to the current state of all traffic signals in the network, as well as information as to the current state of traffic - e.g. where there are queues of traffic, how long the queues are, what types of vehicles (cars, buses, vans, lorries etc.) are where, whether there are pedestrians waiting to cross the road at controlled crossings, and so on.
  • Engineering of the input data is outside the scope of this disclosure, but various sensors and techniques will be familiar to the skilled person. In particular, some input data may come from cameras as described in WO2018051200.
  • the input data is processed by a shared state embedding subnetwork 12.
  • This subnetwork is a neural network comprising one or more connected hidden layers. For example, there may be one or two hidden layers with a few thousand nodes in each layer.
  • the width of the subnetwork, i.e. the number of nodes per layer, is expected to scale at worst about linearly with the number of junctions in the city.
  • the output layer of the shared state embedding subnetwork 12 may be thought of as being a representation of the state of the network (from the input layer) having been processed to recognise and emphasise pertinent features according to the learned weights in the shared state embedding subnetwork 12.
  • the output layer of the shared state embedding subnetwork 12 is “copied” as the input to each one of the junction advantage subnetworks 14a, 14b, and as the input to the global value subnetwork 16.
  • the global value subnetwork 16 is a neural network comprising one or more connected hidden layers. Again, for example there may be one or two hidden layers with a few thousand nodes in each layer. Again the width of the subnetwork is expected to scale at worst linearly with the number of junctions in the city.
  • the output layer of the global value subnetwork 16 may be thought of as a representation of the value of the current state, as represented by the input data 10.
  • the value, in the context of this reinforcement learning system, is dependent on the maximum expected future reward available starting at this state. Since the reward function is defined according to traffic management goals such as reducing congestion, reducing pollution, and ensuring public transport services run on time, the “value” of a particular state may be directly connected with how “good” the traffic situation in the city currently is.
  • the global value subnetwork 16, if trained successfully, will learn to accurately predict the expected future reward associated with states, and therefore how good a particular state is.
  • there is a junction advantage subnetwork 14a, 14b associated with each junction in the city road network.
  • in Figure 1, just two junction advantage subnetworks 14a, 14b are shown.
  • a junction is simply defined as the area controlled by a group of traffic signals which are controlled together and associated with one of the junction advantage layers 14.
  • Each junction advantage subnetwork is a neural network comprising one or more connected hidden layers. For example, there may be one or two hidden layers with about one or two thousand nodes in each layer.
  • the output of each junction advantage subnetwork 14a, 14b represents the expected advantage of each action which could be taken at that junction, given the current state according to input data 10.
  • the output of each junction advantage subnetwork 14a, 14b is aggregated with the output of the global value subnetwork 16. This is indicated in Figure 1 by the intersections at 18a, 18b.
  • the aggregation layers 18a, 18b consistently and deterministically calculate the predicted value, i.e. expected future reward, for each action which could be taken at each junction. This includes a component of the estimated value now, in the current state (from the global value subnetwork 16) and a component of the estimated advantage of each action (from the junction advantage subnetworks 14).
  • the output of the whole network is a vector of values associated with each action which could be taken, at the state represented by input data 10. These are known as the Q-values of the possible actions, in accordance with conventional notation in the literature.
  • if the neural network is trained successfully, then from the vector of Q-values an intelligent agent can infer the best action(s) to take in the state represented by input data 10.
  • the best expected future reward may be obtained by changing the traffic signals at one or more junctions, in accordance with the best Q- values.
  • if the neural network once trained does not need to be trained further, which could be the case in embodiments where agents once deployed do not learn anything further (but may be replaced at some point by new agents which have learned “offline”), then the global value subnetwork 16 could be omitted from the deployed agent. This is because the best Q-value will be the same as the best advantage value in the output of the junction advantage subnetworks, the output of the global value subnetwork essentially being a fixed offset applied to all advantage values in a particular state.
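The equivalence can be seen in the short sketch below: because the global value is a per-state offset shared by every action at a junction, removing it does not change which action has the highest Q-value. The numbers are arbitrary.

```python
import numpy as np

advantages = np.array([0.3, -0.1, 0.8])   # junction advantage outputs for one state
global_value = 2.5                         # same offset applied to every action's Q-value
q_values = global_value + advantages - advantages.mean()

# the selected action is identical with or without the value offset
assert int(np.argmax(q_values)) == int(np.argmax(advantages))
```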
  • the neural network is trained by collecting “memories”, or “transitions”, in training. When the network is being trained, it is presented with scenarios in the form of input values 10. The network then calculates value, advantage, and hence Q-values and makes a decision as to what action to take, i.e. which traffic signals at which junctions will be changed. Once that decision has been taken, it is applied to the junctions (by changing the relevant traffic signals) and the result is observed. The result of the action is the next state of the traffic system. As a result of the action, there will also be a reward value calculated. This is done according to a reward function which may be tailored (and changed from time to time) depending on what policy objectives are being pursued by those managing the traffic network. For example, the reward function may be biased to heavily penalise late-running buses, but take into account to a lesser extent private cars being held in queues.
  • Each transition is a (state, action, reward, next state) tuple.
  • the “ground truth” value substituted for the action taken is the observed immediate reward plus g × Qnextstate, where Qnextstate is the expected value of the next state, and g is a “discount factor” between 0 and 1, typically around 0.95.
  • the discount factor accounts for uncertainty around the future reward.
  • the reward is directly obtained from the stored transition, having been calculated according to the reward function. This represents the immediate reward associated with the action taken.
  • the expected value of the next state is obtained by forward-propagating the next state through the global value subnetwork and adjusting according to a discount function.
  • the extent to which the expected value “looks into the future” can be tuned, and generally algorithms are weighted to put more emphasis on rewards which can be expected to be realised sooner.
  • Options for ways of calculating the expected future reward will be known to the skilled person from the literature on Q-learning generally. To avoid overestimation, which is a characteristic problem of Q-learning, double Q-learning may be used, which again will be familiar in general to the skilled person. Double Q-learning involves using a different model (i.e. a model other than the one currently being updated) to estimate the value of the next state.
  • the model used to calculate the expected value of the next state, which may be referred to as a “reference model”, is held constant for n update steps. After n update steps, the reference model is replaced by the current model under training. This “snapshot” then remains in place as the reference model for a further n update steps.
  • the reference model may be consistently m steps behind the current model, i.e. the reference model is replaced at every update step with an earlier version of the model from before it had the last m updates.
  • other sources of reference model could be used for an implementation of double Q-learning.
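A minimal sketch of the first option (a reference model refreshed every n update steps) is given below; the class and parameter names are assumptions.

```python
import copy

class ReferenceModel:
    """Hold a frozen snapshot of the model for bootstrapping targets,
    refreshed every n update steps."""

    def __init__(self, model, refresh_every=1000):
        self.reference = copy.deepcopy(model)
        self.refresh_every = refresh_every
        self.updates = 0

    def after_update(self, model):
        self.updates += 1
        if self.updates % self.refresh_every == 0:
            self.reference = copy.deepcopy(model)   # snapshot the current model under training
```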
  • the Q-vector with the new elements substituted forms the “ground truth” from which a loss vector can be calculated.
  • the loss can then be backpropagated through the network, to update the weights.
  • a plurality of transitions may be forward propagated and substituted, and loss vectors calculated as described.
  • a loss matrix is thereby created, wherein each column of the matrix corresponds to a loss vector arising from a single transition. From the loss matrix a scalar loss value can be determined which may then be used to update the weights in a single backpropagation.
  • the set of transitions used to derive a single loss matrix is referred to as a “minibatch”.
  • future steps may be given decreasing weight, i.e. rewards expected to be realised sooner are worth more.
  • a predetermined number of future steps may be included, with rewards further into the future being discounted at an exponentially decaying rate.
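A sketch of such an exponentially discounted sum over a fixed horizon is shown below; the discount factor and the use of a bootstrap value beyond the horizon are assumptions.

```python
def discounted_return(rewards, bootstrap_value=0.0, gamma=0.95):
    """Sum a predetermined number of future rewards with exponentially decaying
    weight, plus a discounted bootstrap estimate beyond that horizon."""
    total = 0.0
    for k, r in enumerate(rewards):
        total += (gamma ** k) * r
    return total + (gamma ** len(rewards)) * bootstrap_value

print(discounted_return([1.0, 0.5, 0.25]))   # rewards expected sooner count for more
```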
  • Given that most transitions will contain actions which directly change the traffic signals in only a small minority of the junctions in the city, for a given transition many of the junction advantage subnetworks 14 will not have any Q-values associated with them substituted before the loss function is calculated. Even in some minibatches of transitions, it is possible in some embodiments that some junctions will not see an action and therefore will not have Q-values substituted. To avoid backpropagating zero errors, these subnetworks 14 simply do not have their weights updated at all. Only junction advantage subnetworks 14 associated with junctions which took part in the actions included in the minibatch of transitions have their weights updated.
  • the gradient of all updates in shared layers may be reduced by a factor.
  • the factor may be chosen according to the number of junctions directly affected by the relevant actions in the minibatch.
  • the loss matrix as described may be used to derive a single scalar loss for backpropagation.
  • the loss matrix may be sliced “horizontally”, i.e. a set of rows corresponding to one junction may be treated as a loss matrix associated with that particular junction.
  • a scalar loss may be calculated for that junction, and then backpropagated through the shared layers and the subnetwork associated with that junction.
  • there may be either a single backpropagation update for a single loss, or a per-junction backpropagation update for a per-junction loss.
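The per-junction option can be sketched as below: the loss matrix is sliced into blocks of rows, one block per junction, and a scalar loss is produced only for junctions which acted in the minibatch. The row layout, the mean-squared-error reduction and the names are assumptions for illustration.

```python
import numpy as np

def per_junction_losses(loss_matrix, junction_rows, acted_junctions):
    """Slice the loss matrix "horizontally" and compute a scalar loss only for
    junctions whose signals changed in the minibatch (others are skipped so that
    zero error vectors are never backpropagated)."""
    losses = {}
    for junction, rows in junction_rows.items():
        if junction not in acted_junctions:
            continue
        block = loss_matrix[rows, :]                   # rows for this junction, one column per transition
        losses[junction] = float((block ** 2).mean())  # e.g. mean squared error
    return losses

loss_matrix = np.random.randn(7, 32)                   # 7 action rows, minibatch of 32 transitions
junction_rows = {"J1": slice(0, 4), "J2": slice(4, 7)}
print(per_junction_losses(loss_matrix, junction_rows, acted_junctions={"J1"}))
```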
  • a transition is a (state, action, reward, next state) tuple.
  • the “action” could involve changing the traffic signals at one, two, or more of the junctions in the city.
  • An “action” could even be to do nothing at all. Indeed given that a timestep may be for example less than one second (600ms has been found to be a useful interval in one embodiment), in many timesteps doing nothing may well be the best option, or even the only reasonably good option.
  • If the next state is simply defined as the state at the next timestep, then the very short time window is likely to hamper information from the reward function flowing into the updates, since the immediate rewards in every transition will be very small, reflecting that the state is not likely to change that much in a very short amount of time. This will result in slow or inadequate learning.
  • a transition boundary could be defined when a positive action actually takes place i.e. “do nothing” actions do not count, and the transition will run from the timestep when the positive action - signal change - is decided upon to the timestep when the agent decides to take another positive action - another signal change.
  • this strategy will tend to be about the same as simply defining a transition to be one timestep, since with enough junctions, at any particular timestep the traffic signals are probably changing at least somewhere in the city.
  • the agent is not likely to be in direct control of the traffic signals. This is in the sense that the traffic control agent’s actions are to request that a particular signal changes to a particular stage at a particular time. This request is made to a traffic light controller at the relevant junction.
  • the agent is essentially feeding into an external request input which is part of a reasonably standard traffic signal controller.
  • traffic signals are safety critical systems and must be guaranteed to follow certain rules.
  • the traffic light controller will enforce various rules, for example, once the light is green it must remain green for at least a minimum period of time. A request which does not comply with these rules will just be ignored by the traffic light controller.
  • it may in theory be possible for an agent to request an “illegal” action - although such actions are likely to be penalised in training, it is possible that one could still be requested.
  • the agent is designed never to request an illegal action - the actions which the agent can choose from are masked to give the agent the option only to choose an action which will be accepted.
  • This “masking” of available actions may be achieved by extra layers which implement static logic, i.e. they are not in the “learnable” part of the neural network. If the agent is considered to include these extra layers then it is simply not capable of requesting an action which will not be accepted.
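One common way to realise such a static masking step is to force the Q-values of disallowed actions to minus infinity before the action is selected, as in the hypothetical sketch below.

```python
import numpy as np

def mask_q_values(q_values, allowed):
    """Static, non-learnable masking: a disallowed action can never be chosen."""
    return np.where(allowed, q_values, -np.inf)

q = np.array([0.2, 1.5, 0.7])
allowed = np.array([True, False, True])            # e.g. action 1 would break a minimum green time rule
print(int(np.argmax(mask_q_values(q, allowed))))   # 2: the best of the allowed actions
```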
  • An “accepted action”, which may be used to mark a transition boundary, may be defined as any action which led directly to the requested traffic signal configuration.
  • the definition of “accepted action” may be extended to include actions which led to a via stage for the requested configuration.
  • An “accepted action” can be defined in some embodiments to include “do nothing” actions.
  • a “do nothing” action will usually be accepted by a traffic light controller, but possibly not always - a controller is likely to insist on a stage change after a maximum period has elapsed, and if the agent has not requested a stage change within that maximum period then the agent’s request to “do nothing” will be overridden by the traffic light controller.
  • controllers will insist on remaining in a stage for a minimum period, and so after a stage change there will be a length of time in which the agent cannot affect the signals at all. After this minimum stage length the controller will accept a signal change action, but equally will accept a “do nothing” action until the maximum period has elapsed.
  • time boundaries of transitions may be defined for example as:
  • Every timestep with an accepted action at at least one junction represents the start of a transition - this may work well for smaller embodiments, but tends to become in practice very similar to “one timestep is one transition” as the number of junctions in the city increases.
  • the “action” at the timestep may be defined to include all traffic signal actions the agent tried to make, at all junctions, irrespective of whether they were accepted by the relevant traffic light controllers.
  • rejected actions could be replaced by the last- accepted action at the relevant junction.
  • the agent includes masking layers, the problem of rejected actions does not arise.
  • the reward associated with the transition is based on what happens in the traffic network after the action was made, according to the reward function which embodies policy objectives.
  • the reward may be calculated as a result of what happens after the action until:
  • the desired traffic signal stage is reached (this may be multiple timesteps later, and may include the time taken to go through a via stage for the desired stage); or
  • the desired traffic signal stage, or a via stage for the desired stage, is reached (this includes only delays involved in a single transition to a new stage; this may still be multiple timesteps, for example to allow an amber signal to show for the requisite period of time).
  • the reward function may be defined according to policy objectives.
  • a typical reward function will seek to reward traffic travelling at reasonable speeds (but perhaps penalise dangerous speeding), penalise stopped time, especially long stop times which may lead to frustration, and penalise stop-start driving which is liable to lead to high levels of toxic emissions. Further examples of factors which may be taken into account in the reward function may be found in the applicant’s previous application WO2020225523.
  • Transitions are created when an agent is allowed to choose actions, i.e. control traffic lights in a city. In most embodiments, this is done primarily in a simulation. Further details as to how agents may generate transitions / memories in a simulation is found in WO2020225523, which is incorporated herein by reference. In some embodiments, transitions may also be saved when the agent is deployed to control traffic signals in the real world, i.e. in a live traffic control system. This may be done irrespective of whether the agent in the live traffic control system goes through update stages.
  • the agent when running (especially in a simulation), is allowed to explore the policy space. In other words, the agent does not always take its currently best-predicted action. At least sometimes, the agent may “explore” by taking an action which it does not predict will be the best action, in order to “find out what happens” and hence learn new information. However, taking completely random (and often very bad) actions is unlikely to result in good performance, since the traffic in the simulation will then be very badly controlled, and the states of the simulated road network will therefore be (hopefully) unrealistic. Therefore there needs to be a balance, when the agent is learning transitions in a simulation, between exploration and exploitation. The goal is for the agent to gain new knowledge, while still controlling traffic reasonably well.
  • an episodic approach to training is used. I.e. an agent will be allowed to run in a particular simulation with particular starting conditions, until that simulation finishes after a number of time steps.
  • the length of an episode may be for example about 20 minutes in “real time” (but since the training is in a simulation, the training may take place faster than that).
  • multiple copies of the agent may run in multiple copies of the simulation, in parallel.
  • exploration, i.e. the potential for the agent to choose an action other than its predicted best action, may be enabled only for one junction. Good results have been found by cycling through the junctions episode-by-episode.
  • the junctions can be cycled through one decision point at a time.
  • a junction may be chosen at random for exploration at each decision point.
  • an “exploration temperature” e may be defined. Where exploration is allowed, the chance of taking a random action is given by e. Otherwise, the (expected) best action is chosen - with probability (1 - e). The value of e may be decayed, for example exponentially, during the training episode. Therefore as the episode progresses the actions of the agent become in general less random and more likely to be the expected best action.
  • Boltzmann exploration may be used: each time the agent is asked to make a decision, it uses the relative difference in expected reward between the available options, along with another temperature parameter, to choose a weighted random action.
  • the agent could take any action but will be more likely to take better actions.
  • the agent may be for example twice as likely to choose to take an action which is predicted to be twice as good.
  • the temperature parameter, which is increased throughout the training episode, means that the agent is even more likely to choose the “best” action towards the end of the episode.
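A sketch of this kind of weighted random selection is given below. The description above of an action "twice as good" being "twice as likely" corresponds to weights proportional to the predicted values; the exponential (softmax) weighting used here is the conventional Boltzmann form and, like the convention that a larger temperature parameter sharpens the choice towards the best action, is an assumption for illustration.

```python
import numpy as np

def boltzmann_action(q_values, temperature):
    """Weighted random choice in which better actions are more likely; a larger
    'temperature' parameter here sharpens the distribution towards the best action."""
    q = np.asarray(q_values, dtype=float)
    logits = temperature * (q - q.max())              # subtract the max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(np.random.choice(len(q), p=probs))

print(boltzmann_action([0.1, 0.5, 0.4], temperature=2.0))
```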
  • a “noisy nets” approach may be used. This involves applying noise functions to the parameters of the neural network during the action selection phase of the training. Hence the predicted Q values have a random aspect. The agent then picks the action relating to the “best” noisy Q value, but that may not always be the same as the action which would have had the best Q value had noise not been applied.
  • the noise functions consist of parameters - for example a gaussian noise function has mean and variance parameters. These parameters themselves are part of the neural network model and hence are included in the backpropagation in the network update phase. In theory the amount of noise will naturally reduce over update iterations.
  • an episode may be terminated if particular conditions are met.
  • the conditions for termination are chosen to indicate essentially that the traffic conditions have become too bad - e.g. congestion is too great. This may happen quite often at the early stages of training, before the agent has really started to learn a good strategy.
  • although the agent needs to be trained in “difficult” as well as in “easy” scenarios, generally better information will be yielded where agents are trained in scenarios in which they can broadly “succeed”. Transitions yielding useful information will be generated from an already fairly well-trained agent in a difficult scenario, but less useful information is embodied in transitions from an untrained agent making essentially random decisions in a scenario where the traffic situation has already been allowed to become hopeless.
  • early termination of episodes helps to improve the efficiency of the training process.
  • agents are not allowed to take actions at every timestep of the simulation.
  • the simulation may be stepped for example 3 or 4 timesteps before the agent is invited to choose an action. This is found to have beneficial effects. It reduces the number of “stage extension” / “do nothing” actions in the transition database, creating a more balanced set of memories for the agent to learn from.
  • as transitions are generated, they are saved.
  • the transitions may be saved in a database or any other suitable data structure. Transitions may then be sampled from the database in update stages.
  • Generation of the transitions and use of the transitions for updating is asynchronous, i.e. transitions do not have to be used in the order that they are generated, some transitions may never be used at all, and the transitions being used to update an agent were not necessarily generated by the latest version of the agent.
  • options for managing the transition database include:
  • Prioritise transitions based on some evaluation of their potential merit, and remove the least valuable transitions.
  • the merit of a transition within the database can be evaluated in various ways. For a particular model, a transition with a low magnitude loss vector will not result in much learning and therefore may be considered low value. However, it does not necessarily follow that the transition will always be of low value in this sense, for future versions of the model.
  • Model-independent measures of the value of transitions include attaching large value to transitions with (state, next state) pairs which are unusual - intuitively these transitions may relate to actions which have had an unexpected / surprising result and therefore contain new information from which the model can learn.
  • Weighting transitions by total reward is also an option - large positive or large negative rewards associated with transitions indicate transitions which have had a large good or bad effect, and therefore contain useful information.
  • the merit of a particular transition is assessed in the context of its role in the whole database of transitions - in particular it is likely to be desirable to maintain a diverse set of transitions in the database.
  • Transitions are used in “batches” and “minibatches”.
  • a batch is defined as a set of transitions in the database which is used to update the agent from one version which was used to create transitions in the simulation, to a new version which is used to create more transitions in the simulation.
  • a batch update may involve a plurality of loss matrices being calculated, and a loss being backpropagated for each (in some embodiments, there are multiple backpropagations per loss matrix, where the matrix is horizontally sliced and a scalar loss calculated per-junction).
  • a “minibatch” is a subset of a “batch” and is the set of transitions which is used to construct a single loss matrix.
  • a batch of transitions is sampled from the transition database.
  • the transitions could be sampled randomly, with or without replacement in different embodiments.
  • the sampling could be prioritised based on evaluation of expected merit. Again this prioritised sampling could be done with or without replacement. Evaluations of merit at this stage may be done according to the current model (for example, prioritising transitions which will result in high magnitude loss vectors) or using model-independent measures as discussed above.
  • a batch may contain for example 5000-10000 transitions, and is split into multiple minibatches each containing anything from a single transition to a few hundred. Some splitting of batches into minibatches in this way is found to be preferable, but in other embodiments at the extremes, a batch could contain just one minibatch, and hence a single loss matrix will be calculated per update, or a batch could contain as many minibatches as there are transitions, i.e. one transition per minibatch. In that case a loss matrix (of one column) would be generated for every transition.
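A minimal sketch of sampling a batch and splitting it into minibatches is given below; uniform sampling without replacement is used here purely for illustration, and prioritised sampling is equally possible as described above.

```python
import random

def sample_minibatches(transition_store, batch_size=5000, minibatch_size=128):
    """Sample a batch of stored transitions and split it into minibatches,
    each of which will give rise to one loss matrix."""
    batch = random.sample(transition_store, min(batch_size, len(transition_store)))
    return [batch[i:i + minibatch_size] for i in range(0, len(batch), minibatch_size)]
```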

Abstract

Disclosed is a traffic control system which uses a machine learning agent trained by reinforcement learning. The machine learning agent takes the form of a neural network having a branched duelling architecture. Global value layers provide a representation of the value of a state of the road network, and each junction in the branched architecture is provided with junction advantage layers, i.e. a separate branch for each junction. The output of the advantage layers represents the advantage associated with specific actions at each junction.
PCT/GB2022/051240 2021-06-11 2022-05-17 Traffic control system WO2022258943A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB2108352.2 2021-06-11
GB2108352.2A GB2607880A (en) 2021-06-11 2021-06-11 Traffic control system

Publications (1)

Publication Number Publication Date
WO2022258943A1 (fr)

Family

ID=76954510

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2022/051240 WO2022258943A1 (fr) Traffic control system

Country Status (2)

Country Link
GB (1) GB2607880A (fr)
WO (1) WO2022258943A1 (fr)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115294784B (zh) * 2022-06-21 2024-05-14 Institute of Automation, Chinese Academy of Sciences Multi-intersection traffic signal light control method and apparatus, electronic device and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018051200A1 (fr) 2016-09-15 2018-03-22 Vivacity Labs Limited Method and system for analysing the movement of bodies in a traffic system
WO2020225523A1 (fr) 2019-05-08 2020-11-12 Vivacity Labs Limited Traffic control system
CN110570672A (zh) * 2019-09-18 2019-12-13 Zhejiang University Regional traffic signal light control method based on a graph neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Sung Joo Park et al., "A hierarchical neural network approach to intelligent traffic control", Proceedings of the International Conference on Neural Networks (ICNN) / World Congress on Computational Intelligence, Orlando, 27-29 June 1994, IEEE, New York, vol. 5, pp. 3358-3362, XP000532726, ISBN 978-0-7803-1902-8 *
Wu Tong et al., "Multi-agent deep reinforcement learning for urban traffic light control in vehicular networks", IEEE Transactions on Vehicular Technology, vol. 69, no. 8, 28 May 2020, pp. 8243-8256, XP011804373, ISSN 0018-9545, [retrieved on 2020-08-13], DOI: 10.1109/TVT.2020.2997896 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116994444A (zh) * 2023-09-26 2023-11-03 Nanjing University of Posts and Telecommunications Traffic light control method and system, and storage medium
CN116994444B (zh) * 2023-09-26 2023-12-12 Nanjing University of Posts and Telecommunications Traffic light control method and system, and storage medium

Also Published As

Publication number Publication date
GB202108352D0 (en) 2021-07-28
GB2607880A (en) 2022-12-21

Similar Documents

Publication Publication Date Title
Wei et al. Recent advances in reinforcement learning for traffic signal control: A survey of models and evaluation
Wang et al. Adaptive Traffic Signal Control for large-scale scenario with Cooperative Group-based Multi-agent reinforcement learning
Jin et al. A group-based traffic signal control with adaptive learning ability
Abdulhai et al. Reinforcement learning for true adaptive traffic signal control
WO2022121510A1 (fr) Traffic signal control method and system based on stochastic policy gradients, and electronic device
Calvo et al. Heterogeneous Multi-Agent Deep Reinforcement Learning for Traffic Lights Control.
Jin et al. Hierarchical multi-agent control of traffic lights based on collective learning
WO2022258943A1 (fr) Traffic control system
US11783702B2 (en) Method and system for adaptive cycle-level traffic signal control
Prothmann et al. Organic control of traffic lights
Chin et al. Q-learning based traffic optimization in management of signal timing plan
de Oliveira et al. Reinforcement Learning based Control of Traffic Lights in Non-stationary Environments: A Case Study in a Microscopic Simulator.
WO2021051930A1 (fr) Signal adjustment method and apparatus based on action prediction model, and computing device
Rizzo et al. Time critic policy gradient methods for traffic signal control in complex and congested scenarios
US11893886B2 (en) Traffic control system
Sahu et al. Traffic light cycle control using deep reinforcement technique
Long et al. Deep reinforcement learning for transit signal priority in a connected environment
Ivanjko et al. Ramp metering control based on the Q-learning algorithm
KR102329826B1 (ko) Artificial intelligence-based traffic signal control apparatus and method
Yin et al. Recursive least-squares temporal difference learning for adaptive traffic signal control at intersection
Płaczek A traffic model based on fuzzy cellular automata
CN114333361A (zh) Signal light timing method and apparatus
Khamis et al. Adaptive traffic control system based on Bayesian probability interpretation
Shamsi et al. Reinforcement learning for traffic light control with emphasis on emergency vehicles
Miletić et al. State complexity reduction in reinforcement learning based adaptive traffic signal control

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22725922

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE