CN114141033A - Traffic light cooperation control method, device, equipment and computer readable storage medium - Google Patents

Traffic light cooperation control method, device, equipment and computer readable storage medium Download PDF

Info

Publication number
CN114141033A
CN114141033A
Authority
CN
China
Prior art keywords
preset
signal lamp
reinforcement learning
action
state set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111321571.1A
Other languages
Chinese (zh)
Inventor
余剑峤
高嘉时
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southwest University of Science and Technology
Original Assignee
Southwest University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwest University of Science and Technology filed Critical Southwest University of Science and Technology
Priority to CN202111321571.1A priority Critical patent/CN114141033A/en
Publication of CN114141033A publication Critical patent/CN114141033A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G08SIGNALLING
    • G08GTRAFFIC CONTROL SYSTEMS
    • G08G1/00Traffic control systems for road vehicles
    • G08G1/07Controlling traffic signals
    • G08G1/081Plural intersections under common control
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G08SIGNALLING
    • G08GTRAFFIC CONTROL SYSTEMS
    • G08G1/00Traffic control systems for road vehicles
    • G08G1/09Arrangements for giving variable traffic instructions
    • G08G1/095Traffic lights

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Traffic Control Systems (AREA)

Abstract

The invention discloses a traffic light cooperation control method, device, equipment and computer-readable storage medium. The method inputs the states of the traffic lights into a reinforcement learning model for reinforcement learning; after multiple rounds of training with the training algorithm, it can achieve good performance on traffic networks of various scales, especially on large-scale traffic networks.

Description

Traffic light cooperation control method, device, equipment and computer readable storage medium
Technical Field
The invention relates to the technical field of reinforcement learning, and in particular to a traffic light cooperation control method, device, equipment and computer-readable storage medium.
Background
At present, traffic congestion in urban areas causes serious problems, such as long waiting times, high fuel consumption, and increased emission of harmful gases. One of the effective ways to address congestion is to control the traffic lights more intelligently. Since the control strategies of the signal lights at neighboring intersections are highly interdependent, cooperative control of traffic signals over a large area is crucial.
Reinforcement Learning (RL) techniques are an effective means of cooperatively controlling traffic signals. In earlier approaches, information and parameters were not shared between individual traffic lights; instead, each light updated its own control network independently. Such a distributed control method causes a contradiction between the individual optimal strategy and the global optimal strategy, which in turn leads to non-stationarity of strategy convergence. Centralized RL solves the problem of conflict among individual strategies by generating a control strategy from the joint state of multiple signal lamps. However, the joint state space of all individuals is high-dimensional, which imposes a computational burden in both time and space. Another advanced RL method, CoLight, uses a locally designed attention network to select neighbors that participate in the decision of the target signal light. This network avoids the dimensionality problem, but the neighboring individuals used for communication are predefined and fixed, making it difficult to adapt to a dynamically changing traffic environment.
Disclosure of Invention
The present invention is directed to solving at least one of the problems of the prior art. Therefore, the invention provides a traffic light cooperative control method that achieves good performance on traffic networks of various scales, particularly on large-scale traffic networks.
In a first aspect, an embodiment of the present invention provides a traffic light cooperation control method, including:
acquiring the states of traffic lights in a preset area to obtain a signal light state set;
taking the signal lamp state set as the input of a preset reinforcement learning model, and taking a preset signal lamp switching state set as the output of the preset reinforcement learning model to obtain an action state value function;
and training the action state value function by adopting a preset training algorithm to obtain a signal lamp control strategy.
The traffic light cooperation control method provided by the embodiment of the invention has at least the following beneficial effects: by acquiring the states of the traffic lights in the preset area, the states of all traffic lights in the preset area can be obtained, so that a joint state over multiple traffic lights is realized and the problem of conflict among individual strategies is solved; the signal light state set is used as the input of the preset reinforcement learning model and the preset signal light switching state set as its output, so that the action state value function can be obtained from the signal light state set and the preset signal light switching state set; and the action state value function is substituted into the preset training algorithm to obtain an approximately globally optimal strategy, thereby improving the performance of traffic light cooperative control.
According to another embodiment of the traffic light cooperation control method, the using the set of signal light states as an input of a preset reinforcement learning model, and using the set of preset signal light switching states as an output of the preset reinforcement learning model includes:
taking the signal lamp state set as the input of a preset reinforcement learning model to obtain a characteristic vector;
obtaining a cooperation vector according to a preset local attention model and the feature vector;
and obtaining the preset signal lamp switching state set according to the characteristic vector and the cooperation vector.
According to the traffic light cooperation control method of another embodiment of the invention, using the signal light state set as the input of a preset reinforcement learning model and using the preset signal light switching state set as the output of the preset reinforcement learning model to obtain an action state value function includes:
taking the signal lamp state set as the input of the preset reinforcement learning model;
taking a preset signal lamp switching state set as the output of the preset reinforcement learning model to obtain a reward value and a reward attenuation coefficient;
and obtaining the action state value function according to the reward value, the reward attenuation coefficient, the preset signal lamp switching state set and the signal lamp state set.
According to the traffic light cooperative control method of other embodiments of the present invention, the preset training algorithm includes: a gradient descent training algorithm.
According to another embodiment of the traffic light cooperative control method, the training of the action state value function by using a preset training algorithm to obtain a signal light control strategy includes:
training the action state value function by adopting a gradient descent training algorithm to obtain a value parameter;
and substituting the value parameters into the action state value function, and determining a preset signal lamp switching state set as a signal lamp control strategy.
According to the traffic light cooperative control method of another embodiment of the present invention, the training of the action state value function by using a preset training algorithm to obtain a signal light control strategy further includes:
and updating the value parameters of the action state value function periodically according to a preset updating mode.
According to the traffic light cooperative control method of another embodiment of the present invention, the training of the action state value function by using a preset training algorithm to obtain a signal light control strategy further includes:
substituting the signal lamp control strategy into the action state value function to obtain a predicted action value;
acquiring a signal lamp controlled according to a signal control strategy to obtain an actual target action value;
and determining a loss function according to the predicted action value and the target action value.
In a second aspect, an embodiment of the present invention provides a traffic light cooperation control apparatus: the method comprises the following steps:
the state acquisition module is used for acquiring the states of the traffic lights in the preset area to obtain a signal light state set;
the reinforcement learning module is used for taking the signal lamp state set as the input of a preset reinforcement learning model and taking a preset signal lamp switching state set as the output of the preset reinforcement learning model to obtain an action state value function;
and the training module is used for training the action state value function by adopting a preset training algorithm to obtain a signal lamp control strategy.
The traffic light cooperation control device provided by the embodiment of the invention has at least the following beneficial effects: the state acquisition module acquires the states of the traffic lights in the preset area, so the states of all traffic lights in the area can be obtained and the completeness of the collected state information is guaranteed; the reinforcement learning module uses the signal light state set as the input of the preset reinforcement learning model and the preset signal light switching state set as its output, and can also obtain an action state value function from the signal light state set and the preset signal light switching state set; and the training module inputs the action state value function into the preset training algorithm to obtain an approximately globally optimal strategy, thereby improving the performance of traffic light cooperative control.
In a third aspect, an embodiment of the present invention provides a traffic-light cooperative control apparatus: the method comprises the following steps:
at least one processor, and,
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the traffic light cooperation control method according to the first aspect.
In a fourth aspect, the embodiment of the present invention further provides a computer-readable storage medium, where computer-executable instructions are stored, and the computer-executable instructions are configured to cause a computer to execute the traffic light cooperation control method according to the first aspect.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and drawings.
Drawings
FIG. 1 is a flow chart illustrating a method for cooperative control of traffic lights according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating an embodiment of step S200 of FIG. 1;
FIG. 3 is a schematic flow chart of another embodiment of step S200 in FIG. 1;
FIG. 4 is a flowchart illustrating an embodiment of step S300 of FIG. 1;
FIG. 5 is a schematic flow chart of another embodiment of step S300 in FIG. 1;
FIG. 6 is a schematic flow chart illustrating another embodiment of step S300 in FIG. 1;
fig. 7 is a block diagram of a traffic light cooperative control apparatus according to an embodiment of the present invention.
Description of the drawings:
the system comprises a state acquisition module 100, a reinforcement learning module 200 and a training module 300.
Detailed Description
The concept and technical effects of the present invention will be clearly and completely described below in conjunction with the embodiments, so that the objects, features and effects of the present invention can be fully understood. It is obvious that the described embodiments are only a part of the embodiments of the present invention, rather than all of them; based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without inventive effort fall within the protection scope of the present invention.
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
It should be noted that although functional block divisions are provided in the system drawings and logical orders are shown in the flowcharts, in some cases, the steps shown and described may be performed in different orders than the block divisions in the systems or in the flowcharts.
In the description of the present invention, unless otherwise explicitly limited, terms such as arrangement, installation, connection and the like should be understood in a broad sense, and those skilled in the art can reasonably determine the specific meanings of the above terms in the present invention in combination with the specific contents of the technical solutions.
First, several terms referred to in the present application are resolved:
sigmoid activation function: the sigmoid function is an S-shaped curve commonly seen in biology, also called the sigmoid growth curve. In information science, because it is monotonically increasing and its inverse function is also monotonically increasing, the sigmoid function is often used as the activation function of a neural network, mapping variables into the interval [0,1].
Local attention model: the local attention model comprises a local selection module and an attention module. The local selection module is constructed from a local selection mechanism, and the attention module is constructed from an attention mechanism; the attention mechanism simulates the way humans observe. Generally, when we observe a scene, we first take in the whole scene; but when we need to understand a certain target in depth, we focus on that target and may even move closer to examine its texture carefully. Similarly, in deep learning, the extracted information by default flows backwards with equal importance; if some prior information is known, the flow of invalid information can be suppressed according to that prior, so that the important information is retained.
softmax function: the softmax (logistic regression) model is a generalization of the logistic regression model to multi-class problems, in which the class label y can take more than two values. After the information passes through softmax, the probability of each logit is obtained, which can be regarded as the importance the model assigns to each logit; the larger the probability, the more useful that information is to the model.
Activation function: a function that runs on the neurons of an artificial neural network and is responsible for mapping the input of a neuron to its output. Activation functions play an important role in enabling an artificial neural network model to learn and understand very complex, non-linear functions; introducing them into the network adds non-linear characteristics and increases the non-linearity of the neural network model. The non-linear activation function tanh is one of the hyperbolic functions; tanh() is the hyperbolic tangent. In mathematics, the hyperbolic tangent tanh is derived from the basic hyperbolic functions, hyperbolic sine and hyperbolic cosine, with the formula tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x)).
Bellman equation: also called the dynamic programming equation, it is a necessary condition for optimality in the dynamic programming method of mathematical optimization. The equation expresses the value of the decision problem at a particular time in terms of the reward from the initial choice plus the value of the remaining decision problem that results from that initial choice. In this way, a dynamic optimization problem is broken into simpler sub-problems that obey Bellman's principle of optimality.
Referring to fig. 1, a flow chart of a traffic light cooperation control method in an embodiment of the invention is shown. The embodiment discloses a traffic light cooperation control method, which specifically includes, but is not limited to, steps S100 to S300.
Step S100, acquiring the states of traffic lights in a preset area to obtain a signal light state set;
firstly, the area to be predicted is set as the preset area; then the states of all traffic lights are obtained according to the acquired state of each traffic light in the preset area; finally, the signal light state set is established from the obtained traffic light states.
Specifically, the signal lamp state set is S_t = {s_t^1, s_t^2, …, s_t^N}, where N is the number of intersections containing traffic lights and s_t^i is the state of the i-th traffic light at time step t. The state of a traffic light comprises the signal phase of the current traffic light and the number of vehicles on each lane connected to the intersection of that traffic light. The signal phase of a traffic light refers to the display state of the signal group corresponding to one or more traffic flows that obtain the right of way at the same time; specifically, within one signal cycle, one or more traffic flows obtain exactly the same signal light color at any moment, and the continuous time sequence over which the different light colors (green, yellow and all-red) are obtained is called a signal phase. Each signal phase periodically and alternately obtains the green display, i.e. the right of way through the intersection; each transition of the right of way is called a signal phase stage, and one signal cycle is the sum of all the preset phase time segments.
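For illustration only, a state vector s_t^i of this kind could be assembled as in the following Python sketch; the helper name and the exact feature layout (a one-hot signal phase followed by per-lane vehicle counts) are assumptions, since the application only specifies which quantities the state contains:

```python
import numpy as np

def build_intersection_state(phase_index, num_phases, lane_vehicle_counts):
    """Assemble s_t^i for intersection i: one-hot current signal phase followed by
    the vehicle count on each lane connected to the intersection (layout assumed)."""
    phase_onehot = np.zeros(num_phases)
    phase_onehot[phase_index] = 1.0
    return np.concatenate([phase_onehot, np.asarray(lane_vehicle_counts, dtype=float)])

# e.g. 4 phases, current phase index 2, vehicle counts on 8 connected lanes
s_i_t = build_intersection_state(2, 4, [3, 0, 5, 1, 2, 7, 0, 4])
```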
Step S200, using the signal lamp state set as the input of a preset reinforcement learning model, and using the preset signal lamp switching state set as the output of the preset reinforcement learning model to obtain an action state value function;
in step S200, the signal lamp state set is input into the preset reinforcement learning model; after reinforcement learning, the preset signal lamp switching state set can be obtained and is output by the preset reinforcement learning model, and the action state value function can then be constructed from the signal lamp state set and the preset signal lamp switching state set.
And step S300, training the action state value function by adopting a preset training algorithm to obtain a signal lamp control strategy.
Step S100 and step S200 are repeated through the preset training algorithm to obtain multiple groups of signal lamp state sets and multiple groups of preset signal lamp switching state sets, and thereby multiple action state value functions; the action state value functions are then trained multiple times, and the signal lamp control strategy is obtained when the maximum number of training iterations is reached.
Referring to fig. 2, a flow chart of a traffic light cooperation control method in an embodiment of the invention is shown. The embodiment discloses a traffic light cooperation control method, which specifically includes, but is not limited to, steps S210 to S230.
Step S210, using the signal lamp state set as the input of a preset reinforcement learning model to obtain a characteristic vector;
in step S210, the signal lamp state set is input to the preset reinforcement learning model, and each signal lamp state in the signal lamp state set passes through a feature coding layer of the preset reinforcement learning model, where the feature coding layer is a first layer of the preset reinforcement learning model, and the feature coding layer performs feature extraction on each signal lamp state and can obtain a feature vector corresponding to each signal lamp state.
It should be noted that the preset reinforcement learning model is named the Attn-CommNet model. The first layer of the Attn-CommNet model is set as a feature coding layer in the form of a single-layer neural network with a sigmoid activation function; it receives the signal lamp state set, encodes it, and finally outputs the feature vectors to the f_i^j modules. The feature coding layer correspondingly generates and outputs one feature vector per input signal lamp state. For example: given the input {s_1, s_2, …, s_i, …, s_N}, with N input signal lamp states in total, N feature vectors {h_1^0, h_2^0, …, h_i^0, …, h_N^0} are produced, where h_1^0 is generated from signal lamp state s_1 and the other feature vectors are generated from the signal lamp states with the same subscript. The generated feature vectors {h_1^0, h_2^0, …, h_i^0, …, h_N^0} are then output to the corresponding f_i^j modules, i.e. h_1^0 is output to the f_1^0 module, and the other feature vectors are likewise output to the f_i^j module with the same subscript and superscript. Feature coding here refers to the process of encoding the states into feature vectors.
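A minimal sketch of such a feature coding layer, written in PyTorch purely for illustration (the layer sizes and class name are assumptions), maps each signal lamp state s_i to its feature vector h_i^0 with a single linear layer followed by a sigmoid activation:

```python
import torch
import torch.nn as nn

class FeatureEncoder(nn.Module):
    """Single-layer feature coding layer with sigmoid activation: s_i -> h_i^0."""
    def __init__(self, state_dim, feature_dim):
        super().__init__()
        self.linear = nn.Linear(state_dim, feature_dim)

    def forward(self, states):
        # states: (N, state_dim), one row per intersection
        return torch.sigmoid(self.linear(states))   # (N, feature_dim) = {h_1^0, ..., h_N^0}
```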
Step S220, obtaining a cooperation vector according to a preset local attention model and a feature vector;
in step S220, after each feature vector h_i^j is output to the preset local attention model, the feature vector first passes through the local selection module in the local attention model. The local selection module performs local selection on the feature vector h_i^j and finds the feature vectors of the N_i adjacent states; the feature vectors of the N_i adjacent states are then input into the attention module in the local attention model, and the attention module obtains the cooperation vector c_i^j corresponding to each feature vector h_i^j.
It should be noted that outputting the feature vector to the preset local attention model means outputting the feature vector h_i^j to the local attention model; the cooperation vector c_i^j is obtained by calculation and output to the corresponding f_i^j module. That is, the local attention model is a model comprising a local selection module and an attention module, and the second layer of the preset reinforcement learning model and the layers after it are constructed from the local attention model. The feature vector h_i^j, the cooperation vector c_i^j and the f_i^j module are superscripted with the corresponding layer number; when j = 0 they belong to the 0-th layer, i.e. {f_1^0, f_2^0, …, f_i^0, …, f_N^0} obtained after the signal lamp states pass through the feature coding layer.
First, the feature vector h_i^j is output to the local selection module, which, from the concatenated vectors {h_1, h_2, …, h_N} of all individuals, selects for each individual i the adjacent states h_k with k ∈ N_i. After local selection, the attention module takes the concatenated vectors of the neighboring individuals as input. To obtain the degree to which adjacent individuals influence the strategy of a given individual, we first define in the attention module two weight matrices W_T and W_N. The degree of influence e_{i,k} of an individual k on the strategy of individual i is computed by the formula e_{i,k} = (h_i W_T)(h_k W_N)^T. After that, the degree of influence of individual k on the strategy of individual i is normalized by the softmax function, giving the normalized degree of influence a_{i,k} = softmax(e_{i,k}) = exp(e_{i,k}/τ) / Σ_{k∈N_i} exp(e_{i,k}/τ). From the obtained normalized degree of influence a_{i,k} and the feature vectors of the individuals k, the corresponding cooperation vector c_i^j is computed by the formula c_i^j = Σ_{k∈N_i} a_{i,k} · h_k^j. Here, N_i in the formula for the degree of influence e_{i,k} is the set of individuals adjacent to individual i, determined by geographical distance, and τ in the normalization formula is a constant coefficient.
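The formulas above can be sketched as follows; this is an illustrative implementation only, with the tensor shapes and function name assumed:

```python
import torch
import torch.nn.functional as F

def local_attention(h, neighbor_ids, W_T, W_N, tau=1.0):
    """Cooperation vectors c_i^j for all individuals (a sketch of the formulas above).

    h            : (N, d) feature vectors h_1^j ... h_N^j of all individuals
    neighbor_ids : neighbor_ids[i] is the index list N_i of individuals adjacent to i
                   (determined by geographical distance)
    W_T, W_N     : (d, d_k) weight matrices
    tau          : constant temperature coefficient of the softmax
    """
    c = torch.zeros_like(h)
    for i in range(h.shape[0]):
        q_i = h[i] @ W_T                         # (d_k,)
        k_n = h[neighbor_ids[i]] @ W_N           # (|N_i|, d_k)
        e_ik = k_n @ q_i                         # e_{i,k} = (h_i W_T)(h_k W_N)^T
        a_ik = F.softmax(e_ik / tau, dim=0)      # normalized influence a_{i,k}
        c[i] = a_ik @ h[neighbor_ids[i]]         # c_i^j = sum_{k in N_i} a_{i,k} * h_k^j
    return c
```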
And step S230, obtaining a preset signal lamp switching state set according to the characteristic vector and the cooperation vector.
The feature vector h_i^j and the cooperation vector c_i^j are input into the f_i^j module, which yields the feature vector h_i^{j+1} of the next layer; from h_i^{j+1} and the local attention module of the next layer, the corresponding cooperation vector c_i^{j+1} can be obtained. Steps S220 and S230 are executed cyclically until the f_i^j modules of the last layer of the model are reached. In the last layer of the model, a decoding layer with softmax as the activation function outputs the action value distribution over the action space; gathering these action value distributions into a set gives the preset signal lamp switching state set.
In addition, the f_i^j module is a linear neural network with the non-linear activation function tanh, and can be expressed by the formula h_i^{j+1} = f_i^j(h_i^j, c_i^j) = tanh(H_j · h_i^j + C_j · c_i^j), where H_j and C_j are the corresponding coefficient matrices.
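As an illustrative sketch of the f_i^j module and of the final decoding layer (the feature dimension and number of phases below are assumed values):

```python
import torch
import torch.nn as nn

class CommLayer(nn.Module):
    """f^j module: h_i^{j+1} = tanh(H_j * h_i^j + C_j * c_i^j)."""
    def __init__(self, dim):
        super().__init__()
        self.H = nn.Linear(dim, dim, bias=False)   # coefficient matrix H_j
        self.C = nn.Linear(dim, dim, bias=False)   # coefficient matrix C_j

    def forward(self, h, c):
        return torch.tanh(self.H(h) + self.C(c))

# last layer: decoding layer with softmax, outputting the action value distribution
decoder = nn.Sequential(nn.Linear(64, 4), nn.Softmax(dim=-1))   # 64-dim features, 4 phases assumed
```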
Referring to fig. 3, a flow chart of a traffic light cooperation control method in an embodiment of the invention is shown. The embodiment discloses a traffic light cooperation control method, which specifically includes, but is not limited to, steps S240 to S260.
Step S240, using the signal lamp state set as the input of a preset reinforcement learning model;
in step S240, the signal lamp state set is input to the preset reinforcement learning model, and after steps S210, S220, and S230, a preset signal lamp switching state set can be obtained.
Step S250, a preset signal lamp switching state set is used as the output of a preset reinforcement learning model to obtain a reward value and a reward attenuation coefficient;
in step S250, the preset signal lamp switching state set is output, and the signal lamp state set and the preset signal lamp switching state set are substituted into the corresponding formula for numerical computation, so as to obtain the corresponding reward value R(S_k, A_k) and the reward attenuation coefficient γ.
It should be noted that substituting the signal lamp state set and the preset signal lamp switching state set into the corresponding formula for numerical computation means using the average queue length on each lane of the intersection as the reward function. The reward attenuation coefficient γ reflects the fact that the corresponding reward value R(S_k, A_k) carries an uncertainty factor, and this uncertainty causes the corresponding reward value R(S_k, A_k) to be attenuated.
And step S260, obtaining an action state value function according to the reward value, the reward attenuation coefficient, the preset signal lamp switching state set and the signal lamp state set.
It should be noted that in the traffic light cooperative control problem the traffic flow is random, cannot be modeled exactly, and can be regarded as a stochastic process. At the same time, the conditional probability distribution of the future state depends only on the current state. Therefore, we model the cooperative traffic light control problem as an MDP, where MDP refers to a Markov decision process described by a tuple (S, A, P, R, γ), in which S is a finite state set, A is a finite action set, P is the state transition probability, R is the return function, and γ is the discount factor used to compute the accumulated return. In addition, the goal of reinforcement learning is, for a given MDP scenario, to find a strategy such that the action of the agent in each state is optimal and the expected total return is maximal. The MDP constructed in the present application is defined by the five components <S, A, P, R, γ>, where S is the state space, A is the space of all available actions, P is the state transition probability, R is the reward, and γ is the coefficient by which the reward decays over time. They are introduced as follows:
The state space, i.e. the signal lamp state set, is defined as S_t = {s_t^1, s_t^2, …, s_t^N}, where N is the number of intersections containing traffic lights and s_t^i is the state of the i-th traffic light at time step t. The state of a traffic light comprises the signal phase of the current traffic light and the number of vehicles on each lane connected to the intersection of that traffic light, where the signal phase of a traffic light refers to the display state of the signal group corresponding to one or more traffic flows that obtain the right of way at the same time; specifically, within one signal cycle, one or more traffic flows obtain exactly the same signal light color at any moment, and the continuous time sequence over which the different light colors (green, yellow and all-red) are obtained is called a signal phase. Each signal phase periodically and alternately obtains the green display, i.e. the right of way through the intersection; each transition of the right of way is called a signal phase stage, and one signal cycle is the sum of all the preset phase time segments.
The action space includes all feasible actions; the action space is also the preset signal lamp switching state set. In the present application, the action space of the cooperative traffic light control problem is the complete set of all feasible switching states, which can be denoted A_t = {phase_1, phase_2, …, phase_N}, where N is the total number of switchable states. At each time step, each individual selects a state from the action space as its action, i.e. as its phase at the next instant; the action of individual i at time step t is defined as a_t^i ∈ A_t. An individual may refer to an object moving on the road, such as a person, an automobile, or an electric vehicle, and is not particularly limited in this application. A sketch of the phase selection step is given below.
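As an illustration of how an individual i might pick its next phase from the action value distribution produced by the decoding layer (the exploration strategy is not specified in this application, so a simple greedy choice is assumed here):

```python
import torch

def select_phase(action_values):
    """Pick a_t^i: the switchable phase with the highest action value for individual i."""
    return int(torch.argmax(action_values).item())

# e.g. an intersection with 4 switchable phases
a_i_t = select_phase(torch.tensor([0.1, 0.5, 0.3, 0.1]))   # -> 1
```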
The reward space: our goal is to minimize the travel time of all vehicles in the area, which is difficult to optimize directly. Therefore, we use the average queue length on each lane of the intersection as the reward function. The reward space can likewise be defined as R_t = {r_t^1, r_t^2, …, r_t^N}.
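A minimal sketch of such a reward, with the sign convention assumed (the application only states that the average queue length is used; the negation below is an assumption so that shorter queues yield larger rewards):

```python
def queue_reward(lane_queue_lengths):
    """Reward r_t^i for intersection i: (negated) average queue length over its lanes.
    The negative sign is an assumption, not stated in the application."""
    return -sum(lane_queue_lengths) / len(lane_queue_lengths)

# e.g. queues of 4, 0, 6 and 2 vehicles on the four lanes of an intersection
r_i_t = queue_reward([4, 0, 6, 2])   # -> -3.0
```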
A policy function parameterized by θ is defined by the formula π_θ(S, A) = p(A | S, θ), and π_θ is found from the signal lamp state set and the signal lamp switching state set, where p is an action probability distribution conditioned on the state S and the parameter θ. The policy is evaluated by the accumulated expected reward, and the action state value function Q is derived by the formula Q(S_t, A_t; π_θ) = E_{π_θ}[ Σ_{k=t}^{T} γ^(k−t) R(S_k, A_k) ].
It should be noted that Q is called the action state value function and represents the accumulated expected reward obtained by following the policy, where E denotes the mathematical expectation and T is the total number of time steps of the whole evaluation period.
Referring to fig. 4, a flow chart of a traffic light cooperation control method in an embodiment of the invention is shown. The embodiment discloses a traffic light cooperation control method, which specifically includes, but is not limited to, steps S310 to S320.
Step S310, training the motion state value function by adopting a gradient descent training algorithm to obtain a value parameter;
the value parameter omega' can be obtained by training the action state value function Q through a gradient descent training algorithm. The optimization idea of the gradient descent algorithm is to use the direction of the negative gradient at the current position as the search direction, the direction is taken as the fastest descent direction at the current position, and the closer the gradient descent is to the target value, the smaller the variation is.
And step S320, substituting the value parameters into the action state value function, and determining a preset signal lamp switching state set as a signal lamp control strategy.
Determining the preset signal lamp switching state set as the signal lamp control strategy means obtaining the optimal strategy within the preset signal lamp switching state set, and the optimal strategy is obtained through interaction with the environment. Therefore, we cannot directly obtain the optimal solution of the MDP from the Bellman equation. To solve this problem, we use two model-free Q networks, Q_ω and Q_ω', with parameters ω and ω', and obtain the optimal strategy by interacting with the environment. According to the Bellman equation, it can be defined as Q_ω'(S_t, A_t) = R_t + γ max Q_ω'(S_{t+1}, A_{t+1}): the current action state value depends on the reward value R brought by the current action and on the action state value at the next moment, i.e. max Q_ω'(S_{t+1}, A_{t+1}) in the formula.
It should be noted that the Q networks Q_ω and Q_ω' are initialized randomly with the initial weights ω and ω', the sample buffer R is initialized (initially an empty set), actions are performed, the reward r is obtained, and the state transition vector <s_t, a_t, s_{t+1}, r_t> is stored in the sample buffer R; N state transition vectors are then sampled from the state buffer as training samples, where the state buffer is used for storing the state transition vectors.
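A minimal sketch of such a sample buffer (capacity and names are assumptions):

```python
import random
from collections import deque

class ReplayBuffer:
    """Sample buffer R holding state transition vectors <s_t, a_t, s_{t+1}, r_t>."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)    # starts out empty

    def store(self, s_t, a_t, s_next, r_t):
        self.buffer.append((s_t, a_t, s_next, r_t))

    def sample(self, n):
        return random.sample(self.buffer, n)    # n transition vectors as training samples
```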
Referring to fig. 5, a flow chart of a traffic light cooperation control method in an embodiment of the invention is shown. The embodiment discloses a traffic light cooperation control method, which specifically includes, but is not limited to, step S330.
Step S330, periodically updating the value parameters of the action state value function according to a preset updating mode.
Q_ω'(S_t, A_t) is periodically updated according to the preset updating mode, where ω' = τω + (1 − τ)ω', thereby improving learning stability.
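For illustration, the periodic update ω' = τω + (1 − τ)ω' can be sketched as a soft copy of the trained parameters into the target network:

```python
import torch

def soft_update(q_net, target_net, tau):
    """Periodic target update: omega' = tau * omega + (1 - tau) * omega'."""
    with torch.no_grad():
        for w, w_target in zip(q_net.parameters(), target_net.parameters()):
            w_target.mul_(1.0 - tau).add_(tau * w)
```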
Referring to fig. 6, a flow chart of a traffic light cooperation control method in an embodiment of the invention is shown. The embodiment discloses a traffic light cooperation control method, which specifically includes, but is not limited to, steps S340 to S360.
Step S340, substituting the signal control strategy into the action state value function to obtain a predicted action value;
the predicted action value Q_ω(S_t, A_t) can be obtained by substituting the signal control strategy obtained in step S320 into the action state value function and performing the numerical computation.
Step S350, acquiring a signal lamp controlled according to a signal control strategy to obtain an actual target action value;
the signal lamp is controlled according to the signal control strategy obtained in step S320, so that the actual target action value Q_ω'(S_t, A_t) can be obtained.
Step S360, determining a loss function according to the predicted action value and the target action value.
The loss function is determined from the predicted action value Q_ω(S_t, A_t) of step S340 and the target action value Q_ω'(S_t, A_t) of step S350; the loss function minimizes the squared difference between the target action value and the predicted action value and is defined as L = E[(Q_ω'(S_t, A_t) − Q_ω(S_t, A_t))^2].
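A minimal sketch of how this loss could be computed and then minimized by one gradient-descent step on the value parameters ω (the networks are assumed to output one action value per switchable phase; all names, and γ = 0.9, are illustrative):

```python
import torch
import torch.nn.functional as F

def td_loss(q_net, target_net, batch, gamma):
    """L = E[(Q_omega'(S_t, A_t) - Q_omega(S_t, A_t))^2], where the target is
    Q_omega'(S_t, A_t) = R_t + gamma * max_a Q_omega'(S_{t+1}, a)."""
    s_t, a_t, s_next, r_t = batch                                   # a_t: (B,) long tensor of chosen phases
    q_pred = q_net(s_t).gather(1, a_t.unsqueeze(1)).squeeze(1)      # Q_omega(S_t, A_t)
    with torch.no_grad():
        q_target = r_t + gamma * target_net(s_next).max(dim=1).values
    return F.mse_loss(q_pred, q_target)

# one gradient-descent step on omega:
# loss = td_loss(q_net, target_net, batch, gamma=0.9)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```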
It should be noted that the loss function compares the minimized target action value with the predicted action value; by comparing the predicted action values of all predicted actions with the minimized target action value, a comparison result corresponding to each predicted action can be obtained, and the optimal predicted action can be obtained according to the magnitude of the comparison results. The optimal predicted action is the traffic light control strategy. Here, all the predicted actions include the predicted action of each signalized road in the preset area, and the parameters of the minimized target action value are continuously updated according to step S330.
Referring to fig. 7, another embodiment of the present invention discloses a traffic light cooperation control apparatus comprising a state acquisition module 100, a reinforcement learning module 200 and a training module 300, which are all in communication connection with one another. The state acquisition module 100 is configured to acquire the states of traffic lights in a preset area to obtain a signal light state set; the reinforcement learning module 200 is configured to use the signal lamp state set as the input of a preset reinforcement learning model, and use the preset signal lamp switching state set as the output of the preset reinforcement learning model to obtain an action state value function; the training module 300 is configured to train the action state value function by using a preset training algorithm to obtain a signal lamp control strategy.
Firstly, the state acquisition module 100 acquires the states of the traffic lights in the preset area, so that the states of all traffic lights in the area can be obtained and the completeness of the collected state information is guaranteed; then the reinforcement learning module 200 uses the signal light state set as the input of the preset reinforcement learning model and the preset signal light switching state set as its output; secondly, the reinforcement learning module 200 can also obtain an action state value function from the signal light state set and the preset signal light switching state set; finally, the training module 300 inputs the action state value function into the preset training algorithm to obtain an approximately globally optimal strategy, thereby improving the performance of traffic light cooperative control.
The specific operation process of the traffic light cooperation control device refers to the above traffic light cooperation control method, and is not described herein again.
The above-described embodiments of the apparatus are merely illustrative, wherein the units illustrated as separate components may or may not be physically separate, i.e. may be located in one place, or may also be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
One of ordinary skill in the art will appreciate that all or some of the steps, systems, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as is well known to those of ordinary skill in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media as known to those skilled in the art.
The embodiments of the present invention have been described in detail with reference to the accompanying drawings, but the present invention is not limited to the above embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the gist of the present invention. Furthermore, the embodiments of the present invention and the features of the embodiments may be combined with each other without conflict.

Claims (10)

1. A traffic light cooperative control method characterized by comprising:
acquiring the states of traffic lights in a preset area to obtain a signal light state set;
taking the signal lamp state set as the input of a preset reinforcement learning model, and taking a preset signal lamp switching state set as the output of the preset reinforcement learning model to obtain an action state value function;
and training the action state value function by adopting a preset training algorithm to obtain a signal lamp control strategy.
2. The traffic light cooperation control method according to claim 1, wherein the taking the set of signal light states as an input of a preset reinforcement learning model and the set of preset signal light switching states as an output of the preset reinforcement learning model comprises:
taking the signal lamp state set as the input of the preset reinforcement learning model to obtain a feature vector;
obtaining a cooperation vector according to a preset local attention model and the feature vector;
and obtaining the preset signal lamp switching state set according to the characteristic vector and the cooperation vector.
3. The traffic light cooperation control method according to claim 1, wherein the taking the signal light state set as an input of a preset reinforcement learning model and the taking the preset signal light switching state set as an output of the preset reinforcement learning model to obtain the action state value function comprises:
taking the signal lamp state set as the input of the preset reinforcement learning model;
taking a preset signal lamp switching state set as the output of the preset reinforcement learning model to obtain a reward value and a reward attenuation coefficient;
and obtaining the action state value function according to the reward value, the reward attenuation coefficient, the preset signal lamp switching state set and the signal lamp state set.
4. The traffic light cooperative control method according to claim 1, wherein the preset training algorithm comprises: a gradient descent training algorithm.
5. The traffic light cooperative control method according to claim 3, wherein the training the action state value function by using a preset training algorithm to obtain a signal light control strategy comprises:
training the action state value function by adopting a gradient descent training algorithm to obtain a value parameter;
and substituting the value parameters into the action state value function, and determining a preset signal lamp switching state set as a signal lamp control strategy.
6. The traffic light cooperative control method according to claim 5, wherein the training of the action state value function by using a preset training algorithm to obtain a signal light control strategy further comprises:
and updating the value parameters of the action state value function periodically according to a preset updating mode.
7. The traffic light cooperative control method according to claim 1, wherein the training of the action state value function using a preset training algorithm to obtain a signal light control strategy further comprises:
substituting the signal lamp control strategy into the action state value function to obtain a predicted action value;
acquiring a signal lamp controlled according to a signal control strategy to obtain an actual target action value;
and determining a loss function according to the predicted action value and the target action value.
8. Traffic light cooperative control apparatus, characterized by comprising:
the state acquisition module is used for acquiring the states of the traffic lights in the preset area to obtain a signal light state set;
the reinforcement learning module is used for taking the signal lamp state set as the input of a preset reinforcement learning model and taking a preset signal lamp switching state set as the output of the preset reinforcement learning model to obtain an action state value function;
and the training module is used for training the action state value function by adopting a preset training algorithm to obtain a signal lamp control strategy.
9. Traffic light cooperative control apparatus characterized by comprising:
at least one processor, and,
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the traffic light cooperation control method of any one of claims 1 to 7.
10. A computer-readable storage medium storing computer-executable instructions for causing a computer to execute the traffic light cooperation control method according to any one of claims 1 to 7.
CN202111321571.1A 2021-11-09 2021-11-09 Traffic light cooperation control method, device, equipment and computer readable storage medium Pending CN114141033A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111321571.1A CN114141033A (en) 2021-11-09 2021-11-09 Traffic light cooperation control method, device, equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111321571.1A CN114141033A (en) 2021-11-09 2021-11-09 Traffic light cooperation control method, device, equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN114141033A true CN114141033A (en) 2022-03-04

Family

ID=80392881

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111321571.1A Pending CN114141033A (en) 2021-11-09 2021-11-09 Traffic light cooperation control method, device, equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN114141033A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111629218A (en) * 2020-04-29 2020-09-04 南京邮电大学 Accelerated reinforcement learning edge caching method based on time-varying linearity in VANET
CN112632858A (en) * 2020-12-23 2021-04-09 浙江工业大学 Traffic light signal control method based on Actor-critical frame deep reinforcement learning algorithm
CN112669629A (en) * 2020-12-17 2021-04-16 北京建筑大学 Real-time traffic signal control method and device based on deep reinforcement learning
US20210197855A1 (en) * 2018-12-13 2021-07-01 Huawei Technologies Co., Ltd. Self-Driving Method, Training Method, and Related Apparatus

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210197855A1 (en) * 2018-12-13 2021-07-01 Huawei Technologies Co., Ltd. Self-Driving Method, Training Method, and Related Apparatus
CN111629218A (en) * 2020-04-29 2020-09-04 南京邮电大学 Accelerated reinforcement learning edge caching method based on time-varying linearity in VANET
CN112669629A (en) * 2020-12-17 2021-04-16 北京建筑大学 Real-time traffic signal control method and device based on deep reinforcement learning
CN112632858A (en) * 2020-12-23 2021-04-09 浙江工业大学 Traffic light signal control method based on Actor-critical frame deep reinforcement learning algorithm

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JIASHI GAO: "Attn-CommNet: Coordinated Traffic Lights Control", 《2021 IEEE 33RD INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE (ICTAI)》 *
孔松涛: "深度强化学习在智能制造中的应用展望综述", 《计算机工程与应用》 *

Similar Documents

Publication Publication Date Title
CN109635917B (en) Multi-agent cooperation decision and training method
CN111061277B (en) Unmanned vehicle global path planning method and device
Mousavi et al. Traffic light control using deep policy‐gradient and value‐function‐based reinforcement learning
Zhu et al. Human-like autonomous car-following model with deep reinforcement learning
CN110766038B (en) Unsupervised landform classification model training and landform image construction method
Coşkun et al. Deep reinforcement learning for traffic light optimization
CN113257016B (en) Traffic signal control method and device and readable storage medium
Chu et al. Traffic signal control using end-to-end off-policy deep reinforcement learning
CN114333357B (en) Traffic signal control method and device, electronic equipment and storage medium
CN115018039A (en) Neural network distillation method, target detection method and device
CN115951587A (en) Automatic driving control method, device, equipment, medium and automatic driving vehicle
Jaafra et al. Context-aware autonomous driving using meta-reinforcement learning
Li et al. Cycle-based signal timing with traffic flow prediction for dynamic environment
CN111507499B (en) Method, device and system for constructing model for prediction and testing method
CN116758767B (en) Traffic signal lamp control method based on multi-strategy reinforcement learning
Jiang et al. A general scenario-agnostic reinforcement learning for traffic signal control
Kalweit et al. Deep surrogate Q-learning for autonomous driving
CN116861262A (en) Perception model training method and device, electronic equipment and storage medium
CN114141033A (en) Traffic light cooperation control method, device, equipment and computer readable storage medium
CN109697511B (en) Data reasoning method and device and computer equipment
CN115965144A (en) Ship traffic flow prediction method, system, device and storage medium
Lee Differentiable sparsification for deep neural networks
CN116258253A (en) Vehicle OD prediction method based on Bayesian neural network
CN115630361A (en) Attention distillation-based federal learning backdoor defense method
Zhao et al. A survey on deep reinforcement learning approaches for traffic signal control

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20220304

RJ01 Rejection of invention patent application after publication