CN114141033A - Traffic light cooperation control method, device, equipment and computer readable storage medium - Google Patents

Traffic light cooperation control method, device, equipment and computer readable storage medium Download PDF

Info

Publication number
CN114141033A
CN114141033A
Authority
CN
China
Prior art keywords
preset
signal lamp
reinforcement learning
action
state set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111321571.1A
Other languages
Chinese (zh)
Inventor
余剑峤
高嘉时
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southwest University of Science and Technology
Original Assignee
Southwest University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwest University of Science and Technology filed Critical Southwest University of Science and Technology
Priority to CN202111321571.1A priority Critical patent/CN114141033A/en
Publication of CN114141033A publication Critical patent/CN114141033A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G08SIGNALLING
    • G08GTRAFFIC CONTROL SYSTEMS
    • G08G1/00Traffic control systems for road vehicles
    • G08G1/07Controlling traffic signals
    • G08G1/081Plural intersections under common control
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G08SIGNALLING
    • G08GTRAFFIC CONTROL SYSTEMS
    • G08G1/00Traffic control systems for road vehicles
    • G08G1/09Arrangements for giving variable traffic instructions
    • G08G1/095Traffic lights

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Traffic Control Systems (AREA)

Abstract

The invention discloses a traffic light cooperation control method, device, equipment and computer-readable storage medium. The method inputs the states of the traffic lights into a reinforcement learning model for reinforcement learning; after multiple rounds of training with the training algorithm, it can achieve good performance on traffic networks of various scales, especially on large-scale traffic networks.

Description

Traffic light cooperation control method, device, equipment and computer readable storage medium
Technical Field
The invention relates to the technical field of reinforcement learning, and in particular to a traffic light cooperation control method, device, equipment and computer-readable storage medium.
Background
At present, traffic congestion in urban areas causes serious problems, such as long waiting times, high fuel consumption, and increased emission of harmful gases. One of the effective ways to address congestion is to control the traffic lights more intelligently. Since the control strategies of the signal lights at neighboring intersections are highly interdependent, cooperative control of traffic signals over a large area is crucial.
Reinforcement Learning (RL) techniques are an effective means of cooperatively controlling traffic signals. In earlier approaches, information and parameters were not shared between individual traffic lights; instead, each light updated its own control network independently. Such a distributed control method causes a contradiction between the individual optimal strategy and the global optimal strategy, which in turn leads to non-stationarity of strategy convergence. Centralized RL solves the problem of conflict among individual strategies by generating a control strategy from the joint state of multiple signal lamps. However, the joint state space of all individuals is high-dimensional, which imposes a computational burden in both time and space. Another advanced RL method, CoLight, uses a locally designed attention network to select neighbors that participate in the decision of the target signal light. This network avoids the dimensionality problem, but the neighboring individuals used for communication are predefined and fixed, making it difficult to adapt to a dynamically changing traffic environment.
Disclosure of Invention
The present invention is directed to solving at least one of the problems of the prior art. Therefore, the invention provides a traffic light cooperative control method that achieves good performance on traffic networks of various scales, particularly on large-scale traffic networks.
In a first aspect, an embodiment of the present invention provides a traffic light cooperation control method, including:
acquiring the states of traffic lights in a preset area to obtain a signal light state set;
taking the signal lamp state set as the input of a preset reinforcement learning model, and taking a preset signal lamp switching state set as the output of the preset reinforcement learning model to obtain an action state value function;
and training the action state value function by adopting a preset training algorithm to obtain a signal lamp control strategy.
The traffic light cooperation control method provided by the embodiment of the invention has at least the following beneficial effects: by acquiring the states of the traffic lights in the preset area, the states of all traffic lights in the preset area can be obtained, so that a joint state over multiple traffic lights is realized and the problem of conflict among individual strategies is solved; the signal light state set is used as the input of the preset reinforcement learning model and the preset signal light switching state set as its output, so that the action state value function can be obtained from the signal light state set and the preset signal light switching state set; and the action state value function is substituted into the preset training algorithm to obtain an approximately globally optimal strategy, thereby improving the performance of traffic light cooperative control.
According to another embodiment of the traffic light cooperation control method, the using the set of signal light states as an input of a preset reinforcement learning model, and using the set of preset signal light switching states as an output of the preset reinforcement learning model includes:
taking the signal lamp state set as the input of a preset reinforcement learning model to obtain a characteristic vector;
obtaining a cooperation vector according to a preset local attention model and the feature vector;
and obtaining the preset signal lamp switching state set according to the characteristic vector and the cooperation vector.
According to the traffic light cooperation control method of another embodiment of the invention, using the signal light state set as the input of a preset reinforcement learning model and using the preset signal light switching state set as the output of the preset reinforcement learning model to obtain an action state value function includes:
taking the signal lamp state set as the input of the preset reinforcement learning model;
taking a preset signal lamp switching state set as the output of the preset reinforcement learning model to obtain a reward value and a reward attenuation coefficient;
and obtaining the action state value function according to the reward value, the reward attenuation coefficient, the preset signal lamp switching state set and the signal lamp state set.
According to the traffic light cooperative control method of other embodiments of the present invention, the preset training algorithm includes: a gradient descent training algorithm.
According to another embodiment of the traffic light cooperative control method, the training of the action state value function by using a preset training algorithm to obtain a signal light control strategy includes:
training the action state value function by adopting a gradient descent training algorithm to obtain a value parameter;
and substituting the value parameters into the action state value function, and determining a preset signal lamp switching state set as a signal lamp control strategy.
According to the traffic light cooperative control method of another embodiment of the present invention, the training of the action state value function by using a preset training algorithm to obtain a signal light control strategy further includes:
and updating the value parameters of the action state value function periodically according to a preset updating mode.
According to the traffic light cooperative control method of another embodiment of the present invention, the training of the action state value function by using a preset training algorithm to obtain a signal light control strategy further includes:
substituting the signal lamp control strategy into the action state value function to obtain a predicted action value;
acquiring a signal lamp controlled according to a signal control strategy to obtain an actual target action value;
and determining a loss function according to the predicted action value and the target action value.
In a second aspect, an embodiment of the present invention provides a traffic light cooperation control apparatus: the method comprises the following steps:
the state acquisition module is used for acquiring the states of the traffic lights in the preset area to obtain a signal light state set;
the reinforcement learning module is used for taking the signal lamp state set as the input of a preset reinforcement learning model and taking a preset signal lamp switching state set as the output of the preset reinforcement learning model to obtain an action state value function;
and the training module is used for training the action state value function by adopting a preset training algorithm to obtain a signal lamp control strategy.
The traffic light cooperation control device provided by the embodiment of the invention has at least the following beneficial effects: the state acquisition module acquires the states of the traffic lights in the preset area, so the states of all traffic lights in the area can be obtained and the completeness of the collected state information is guaranteed; the reinforcement learning module uses the signal light state set as the input of the preset reinforcement learning model and the preset signal light switching state set as its output, and can also obtain an action state value function from the signal light state set and the preset signal light switching state set; and the training module inputs the action state value function into the preset training algorithm to obtain an approximately globally optimal strategy, thereby improving the performance of traffic light cooperative control.
In a third aspect, an embodiment of the present invention provides a traffic-light cooperative control apparatus: the method comprises the following steps:
at least one processor, and,
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the traffic light cooperation control method according to the first aspect.
In a fourth aspect, the embodiment of the present invention further provides a computer-readable storage medium, where computer-executable instructions are stored, and the computer-executable instructions are configured to cause a computer to execute the traffic light cooperation control method according to the first aspect.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and drawings.
Drawings
FIG. 1 is a flow chart illustrating a method for cooperative control of traffic lights according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating an embodiment of step S200 of FIG. 1;
FIG. 3 is a schematic flow chart of another embodiment of step S200 in FIG. 1;
FIG. 4 is a flowchart illustrating an embodiment of step S300 of FIG. 1;
FIG. 5 is a schematic flow chart of another embodiment of step S300 in FIG. 1;
FIG. 6 is a schematic flow chart illustrating another embodiment of step S300 in FIG. 1;
fig. 7 is a block diagram of a traffic light cooperative control apparatus according to an embodiment of the present invention.
Description of the drawings:
the system comprises a state acquisition module 100, a reinforcement learning module 200 and a training module 300.
Detailed Description
The concept and technical effects of the present invention will be clearly and completely described below in conjunction with the embodiments, so that the objects, features and effects of the present invention can be fully understood. It is obvious that the described embodiments are only a part of the embodiments of the present invention, rather than all of them; based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without inventive effort fall within the protection scope of the present invention.
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
It should be noted that although functional block divisions are provided in the system drawings and logical orders are shown in the flowcharts, in some cases, the steps shown and described may be performed in different orders than the block divisions in the systems or in the flowcharts.
In the description of the present invention, unless otherwise explicitly limited, terms such as arrangement, installation, connection and the like should be understood in a broad sense, and those skilled in the art can reasonably determine the specific meanings of the above terms in the present invention in combination with the specific contents of the technical solutions.
First, several terms referred to in the present application are resolved:
sigmoid activation function: the sigmoid function is an S-shaped curve commonly seen in biology, also called the sigmoid growth curve. In information science, because it is monotonically increasing and its inverse function is also monotonically increasing, the sigmoid function is often used as the activation function of a neural network, mapping variables into the interval [0,1].
Local attention model: the local attention model comprises a local selection module and an attention module. The local selection module is constructed from a local selection mechanism, and the attention module is constructed from an attention mechanism; the attention mechanism simulates the way humans observe. Generally, when we observe a scene, we first take in the whole scene; but when we need to understand a certain target in depth, we focus on that target and may even move closer to examine its texture carefully. Similarly, in deep learning, the extracted information by default flows backwards with equal importance; if some prior information is known, the flow of invalid information can be suppressed according to that prior, so that the important information is retained.
softmax function: the softmax (logistic regression) model is a generalization of the logistic regression model to multi-class problems, in which the class label y can take more than two values. After the information passes through softmax, the probability of each logit is obtained, which can be regarded as the importance the model assigns to each logit; the larger the probability, the more useful that information is to the model.
Activation function: a function that runs on the neurons of an artificial neural network and is responsible for mapping the input of a neuron to its output. Activation functions play an important role in enabling an artificial neural network model to learn and understand very complex, non-linear functions; introducing them into the network adds non-linear characteristics and increases the non-linearity of the neural network model. The non-linear activation function tanh is one of the hyperbolic functions; tanh() is the hyperbolic tangent. In mathematics, the hyperbolic tangent tanh is derived from the basic hyperbolic functions, hyperbolic sine and hyperbolic cosine, with the formula tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x)).
Bellman equation: also called the dynamic programming equation, it is a necessary condition for optimality in the dynamic programming method of mathematical optimization. The equation expresses the value of the decision problem at a particular time in terms of the reward from the initial choice plus the value of the remaining decision problem that results from that initial choice. In this way, a dynamic optimization problem is broken into simpler sub-problems that obey Bellman's principle of optimality.
Referring to fig. 1, a flow chart of a traffic light cooperation control method in an embodiment of the invention is shown. The embodiment discloses a traffic light cooperation control method, which specifically includes, but is not limited to, steps S100 to S300.
Step S100, acquiring the states of traffic lights in a preset area to obtain a signal light state set;
firstly, the area to be predicted is set as the preset area; then the states of all traffic lights are obtained according to the acquired state of each traffic light in the preset area; finally, the signal light state set is established from the obtained traffic light states.
Specifically, the signal lamp state set is S_t = {s_t^1, s_t^2, …, s_t^N}, where N is the number of intersections containing traffic lights and s_t^i is the state of the i-th traffic light at time step t. The state of a traffic light comprises the signal phase of the current traffic light and the number of vehicles on each lane connected to the intersection of that traffic light. The signal phase of a traffic light refers to the display state of the signal group corresponding to one or more traffic flows that obtain the right of way at the same time; specifically, within one signal cycle, one or more traffic flows obtain exactly the same signal light color at any moment, and the continuous time sequence over which the different light colors (green, yellow and all-red) are obtained is called a signal phase. Each signal phase periodically and alternately obtains the green display, i.e. the right of way through the intersection; each transition of the right of way is called a signal phase stage, and one signal cycle is the sum of all the preset phase time segments.
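For illustration only, a state vector s_t^i of this kind could be assembled as in the following Python sketch; the helper name and the exact feature layout (a one-hot signal phase followed by per-lane vehicle counts) are assumptions, since the application only specifies which quantities the state contains:

```python
import numpy as np

def build_intersection_state(phase_index, num_phases, lane_vehicle_counts):
    """Assemble s_t^i for intersection i: one-hot current signal phase followed by
    the vehicle count on each lane connected to the intersection (layout assumed)."""
    phase_onehot = np.zeros(num_phases)
    phase_onehot[phase_index] = 1.0
    return np.concatenate([phase_onehot, np.asarray(lane_vehicle_counts, dtype=float)])

# e.g. 4 phases, current phase index 2, vehicle counts on 8 connected lanes
s_i_t = build_intersection_state(2, 4, [3, 0, 5, 1, 2, 7, 0, 4])
```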
Step S200, using the signal lamp state set as the input of a preset reinforcement learning model, and using the preset signal lamp switching state set as the output of the preset reinforcement learning model to obtain an action state value function;
in step S200, the signal lamp state set is input into the preset reinforcement learning model; after reinforcement learning, the preset signal lamp switching state set can be obtained and is output by the preset reinforcement learning model, and the action state value function can then be constructed from the signal lamp state set and the preset signal lamp switching state set.
And step S300, training the action state value function by adopting a preset training algorithm to obtain a signal lamp control strategy.
Step S100 and step S200 are repeated through the preset training algorithm to obtain multiple groups of signal lamp state sets and multiple groups of preset signal lamp switching state sets, and thereby multiple action state value functions; the action state value functions are then trained multiple times, and the signal lamp control strategy is obtained when the maximum number of training iterations is reached.
Referring to fig. 2, a flow chart of a traffic light cooperation control method in an embodiment of the invention is shown. The embodiment discloses a traffic light cooperation control method, which specifically includes, but is not limited to, steps S210 to S230.
Step S210, using the signal lamp state set as the input of a preset reinforcement learning model to obtain a characteristic vector;
in step S210, the signal lamp state set is input to the preset reinforcement learning model, and each signal lamp state in the signal lamp state set passes through a feature coding layer of the preset reinforcement learning model, where the feature coding layer is a first layer of the preset reinforcement learning model, and the feature coding layer performs feature extraction on each signal lamp state and can obtain a feature vector corresponding to each signal lamp state.
It should be noted that the preset reinforcement learning model is named the Attn-CommNet model. The first layer of the Attn-CommNet model is set as a feature coding layer in the form of a single-layer neural network with a sigmoid activation function; it receives the signal lamp state set, encodes it, and finally outputs the feature vectors to the f_i^j modules. The feature coding layer correspondingly generates and outputs one feature vector per input signal lamp state. For example: given the input {s_1, s_2, …, s_i, …, s_N}, with N input signal lamp states in total, N feature vectors {h_1^0, h_2^0, …, h_i^0, …, h_N^0} are produced, where h_1^0 is generated from signal lamp state s_1 and the other feature vectors are generated from the signal lamp states with the same subscript. The generated feature vectors {h_1^0, h_2^0, …, h_i^0, …, h_N^0} are then output to the corresponding f_i^j modules, i.e. h_1^0 is output to the f_1^0 module, and the other feature vectors are likewise output to the f_i^j module with the same subscript and superscript. Feature coding here refers to the process of encoding the states into feature vectors.
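A minimal sketch of such a feature coding layer, written in PyTorch purely for illustration (the layer sizes and class name are assumptions), maps each signal lamp state s_i to its feature vector h_i^0 with a single linear layer followed by a sigmoid activation:

```python
import torch
import torch.nn as nn

class FeatureEncoder(nn.Module):
    """Single-layer feature coding layer with sigmoid activation: s_i -> h_i^0."""
    def __init__(self, state_dim, feature_dim):
        super().__init__()
        self.linear = nn.Linear(state_dim, feature_dim)

    def forward(self, states):
        # states: (N, state_dim), one row per intersection
        return torch.sigmoid(self.linear(states))   # (N, feature_dim) = {h_1^0, ..., h_N^0}
```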
Step S220, obtaining a cooperation vector according to a preset local attention model and a feature vector;
in step S220, after each feature vector h_i^j is output to the preset local attention model, the feature vector first passes through the local selection module in the local attention model. The local selection module performs local selection on the feature vector h_i^j and finds the feature vectors of the N_i adjacent states; the feature vectors of the N_i adjacent states are then input into the attention module in the local attention model, and the attention module obtains the cooperation vector c_i^j corresponding to each feature vector h_i^j.
It should be noted that outputting the feature vector to the preset local attention model means outputting the feature vector h_i^j to the local attention model; the cooperation vector c_i^j is obtained by calculation and output to the corresponding f_i^j module. That is, the local attention model is a model comprising a local selection module and an attention module, and the second layer of the preset reinforcement learning model and the layers after it are constructed from the local attention model. The feature vector h_i^j, the cooperation vector c_i^j and the f_i^j module are superscripted with the corresponding layer number; when j = 0 they belong to the 0-th layer, i.e. {f_1^0, f_2^0, …, f_i^0, …, f_N^0} obtained after the signal lamp states pass through the feature coding layer.
First, the feature vector h_i^j is output to the local selection module, which, from the concatenated vectors {h_1, h_2, …, h_N} of all individuals, selects for each individual i the adjacent states h_k with k ∈ N_i. After local selection, the attention module takes the concatenated vectors of the neighboring individuals as input. To obtain the degree to which adjacent individuals influence the strategy of a given individual, we first define in the attention module two weight matrices W_T and W_N. The degree of influence e_{i,k} of an individual k on the strategy of individual i is computed by the formula e_{i,k} = (h_i W_T)(h_k W_N)^T. After that, the degree of influence of individual k on the strategy of individual i is normalized by the softmax function, giving the normalized degree of influence a_{i,k} = softmax(e_{i,k}) = exp(e_{i,k}/τ) / Σ_{k∈N_i} exp(e_{i,k}/τ). From the obtained normalized degree of influence a_{i,k} and the feature vectors of the individuals k, the corresponding cooperation vector c_i^j is computed by the formula c_i^j = Σ_{k∈N_i} a_{i,k} · h_k^j. Here, N_i in the formula for the degree of influence e_{i,k} is the set of individuals adjacent to individual i, determined by geographical distance, and τ in the normalization formula is a constant coefficient.
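The formulas above can be sketched as follows; this is an illustrative implementation only, with the tensor shapes and function name assumed:

```python
import torch
import torch.nn.functional as F

def local_attention(h, neighbor_ids, W_T, W_N, tau=1.0):
    """Cooperation vectors c_i^j for all individuals (a sketch of the formulas above).

    h            : (N, d) feature vectors h_1^j ... h_N^j of all individuals
    neighbor_ids : neighbor_ids[i] is the index list N_i of individuals adjacent to i
                   (determined by geographical distance)
    W_T, W_N     : (d, d_k) weight matrices
    tau          : constant temperature coefficient of the softmax
    """
    c = torch.zeros_like(h)
    for i in range(h.shape[0]):
        q_i = h[i] @ W_T                         # (d_k,)
        k_n = h[neighbor_ids[i]] @ W_N           # (|N_i|, d_k)
        e_ik = k_n @ q_i                         # e_{i,k} = (h_i W_T)(h_k W_N)^T
        a_ik = F.softmax(e_ik / tau, dim=0)      # normalized influence a_{i,k}
        c[i] = a_ik @ h[neighbor_ids[i]]         # c_i^j = sum_{k in N_i} a_{i,k} * h_k^j
    return c
```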
And step S230, obtaining a preset signal lamp switching state set according to the characteristic vector and the cooperation vector.
The feature vector h_i^j and the cooperation vector c_i^j are input into the f_i^j module, which yields the feature vector h_i^{j+1} of the next layer; from h_i^{j+1} and the local attention module of the next layer, the corresponding cooperation vector c_i^{j+1} can be obtained. Steps S220 and S230 are executed cyclically until the f_i^j modules of the last layer of the model are reached. In the last layer of the model, a decoding layer with softmax as the activation function outputs the action value distribution over the action space; gathering these action value distributions into a set gives the preset signal lamp switching state set.
In addition, the f_i^j module is a linear neural network with the non-linear activation function tanh, and can be expressed by the formula h_i^{j+1} = f_i^j(h_i^j, c_i^j) = tanh(H_j · h_i^j + C_j · c_i^j), where H_j and C_j are the corresponding coefficient matrices.
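As an illustrative sketch of the f_i^j module and of the final decoding layer (the feature dimension and number of phases below are assumed values):

```python
import torch
import torch.nn as nn

class CommLayer(nn.Module):
    """f^j module: h_i^{j+1} = tanh(H_j * h_i^j + C_j * c_i^j)."""
    def __init__(self, dim):
        super().__init__()
        self.H = nn.Linear(dim, dim, bias=False)   # coefficient matrix H_j
        self.C = nn.Linear(dim, dim, bias=False)   # coefficient matrix C_j

    def forward(self, h, c):
        return torch.tanh(self.H(h) + self.C(c))

# last layer: decoding layer with softmax, outputting the action value distribution
decoder = nn.Sequential(nn.Linear(64, 4), nn.Softmax(dim=-1))   # 64-dim features, 4 phases assumed
```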
Referring to fig. 3, a flow chart of a traffic light cooperation control method in an embodiment of the invention is shown. The embodiment discloses a traffic light cooperation control method, which specifically includes, but is not limited to, steps S240 to S260.
Step S240, using the signal lamp state set as the input of a preset reinforcement learning model;
in step S240, the signal lamp state set is input to the preset reinforcement learning model, and after steps S210, S220, and S230, a preset signal lamp switching state set can be obtained.
Step S250, a preset signal lamp switching state set is used as the output of a preset reinforcement learning model to obtain a reward value and a reward attenuation coefficient;
in step S250, the preset signal lamp switching state set is output, and the signal lamp state set and the preset signal lamp switching state set are substituted into the corresponding formula for numerical computation, so as to obtain the corresponding reward value R(S_k, A_k) and the reward attenuation coefficient γ.
It should be noted that substituting the signal lamp state set and the preset signal lamp switching state set into the corresponding formula for numerical computation means using the average queue length on each lane of the intersection as the reward function. The reward attenuation coefficient γ reflects the fact that the corresponding reward value R(S_k, A_k) carries an uncertainty factor, and this uncertainty causes the corresponding reward value R(S_k, A_k) to be attenuated.
And step S260, obtaining an action state value function according to the reward value, the reward attenuation coefficient, the preset signal lamp switching state set and the signal lamp state set.
It should be noted that in the traffic light cooperative control problem the traffic flow is random, cannot be modeled exactly, and can be regarded as a stochastic process. At the same time, the conditional probability distribution of the future state depends only on the current state. Therefore, we model the cooperative traffic light control problem as an MDP, where MDP refers to a Markov decision process described by a tuple (S, A, P, R, γ), in which S is a finite state set, A is a finite action set, P is the state transition probability, R is the return function, and γ is the discount factor used to compute the accumulated return. In addition, the goal of reinforcement learning is, for a given MDP scenario, to find a strategy such that the action of the agent in each state is optimal and the expected total return is maximal. The MDP constructed in the present application is defined by the five components <S, A, P, R, γ>, where S is the state space, A is the space of all available actions, P is the state transition probability, R is the reward, and γ is the coefficient by which the reward decays over time. They are introduced as follows:
The state space, i.e. the signal lamp state set, is defined as S_t = {s_t^1, s_t^2, …, s_t^N}, where N is the number of intersections containing traffic lights and s_t^i is the state of the i-th traffic light at time step t. The state of a traffic light comprises the signal phase of the current traffic light and the number of vehicles on each lane connected to the intersection of that traffic light, where the signal phase of a traffic light refers to the display state of the signal group corresponding to one or more traffic flows that obtain the right of way at the same time; specifically, within one signal cycle, one or more traffic flows obtain exactly the same signal light color at any moment, and the continuous time sequence over which the different light colors (green, yellow and all-red) are obtained is called a signal phase. Each signal phase periodically and alternately obtains the green display, i.e. the right of way through the intersection; each transition of the right of way is called a signal phase stage, and one signal cycle is the sum of all the preset phase time segments.
The action space includes all feasible actions; the action space is also the preset signal lamp switching state set. In the present application, the action space of the cooperative traffic light control problem is the complete set of all feasible switching states, which can be denoted A_t = {phase_1, phase_2, …, phase_N}, where N is the total number of switchable states. At each time step, each individual selects a state from the action space as its action, i.e. as its phase at the next instant; the action of individual i at time step t is defined as a_t^i ∈ A_t. An individual may refer to an object moving on the road, such as a person, an automobile, or an electric vehicle, and is not particularly limited in this application. A sketch of the phase selection step is given below.
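As an illustration of how an individual i might pick its next phase from the action value distribution produced by the decoding layer (the exploration strategy is not specified in this application, so a simple greedy choice is assumed here):

```python
import torch

def select_phase(action_values):
    """Pick a_t^i: the switchable phase with the highest action value for individual i."""
    return int(torch.argmax(action_values).item())

# e.g. an intersection with 4 switchable phases
a_i_t = select_phase(torch.tensor([0.1, 0.5, 0.3, 0.1]))   # -> 1
```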
The reward space: our goal is to minimize the travel time of all vehicles in the area, which is difficult to optimize directly. Therefore, we use the average queue length on each lane of the intersection as the reward function. The reward space can likewise be defined as R_t = {r_t^1, r_t^2, …, r_t^N}.
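A minimal sketch of such a reward, with the sign convention assumed (the application only states that the average queue length is used; the negation below is an assumption so that shorter queues yield larger rewards):

```python
def queue_reward(lane_queue_lengths):
    """Reward r_t^i for intersection i: (negated) average queue length over its lanes.
    The negative sign is an assumption, not stated in the application."""
    return -sum(lane_queue_lengths) / len(lane_queue_lengths)

# e.g. queues of 4, 0, 6 and 2 vehicles on the four lanes of an intersection
r_i_t = queue_reward([4, 0, 6, 2])   # -> -3.0
```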
A policy function parameterized by θ is defined by the formula π_θ(S, A) = p(A | S, θ), and π_θ is found from the signal lamp state set and the signal lamp switching state set, where p is an action probability distribution conditioned on the state S and the parameter θ. The policy is evaluated by the accumulated expected reward, and the action state value function Q is derived by the formula Q(S_t, A_t; π_θ) = E_{π_θ}[ Σ_{k=t}^{T} γ^(k−t) R(S_k, A_k) ].
It should be noted that Q is called the action state value function and represents the accumulated expected reward obtained by following the policy, where E denotes the mathematical expectation and T is the total number of time steps of the whole evaluation period.
Referring to fig. 4, a flow chart of a traffic light cooperation control method in an embodiment of the invention is shown. The embodiment discloses a traffic light cooperation control method, which specifically includes, but is not limited to, steps S310 to S320.
Step S310, training the motion state value function by adopting a gradient descent training algorithm to obtain a value parameter;
the value parameter omega' can be obtained by training the action state value function Q through a gradient descent training algorithm. The optimization idea of the gradient descent algorithm is to use the direction of the negative gradient at the current position as the search direction, the direction is taken as the fastest descent direction at the current position, and the closer the gradient descent is to the target value, the smaller the variation is.
And step S320, substituting the value parameters into the action state value function, and determining a preset signal lamp switching state set as a signal lamp control strategy.
Determining the preset signal lamp switching state set as the signal lamp control strategy means obtaining the optimal strategy within the preset signal lamp switching state set, and the optimal strategy is obtained through interaction with the environment. Therefore, we cannot directly obtain the optimal solution of the MDP from the Bellman equation. To solve this problem, we use two model-free Q networks, Q_ω and Q_ω', with parameters ω and ω', and obtain the optimal strategy by interacting with the environment. According to the Bellman equation, it can be defined as Q_ω'(S_t, A_t) = R_t + γ max Q_ω'(S_{t+1}, A_{t+1}): the current action state value depends on the reward value R brought by the current action and on the action state value at the next moment, i.e. max Q_ω'(S_{t+1}, A_{t+1}) in the formula.
It should be noted that the Q networks Q_ω and Q_ω' are initialized randomly with the initial weights ω and ω', the sample buffer R is initialized (initially an empty set), actions are performed, the reward r is obtained, and the state transition vector <s_t, a_t, s_{t+1}, r_t> is stored in the sample buffer R; N state transition vectors are then sampled from the state buffer as training samples, where the state buffer is used for storing the state transition vectors.
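A minimal sketch of such a sample buffer (capacity and names are assumptions):

```python
import random
from collections import deque

class ReplayBuffer:
    """Sample buffer R holding state transition vectors <s_t, a_t, s_{t+1}, r_t>."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)    # starts out empty

    def store(self, s_t, a_t, s_next, r_t):
        self.buffer.append((s_t, a_t, s_next, r_t))

    def sample(self, n):
        return random.sample(self.buffer, n)    # n transition vectors as training samples
```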
Referring to fig. 5, a flow chart of a traffic light cooperation control method in an embodiment of the invention is shown. The embodiment discloses a traffic light cooperation control method, which specifically includes, but is not limited to, step S330.
Step S330, periodically updating the value parameters of the action state value function according to a preset updating mode.
Q_ω'(S_t, A_t) is periodically updated according to the preset updating mode, where ω' = τω + (1 − τ)ω', thereby improving learning stability.
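For illustration, the periodic update ω' = τω + (1 − τ)ω' can be sketched as a soft copy of the trained parameters into the target network:

```python
import torch

def soft_update(q_net, target_net, tau):
    """Periodic target update: omega' = tau * omega + (1 - tau) * omega'."""
    with torch.no_grad():
        for w, w_target in zip(q_net.parameters(), target_net.parameters()):
            w_target.mul_(1.0 - tau).add_(tau * w)
```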
Referring to fig. 6, a flow chart of a traffic light cooperation control method in an embodiment of the invention is shown. The embodiment discloses a traffic light cooperation control method, which specifically includes, but is not limited to, steps S340 to S360.
Step S340, substituting the signal control strategy into the action state value function to obtain a predicted action value;
the predicted action value Q_ω(S_t, A_t) can be obtained by substituting the signal control strategy obtained in step S320 into the action state value function and performing the numerical computation.
Step S350, acquiring a signal lamp controlled according to a signal control strategy to obtain an actual target action value;
the signal lamp is controlled according to the signal control strategy obtained in step S320, so that the actual target action value Q_ω'(S_t, A_t) can be obtained.
Step S360, determining a loss function according to the predicted action value and the target action value.
The loss function is determined from the predicted action value Q_ω(S_t, A_t) of step S340 and the target action value Q_ω'(S_t, A_t) of step S350; the loss function minimizes the squared difference between the target action value and the predicted action value and is defined as L = E[(Q_ω'(S_t, A_t) − Q_ω(S_t, A_t))^2].
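A minimal sketch of how this loss could be computed and then minimized by one gradient-descent step on the value parameters ω (the networks are assumed to output one action value per switchable phase; all names, and γ = 0.9, are illustrative):

```python
import torch
import torch.nn.functional as F

def td_loss(q_net, target_net, batch, gamma):
    """L = E[(Q_omega'(S_t, A_t) - Q_omega(S_t, A_t))^2], where the target is
    Q_omega'(S_t, A_t) = R_t + gamma * max_a Q_omega'(S_{t+1}, a)."""
    s_t, a_t, s_next, r_t = batch                                   # a_t: (B,) long tensor of chosen phases
    q_pred = q_net(s_t).gather(1, a_t.unsqueeze(1)).squeeze(1)      # Q_omega(S_t, A_t)
    with torch.no_grad():
        q_target = r_t + gamma * target_net(s_next).max(dim=1).values
    return F.mse_loss(q_pred, q_target)

# one gradient-descent step on omega:
# loss = td_loss(q_net, target_net, batch, gamma=0.9)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```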
It should be noted that the loss function compares the minimized target action value with the predicted action value; by comparing the predicted action values of all predicted actions with the minimized target action value, a comparison result corresponding to each predicted action can be obtained, and the optimal predicted action can be obtained according to the magnitude of the comparison results. The optimal predicted action is the traffic light control strategy. Here, all the predicted actions include the predicted action of each signalized road in the preset area, and the parameters of the minimized target action value are continuously updated according to step S330.
Referring to fig. 7, another embodiment of the present invention discloses a traffic light cooperation control apparatus comprising a state acquisition module 100, a reinforcement learning module 200 and a training module 300, which are all in communication connection with one another. The state acquisition module 100 is configured to acquire the states of traffic lights in a preset area to obtain a signal light state set; the reinforcement learning module 200 is configured to use the signal lamp state set as the input of a preset reinforcement learning model, and use the preset signal lamp switching state set as the output of the preset reinforcement learning model to obtain an action state value function; the training module 300 is configured to train the action state value function by using a preset training algorithm to obtain a signal lamp control strategy.
Firstly, the state acquisition module 100 acquires the states of the traffic lights in the preset area, so that the states of all traffic lights in the area can be obtained and the completeness of the collected state information is guaranteed; then the reinforcement learning module 200 uses the signal light state set as the input of the preset reinforcement learning model and the preset signal light switching state set as its output; secondly, the reinforcement learning module 200 can also obtain an action state value function from the signal light state set and the preset signal light switching state set; finally, the training module 300 inputs the action state value function into the preset training algorithm to obtain an approximately globally optimal strategy, thereby improving the performance of traffic light cooperative control.
The specific operation process of the traffic light cooperation control device refers to the above traffic light cooperation control method, and is not described herein again.
The above-described embodiments of the apparatus are merely illustrative, wherein the units illustrated as separate components may or may not be physically separate, i.e. may be located in one place, or may also be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
One of ordinary skill in the art will appreciate that all or some of the steps, systems, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as is well known to those of ordinary skill in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media as known to those skilled in the art.
The embodiments of the present invention have been described in detail with reference to the accompanying drawings, but the present invention is not limited to the above embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the gist of the present invention. Furthermore, the embodiments of the present invention and the features of the embodiments may be combined with each other without conflict.

Claims (10)

1. A traffic light cooperative control method characterized by comprising:
acquiring the states of traffic lights in a preset area to obtain a signal light state set;
taking the signal lamp state set as the input of a preset reinforcement learning model, and taking a preset signal lamp switching state set as the output of the preset reinforcement learning model to obtain an action state value function;
and training the action state value function by adopting a preset training algorithm to obtain a signal lamp control strategy.
2. The traffic light cooperation control method according to claim 1, wherein the taking the set of signal light states as an input of a preset reinforcement learning model and the set of preset signal light switching states as an output of the preset reinforcement learning model comprises:
taking the signal lamp state set as the input of the preset reinforcement learning model to obtain a feature vector;
obtaining a cooperation vector according to a preset local attention model and the feature vector;
and obtaining the preset signal lamp switching state set according to the characteristic vector and the cooperation vector.
3. The traffic light cooperation control method according to claim 1, wherein the taking the signal light state set as an input of a preset reinforcement learning model and the taking the preset signal light switching state set as an output of the preset reinforcement learning model to obtain the action state value function comprises:
taking the signal lamp state set as the input of the preset reinforcement learning model;
taking a preset signal lamp switching state set as the output of the preset reinforcement learning model to obtain a reward value and a reward attenuation coefficient;
and obtaining the action state value function according to the reward value, the reward attenuation coefficient, the preset signal lamp switching state set and the signal lamp state set.
4. The traffic light cooperative control method according to claim 1, wherein the preset training algorithm comprises: a gradient descent training algorithm.
5. The traffic light cooperative control method according to claim 3, wherein the training the action state value function by using a preset training algorithm to obtain a signal light control strategy comprises:
training the action state value function by adopting a gradient descent training algorithm to obtain a value parameter;
and substituting the value parameters into the action state value function, and determining a preset signal lamp switching state set as a signal lamp control strategy.
6. The traffic light cooperative control method according to claim 5, wherein the training of the action state value function by using a preset training algorithm to obtain a signal light control strategy further comprises:
and updating the value parameters of the action state value function periodically according to a preset updating mode.
7. The traffic light cooperative control method according to claim 1, wherein the training of the action state value function using a preset training algorithm to obtain a signal light control strategy further comprises:
substituting the signal lamp control strategy into the action state value function to obtain a predicted action value;
acquiring a signal lamp controlled according to a signal control strategy to obtain an actual target action value;
and determining a loss function according to the predicted action value and the target action value.
8. Traffic light cooperative control apparatus, characterized by comprising:
the state acquisition module is used for acquiring the states of the traffic lights in the preset area to obtain a signal light state set;
the reinforcement learning module is used for taking the signal lamp state set as the input of a preset reinforcement learning model and taking a preset signal lamp switching state set as the output of the preset reinforcement learning model to obtain an action state value function;
and the training module is used for training the action state value function by adopting a preset training algorithm to obtain a signal lamp control strategy.
9. Traffic light cooperative control apparatus characterized by comprising:
at least one processor, and,
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the traffic light cooperation control method of any one of claims 1 to 7.
10. A computer-readable storage medium storing computer-executable instructions for causing a computer to execute the traffic light cooperation control method according to any one of claims 1 to 7.
CN202111321571.1A 2021-11-09 2021-11-09 Traffic light cooperation control method, device, equipment and computer readable storage medium Pending CN114141033A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111321571.1A CN114141033A (en) 2021-11-09 2021-11-09 Traffic light cooperation control method, device, equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111321571.1A CN114141033A (en) 2021-11-09 2021-11-09 Traffic light cooperation control method, device, equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN114141033A true CN114141033A (en) 2022-03-04

Family

ID=80392881

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111321571.1A Pending CN114141033A (en) 2021-11-09 2021-11-09 Traffic light cooperation control method, device, equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN114141033A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111629218A (en) * 2020-04-29 2020-09-04 南京邮电大学 Accelerated reinforcement learning edge caching method based on time-varying linearity in VANET
CN112632858A (en) * 2020-12-23 2021-04-09 浙江工业大学 Traffic light signal control method based on Actor-critical frame deep reinforcement learning algorithm
CN112669629A (en) * 2020-12-17 2021-04-16 北京建筑大学 Real-time traffic signal control method and device based on deep reinforcement learning
US20210197855A1 (en) * 2018-12-13 2021-07-01 Huawei Technologies Co., Ltd. Self-Driving Method, Training Method, and Related Apparatus

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210197855A1 (en) * 2018-12-13 2021-07-01 Huawei Technologies Co., Ltd. Self-Driving Method, Training Method, and Related Apparatus
CN111629218A (en) * 2020-04-29 2020-09-04 南京邮电大学 Accelerated reinforcement learning edge caching method based on time-varying linearity in VANET
CN112669629A (en) * 2020-12-17 2021-04-16 北京建筑大学 Real-time traffic signal control method and device based on deep reinforcement learning
CN112632858A (en) * 2020-12-23 2021-04-09 浙江工业大学 Traffic light signal control method based on Actor-critical frame deep reinforcement learning algorithm

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JIASHI GAO: "Attn-CommNet: Coordinated Traffic Lights Control", 《2021 IEEE 33RD INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE (ICTAI)》 *
孔松涛: "深度强化学习在智能制造中的应用展望综述", 《计算机工程与应用》 *

Similar Documents

Publication Publication Date Title
CN109635917B (en) Multi-agent cooperation decision and training method
CN111061277B (en) Unmanned vehicle global path planning method and device
Mousavi et al. Traffic light control using deep policy‐gradient and value‐function‐based reinforcement learning
Zhu et al. Human-like autonomous car-following model with deep reinforcement learning
CN110766038B (en) Unsupervised landform classification model training and landform image construction method
Coşkun et al. Deep reinforcement learning for traffic light optimization
CN113257016B (en) Traffic signal control method and device and readable storage medium
Chu et al. Traffic signal control using end-to-end off-policy deep reinforcement learning
CN114333357B (en) Traffic signal control method and device, electronic equipment and storage medium
CN115018039A (en) Neural network distillation method, target detection method and device
CN115951587A (en) Automatic driving control method, device, equipment, medium and automatic driving vehicle
Jaafra et al. Context-aware autonomous driving using meta-reinforcement learning
Li et al. Cycle-based signal timing with traffic flow prediction for dynamic environment
CN111507499B (en) Method, device and system for constructing model for prediction and testing method
CN116758767B (en) Traffic signal lamp control method based on multi-strategy reinforcement learning
Jiang et al. A general scenario-agnostic reinforcement learning for traffic signal control
Kalweit et al. Deep surrogate Q-learning for autonomous driving
CN116861262A (en) Perception model training method and device, electronic equipment and storage medium
CN114141033A (en) Traffic light cooperation control method, device, equipment and computer readable storage medium
CN109697511B (en) Data reasoning method and device and computer equipment
CN115965144A (en) Ship traffic flow prediction method, system, device and storage medium
Lee Differentiable sparsification for deep neural networks
CN116258253A (en) Vehicle OD prediction method based on Bayesian neural network
CN115630361A (en) Attention distillation-based federal learning backdoor defense method
Zhao et al. A survey on deep reinforcement learning approaches for traffic signal control

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20220304

RJ01 Rejection of invention patent application after publication