CN114613159B - Traffic signal lamp control method, device and equipment based on deep reinforcement learning - Google Patents


Info

Publication number: CN114613159B
Authority: CN (China)
Prior art keywords: traffic, intersection, signal lamp, strategy, target
Prior art date
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number: CN202210126521.6A
Other languages: Chinese (zh)
Other versions: CN114613159A
Inventor: 江迅
Current Assignee: Beijing Luolan Spatiotemporal Data Technology Co ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original Assignee: Beijing Luolan Spatiotemporal Data Technology Co ltd
Priority date (the priority date is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Events:
• Application filed by Beijing Luolan Spatiotemporal Data Technology Co ltd
• Priority to CN202210126521.6A
• Publication of application CN114613159A
• Application granted; publication of CN114613159B
• Legal status: Active; anticipated expiration


Classifications

    • G08G 1/07: Controlling traffic signals
    • G08G 1/08: Controlling traffic signals according to detected number or speed of vehicles
    • G08G 1/0125: Traffic data processing
    • G08G 1/0129: Traffic data processing for creating historical data or processing based on historical data
    • G08G 1/0137: Measuring and analysing of parameters relative to traffic conditions for specific applications
    • G06F 30/27: Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM]
    • G06N 20/00: Machine learning
    • Y02T 10/40: Engine management systems (climate change mitigation technologies related to transportation)

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computer Hardware Design (AREA)
  • Geometry (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Traffic Control Systems (AREA)

Abstract

The application discloses a traffic signal lamp control method, device and equipment based on deep reinforcement learning, which address the technical problem that the current fixed-program control of traffic signal lamps hinders urban road traffic efficiency. The method comprises the following steps: acquiring intersection traffic data of a target traffic intersection, wherein the intersection traffic data comprise at least one of the vehicle moving speed, vehicle coordinate position, vehicle queue length, number of pedestrians, and current signal-lamp phase at the target traffic intersection; performing feature conversion on the intersection traffic data to obtain an intersection feature vector, and inputting the intersection feature vector into a trained traffic signal lamp control model to obtain a first traffic control strategy for the target traffic intersection; simulating the first traffic control strategy with a Monte Carlo tree search algorithm, and determining a second traffic control strategy for the target traffic intersection according to the simulation results; and controlling the traffic signal lamps at the target traffic intersection according to the second traffic control strategy.

Description

Traffic signal lamp control method, device and equipment based on deep reinforcement learning
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a traffic signal lamp control method, device and equipment based on deep reinforcement learning.
Background
With continued socio-economic development and accelerating urbanization, automobile ownership among urban residents in China has grown rapidly, and traffic congestion in large and medium-sized Chinese cities has become increasingly severe. Urban arterial roads form the skeleton of a city's road network, connect all of its main districts, carry heavy traffic flow, and are the road sections where congestion is most likely to occur. Therefore, to ensure traffic safety and alleviate urban congestion, corresponding control measures are required to improve the traffic efficiency of urban arterial roads.
Traffic signal lamps at the intersections of existing urban arterial roads all operate according to preset programs: the switching time and duration of each signal phase are fixed and take no account of actual traffic conditions. Such fixed-mode traffic signal lamps therefore do little to improve urban road traffic efficiency.
Disclosure of Invention
In view of this, the application provides a traffic signal lamp control method, device and equipment based on deep reinforcement learning, which can solve the technical problem that the switching time and duration of current traffic signal lamps are fixed, take no account of actual traffic conditions, and thus hinder the traffic efficiency of urban roads.
According to one aspect of the present application, there is provided a traffic signal control method based on deep reinforcement learning, the method comprising:
acquiring intersection traffic data of a target traffic intersection, wherein the intersection traffic data comprise at least one of the vehicle moving speed, vehicle coordinate position, vehicle queue length, number of pedestrians, and current signal-lamp phase at the target traffic intersection;
performing feature conversion on the intersection traffic data to obtain an intersection feature vector, and inputting the intersection feature vector into a trained traffic signal lamp control model to obtain a first traffic control strategy for the target traffic intersection;
simulating the first traffic control strategy with a Monte Carlo tree search algorithm, and determining a second traffic control strategy for the target traffic intersection according to the simulation results;
and controlling the traffic signal lamps at the target traffic intersection according to the second traffic control strategy.
According to another aspect of the present application, there is provided a traffic signal control apparatus based on deep reinforcement learning, the apparatus comprising:
an acquisition module, configured to acquire intersection traffic data of a target traffic intersection, wherein the intersection traffic data comprise at least one of the vehicle moving speed, vehicle coordinate position, vehicle queue length, number of pedestrians, and current signal-lamp phase at the target traffic intersection;
an input module, configured to perform feature conversion on the intersection traffic data to obtain an intersection feature vector, and to input the intersection feature vector into a trained traffic signal lamp control model to obtain a first traffic control strategy for the target traffic intersection;
a determining module, configured to simulate the first traffic control strategy using a Monte Carlo tree search algorithm, and to determine a second traffic control strategy for the target traffic intersection according to the simulation results;
and a control module, configured to control the traffic signal lamps at the target traffic intersection according to the second traffic control strategy.
According to yet another aspect of the present application, there is provided a non-volatile readable storage medium having stored thereon a computer program which when executed by a processor implements the above-described deep reinforcement learning based traffic light control method.
According to still another aspect of the present application, there is provided a computer device including a non-volatile readable storage medium, a processor, and a computer program stored on the non-volatile readable storage medium and executable on the processor, the processor implementing the above-described traffic light control method based on deep reinforcement learning when executing the program.
By means of the above technical scheme, and in contrast to the current control mode of traffic signal lamps, the traffic signal lamp control method, device and equipment based on deep reinforcement learning first acquire intersection traffic data of a target traffic intersection, where the intersection traffic data comprise at least one of the vehicle moving speed, vehicle coordinate position, vehicle queue length, number of pedestrians, and current signal-lamp phase at the target traffic intersection. After feature conversion yields an intersection feature vector, this vector is input into a trained traffic signal lamp control model to obtain a first traffic control strategy for the target traffic intersection. The first traffic control strategy is then simulated with a Monte Carlo tree search algorithm, and a second traffic control strategy for the target traffic intersection is determined from the simulation results. Finally, the traffic signal lamps at the target traffic intersection are controlled according to the second traffic control strategy. Through this technical scheme, Monte Carlo tree search can be combined with a deep reinforcement learning algorithm for signal-lamp regulation; with this composite deep learning and reinforcement learning scheme, the traffic signal lamps can be intelligently and dynamically regulated according to real-time intersection traffic data, greatly improving vehicle throughput at intersections while alleviating urban congestion, saving energy, reducing emissions and promoting sustainable development.
The foregoing is merely an overview of the technical solutions of the present application. To make the technical means of the present application clearer and implementable according to the content of the specification, and to make the above and other objects, features and advantages of the present application more readily understood, the detailed description of the present application follows.
Drawings
The accompanying drawings, which provide a further understanding of the application and constitute a part of it, illustrate embodiments of the application and, together with the description, serve to explain it; they do not unduly limit the present application. In the drawings:
Fig. 1 shows a schematic flow chart of a traffic signal lamp control method based on deep reinforcement learning according to an embodiment of the present application;
Fig. 2 shows a schematic flow chart of another traffic signal lamp control method based on deep reinforcement learning according to an embodiment of the present application;
Fig. 3 shows a network architecture diagram for training the traffic signal lamp control model according to an embodiment of the present application;
Fig. 4 shows a schematic diagram of determining the first traffic control strategy according to an embodiment of the present application;
Fig. 5 shows a schematic structural diagram of a traffic signal lamp control device based on deep reinforcement learning according to an embodiment of the present application;
Fig. 6 shows a schematic structural diagram of another traffic signal lamp control device based on deep reinforcement learning according to an embodiment of the present application.
Detailed Description
The embodiments of the application can control traffic signal lamps based on artificial intelligence technology. Artificial intelligence (AI) is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results.
Artificial intelligence infrastructure technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big-data processing, operation/interaction systems and mechatronics. AI software technologies mainly comprise computer vision, robotics, biometric recognition, speech processing, natural language processing, and machine learning/deep learning.
The present application will be described in detail hereinafter with reference to the accompanying drawings in conjunction with embodiments. It should be noted that, without conflict, the embodiments and features of the embodiments in the present application may be combined with each other.
To address the technical problem that the switching time and duration of current traffic signal lamps are fixed, take no account of actual traffic conditions, and thus hinder urban road traffic efficiency, an embodiment of the application provides a traffic signal lamp control method based on deep reinforcement learning. As shown in fig. 1, the method comprises the following steps:
101. and acquiring intersection traffic data of the target traffic intersection.
The target traffic intersection is any traffic intersection whose traffic signal lamps are to be intelligently regulated. The intersection traffic data are real-time data of the target traffic intersection at the current moment and may comprise at least one of the vehicle moving speed, vehicle coordinate position, vehicle queue length, number of pedestrians, and current signal-lamp phase at the target traffic intersection.
The execution subject may be a traffic signal lamp control system, deployed on the client side or the server side, with a trained traffic signal lamp control model built in. When controlling the traffic signal lamps, the system can acquire intersection traffic data of the target traffic intersection in real time; after feature conversion of the intersection traffic data, the resulting intersection feature vector is input into the traffic signal lamp control model to obtain a first traffic control strategy for the target traffic intersection. The first traffic control strategy is then simulated with a Monte Carlo tree search algorithm, and a second traffic control strategy for the target traffic intersection is determined from the simulation results; finally, the traffic signal lamps at the target traffic intersection are controlled according to the second traffic control strategy.
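The overall control loop described above can be sketched as follows. This is a minimal illustration; all names (fetch_intersection_data, policy_model, mcts_refine, apply_signal_plan) are hypothetical placeholders, not the patent's actual API.

```python
# Minimal sketch of the control loop described above (steps 101 to 104).
# All names here (fetch_intersection_data, policy_model, mcts_refine,
# apply_signal_plan) are hypothetical placeholders, not the patent's API.

def to_feature_vector(raw):
    # Flatten speeds, positions, queue lengths, pedestrian count and the
    # current signal phase into a single numeric list (step 102's input).
    vec = []
    for key in ("speeds", "positions", "queue_lengths", "pedestrians", "phase"):
        value = raw.get(key, [])
        vec.extend(value if isinstance(value, list) else [value])
    return vec

def control_step(intersection_id, fetch_intersection_data, policy_model,
                 mcts_refine, apply_signal_plan):
    raw = fetch_intersection_data(intersection_id)       # 101: acquire data
    features = to_feature_vector(raw)                    # 102: feature conversion
    first_strategy = policy_model(features)              # 102: first strategy
    second_strategy = mcts_refine(first_strategy, raw)   # 103: MCTS refinement
    apply_signal_plan(intersection_id, second_strategy)  # 104: actuate lamps
    return second_strategy
```

In practice each collaborator would be a real component (sensor feed, trained model, search procedure, signal controller); here they are injected as callables so the loop's data flow stays visible.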
102. Perform feature conversion on the intersection traffic data to obtain an intersection feature vector, and input the intersection feature vector into the trained traffic signal lamp control model to obtain a first traffic control strategy for the target traffic intersection.
The traffic signal lamp control model is a deep reinforcement learning model. The deep reinforcement learning model comprises several strategy models and several evaluation models; each evaluation model determines an evaluation value for the action strategy output by its strategy model, and the model parameters of the deep reinforcement learning model are updated using these evaluation values.
In a specific application scenario, since the intersection traffic data may span multiple dimensions, feature conversion is required before inputting them into the trained traffic signal lamp control model, so that all intersection traffic data are converted into the same feature dimension. In addition, as an alternative, to improve the feature saturation of the intersection feature vector, the intersection traffic data may first be normalized to obtain normalized data covering only the important feature dimensions (uniqueness, integrity, validity, rationality, consistency and accuracy), thereby filtering out noise data that does not fit the service scene or may cause interference. The normalization processing may comprise data verification, cleaning and statistical analysis, by which important feature dimensions are extracted, erroneous data are deleted, and partially missing data are filled in.
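The verification, cleaning and gap-filling just described can be illustrated with a small sketch. The field name, speed bounds and fill rule below are assumptions for illustration only, not the patent's actual normalization pipeline.

```python
# Illustrative sketch of the normalization step described above:
# validate ranges, drop malformed records, fill short gaps.
# The field name, thresholds and fill rule are assumptions only.

def normalize_records(records, max_speed=40.0):
    cleaned = []
    last_valid = None
    for rec in records:
        speed = rec.get("speed")
        if speed is None:
            # integrity: fill a missing value from the previous valid record
            if last_valid is None:
                continue
            speed = last_valid
        # validity/rationality: drop physically implausible speeds
        if not (0.0 <= speed <= max_speed):
            continue
        last_valid = speed
        cleaned.append({**rec, "speed": speed})
    return cleaned
```

A real pipeline would also check uniqueness (duplicate detections of one vehicle) and consistency across sensors, but the same filter-and-fill structure applies.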
In a specific application scenario, as a preferred mode, the traffic signal lamp control model needs to be pre-trained before the steps of this embodiment are executed. Once the model is judged to be fully trained, the intersection feature vector can be input into it to obtain the first traffic control strategy for the target traffic intersection. The first traffic control strategy is the action decision that the traffic signal lamp control model derives preliminarily from the current indicators of the target traffic intersection.
In a specific application scenario, after the trained traffic signal lamp control model has determined the first traffic control strategy for the target traffic intersection, one option is to control the traffic signal lamps at the target traffic intersection directly according to the first traffic control strategy. To ensure effective regulation of the traffic signal lamps, however, another option is to further execute steps 103 and 104 of this embodiment after determining the first traffic control strategy: determine a better, second traffic control strategy from the first, and use the second traffic control strategy to control the traffic signal lamps at the target traffic intersection. In the present application, controlling the target traffic intersection with the second traffic control strategy is the preferred technical solution, but the technical solution of the present application is not limited thereto.
103. Simulate the first traffic control strategy using a Monte Carlo tree search algorithm, and determine a second traffic control strategy for the target traffic intersection according to the simulation results.
In the simulation environment, the current intersection situation can be digitally twinned, and the signal lamps of each intersection perform multi-step strategy simulation according to the UCB (upper confidence bound) algorithm. After model training is finished, the possible follow-up consequences of candidate strategies can be simulated during algorithm execution, so that better strategies can be obtained even for extreme situations not covered during deep reinforcement learning training. Therefore, in this embodiment, to ensure that the finally determined traffic control strategy is optimal, after the first traffic control strategy of the target traffic intersection is determined, it can be further simulated with a Monte Carlo tree search algorithm, and the optimal strategy for the target traffic intersection, that is, the second traffic control strategy, is determined from the simulation results. In addition, for extreme intersection situations not covered during model training, the current situation can be simulated in real time through Monte Carlo tree search, and the strategy updated from the simulation results handles the various situations better. Owing to the efficiency of the algorithm, real-time policy decisions can be made in a production system.
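The UCB-driven simulation step can be sketched as a one-level Monte Carlo search over candidate strategies using the standard UCB1 score. This is a minimal illustration; the `simulate` callable stands in for the digital-twin environment mentioned above and is an assumption, not the patent's simulator.

```python
import math

# Minimal UCB1-based strategy selection, the core of the Monte Carlo tree
# search step described above (a one-level search rather than a full tree).
# `simulate` stands in for the digital-twin rollout and is an assumption.

def ucb1(total_reward, visits, parent_visits, c=1.4):
    if visits == 0:
        return float("inf")  # force every strategy to be tried at least once
    return total_reward / visits + c * math.sqrt(math.log(parent_visits) / visits)

def mcts_choose(strategies, simulate, n_iters=200):
    stats = {s: [0.0, 0] for s in strategies}  # [reward sum, visit count]
    for t in range(1, n_iters + 1):
        # selection: strategy with the highest upper confidence bound
        best = max(strategies, key=lambda s: ucb1(stats[s][0], stats[s][1], t))
        reward = simulate(best)        # rollout in the simulated intersection
        stats[best][0] += reward
        stats[best][1] += 1
    # final decision: the most-visited strategy (the robust choice)
    return max(strategies, key=lambda s: stats[s][1])
```

A full implementation would expand a tree of multi-step phase sequences rather than a flat set of candidates, but the select-simulate-update loop is the same.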
104. Control the traffic signal lamps at the target traffic intersection according to the second traffic control strategy.
Through the traffic signal lamp control method based on deep reinforcement learning in this embodiment, intersection traffic data of a target traffic intersection can first be acquired, where the intersection traffic data comprise at least one of the vehicle moving speed, vehicle coordinate position, vehicle queue length, number of pedestrians, and current signal-lamp phase at the target traffic intersection. After the intersection feature vector is obtained, it can be input into the trained traffic signal lamp control model to obtain a first traffic control strategy for the target traffic intersection. The first traffic control strategy is then simulated with a Monte Carlo tree search algorithm, and a second traffic control strategy for the target traffic intersection is determined from the simulation results; finally, the traffic signal lamps at the target traffic intersection are controlled according to the second traffic control strategy. Through this technical scheme, Monte Carlo tree search can be combined with the deep reinforcement learning algorithm for signal-lamp regulation; with this composite deep learning and reinforcement learning scheme, the traffic signal lamps can be intelligently and dynamically regulated according to real-time intersection traffic data, greatly improving vehicle throughput at intersections while alleviating urban congestion, saving energy, reducing emissions and promoting sustainable development.
Further, as a refinement and extension of the specific implementation of the foregoing embodiment, and to fully describe the specific implementation process, this embodiment provides another traffic signal control method based on deep reinforcement learning. As shown in fig. 2, the method comprises:
201. the method comprises the steps of obtaining intersection traffic data of a target traffic intersection, wherein the intersection traffic data comprise at least one of intersection vehicle moving speed, intersection vehicle coordinate positions, intersection vehicle queue length, intersection pedestrian number and signal lamp current stage of the target traffic intersection.
202. Acquire global intersection information in a target area, input the global intersection information into a deep reinforcement learning model, determine predicted traffic control strategies for the different traffic intersections using the several strategy models in the deep reinforcement learning model, and determine evaluation values of the predicted traffic control strategies using the evaluation models.
The target area is the region whose traffic signal lamps are to be globally controlled. It may be a province, a city, a county, or an area centred on a preset point with a preset radius, which is not specifically limited here. The target area may contain several traffic intersections, and the global intersection information comprises historical intersection traffic data for these intersections and the historical traffic control strategies of each intersection's neighbouring intersections. The deep reinforcement learning model comprises several strategy models, each paired with an evaluation model; the evaluation model determines an evaluation value for the action strategy output by its strategy model, and the model parameters of the deep reinforcement learning model are updated using these evaluation values.
In this embodiment, following the deep reinforcement learning principle, each traffic intersection in the target area is regulated by its own strategy model; city-wide signal regulation may involve hundreds of strategy models. Regulating the signal lamps of one intersection affects the traffic conditions of surrounding intersections, which in turn affects the control behaviour and efficiency of the surrounding traffic lights, so a globally optimal regulation algorithm can only be obtained by coordinating the strategy models. By changing how each strategy model's evaluation model is constructed, that is, by widening its perception range, the strategy learned by each model can be judged more reasonably: instead of taking only the congestion at the current intersection as the evaluation index of the current traffic lights, the congestion at surrounding intersections is also treated as an indirect consequence of the current action strategy. With this change in algorithm design, all the strategy models form one cooperating team, and a globally optimal algorithm strategy for city-wide signal regulation is finally obtained.
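The widened perception range can be illustrated with a reward that mixes an intersection's own congestion with that of its neighbours. The 0.7/0.3 weighting and the use of raw queue lengths are illustrative assumptions, not values from the patent.

```python
# Sketch of the widened evaluation described above: an intersection's
# reward depends on its own congestion and on that of its neighbours.
# The 0.7 local / 0.3 neighbour weighting is an illustrative assumption.

def cooperative_reward(queue_lengths, intersection, neighbours,
                       local_weight=0.7):
    # Lower queues mean better flow, so reward is the negated queue length.
    local = -queue_lengths[intersection]
    if neighbours:
        neighbour = -sum(queue_lengths[n] for n in neighbours) / len(neighbours)
    else:
        neighbour = 0.0
    return local_weight * local + (1.0 - local_weight) * neighbour
```

With this shape, an action that clears the local queue by pushing a platoon into an already congested neighbour scores worse than one that balances the two, which is exactly the cooperative effect the text describes.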
203. Update the network parameters of each strategy model's value function according to the evaluation value of its predicted traffic control strategy until convergence is reached. If all strategy models in the deep reinforcement learning model are judged to have converged, the deep reinforcement learning model is judged to be fully trained, and the trained deep reinforcement learning model is defined as the traffic signal lamp control model.
In the deep reinforcement learning formulation, π_i denotes the policy function of agent i; a_i denotes the agent's action, specifically the choice of signal phase at the intersection (e.g. north-south straight, left turn, etc.); s_i denotes the state, mainly the queue length, waiting time, vehicle speeds and positions, the current signal state, and the number of pedestrians waiting at the lights; r_i denotes the reward feedback, specifically the queue length and waiting time of vehicles at the intersection a time Δt after the current decision; and Q_i denotes the Q-function of each agent, which takes the states and actions of the other agents as part of its input, with Q_i(s_i, a_i) being the Q-value obtained when the agent takes its action.
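The update that this notation implies can be written as a standard one-step temporal-difference target on the centralized critic. The step size α and discount factor γ below are standard additions not named in the text, so this is a hedged reconstruction rather than the patent's stated formula:

```latex
% One-step temporal-difference update for agent i's centralized critic,
% consistent with the symbols above; \alpha (step size) and \gamma
% (discount factor) are standard assumptions not named in the text.
Q_i(\mathbf{s}, a_1, \dots, a_N) \leftarrow Q_i(\mathbf{s}, a_1, \dots, a_N)
  + \alpha \Bigl[ r_i + \gamma \, Q_i\bigl(\mathbf{s}', a_1', \dots, a_N'\bigr)
      \big|_{a_j' = \pi_j(s_j')} - Q_i(\mathbf{s}, a_1, \dots, a_N) \Bigr]
```

Here **s** collects the states of all agents, matching the statement that each Q_i takes the other agents' states and actions as part of its input.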
During training of the deep reinforcement learning model, the strategy model determines a corresponding action strategy from the state of the current scene, obtains an evaluation value from the evaluation model, and transitions to the next state. From the evaluation value, the strategy model judges how good the chosen action was and updates the network parameters of the value function; it then judges the next action by the evaluation value obtained from the next state, and so on in a loop until training ends, yielding a better value-function network. As shown in Fig. 3, Actor 1 … N are the strategy models: when the target area contains N traffic intersections, the deep reinforcement learning model contains at least N Actors, each regulating the traffic signals of one intersection. The global intersection information (Observation) in the target area is input into every strategy model Actor, and each Actor determines its current action strategy (Action) from the Observation. The Observation, the current Action, and the Actions determined by the other strategy models are then input into the evaluation model Critic 1 … N to obtain an evaluation value (Reward) for the current action strategy. The Reward is fed back to the strategy model Actor, and the network parameters of the strategy model's value function are updated according to the evaluation value until a convergence state is reached.
In the iterative training process, parameter updating can be performed according to a gradient descent method, finally converging to a locally optimal solution. Through centralized training and distributed execution, the strategy models avoid mutual communication at execution time, where a_i = π_i(s_i).
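A minimal numerical sketch of this centralized-training / distributed-execution scheme follows, with linear actors and critics. The dimensions, learning rate, and the linear parameterization are all illustrative assumptions, not the patent's actual networks; the point is only that each actor acts on its own state while each critic sees every agent's state and action during training:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 3          # number of intersections / agents (illustrative)
S, A = 4, 2    # per-agent state and action dimensions (illustrative)

# Decentralized actors: each pi_i maps ONLY its own state s_i to an action a_i.
actors = [rng.normal(scale=0.1, size=(A, S)) for _ in range(N)]
# Centralized critics: each Q_i sees ALL agents' states and actions in training.
critics = [rng.normal(scale=0.1, size=N * (S + A)) for _ in range(N)]

def act(i, s_i):
    """Execution time: a_i = pi_i(s_i), with no communication between agents."""
    return actors[i] @ s_i

def q_value(i, states, actions):
    """Training time: Q_i(s_1..s_N, a_1..a_N) as a linear function (sketch)."""
    return critics[i] @ np.concatenate([*states, *actions])

def critic_step(i, states, actions, target, lr=1e-2):
    """One gradient-descent step on 0.5 * (Q_i - target)^2 for critic i."""
    joint = np.concatenate([*states, *actions])
    err = q_value(i, states, actions) - target
    critics[i] -= lr * err * joint  # gradient of the squared error w.r.t. weights
    return err
```

Repeated `critic_step` calls drive the critic's error toward zero; at execution time only `act` is needed, which is what removes the inter-intersection communication requirement.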
204. Performing feature conversion processing on the intersection traffic data to obtain an intersection feature vector, and inputting the intersection feature vector into the trained traffic signal lamp control model to obtain a first traffic control strategy of the target traffic intersection.
In a specific application scenario, since the intersection traffic data may include multiple dimensions, before the intersection traffic data is input into the trained traffic signal lamp control model, feature conversion processing is further required so that all the intersection traffic data are converted into the same feature dimension. As shown in FIG. 4, for the vehicle moving speed V and the vehicle coordinate position P of the target traffic intersection, multiple rounds of feature extraction with an activation function (Convolution + ReLU) may be performed; for the vehicle queue length D, at least one round of feature extraction with an activation function (Convolution + ReLU) may be performed; and for the number of pedestrians M and the current signal lamp stage L, feature extraction may be performed through a fully connected layer (Fully Connected). Further, feature fusion processing (Concatenation) can be performed on the converted features to obtain the intersection feature vector, which is then input into the strategy model corresponding to the target traffic intersection among Agent 1…N in the traffic signal lamp control model to determine the first traffic control strategy Action.
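The feature conversion and fusion pipeline can be sketched as follows. The 1-D convolutions, kernel size, layer widths, and input shapes are all invented for illustration (the patent's FIG. 4 does not specify them); only the structure — Convolution+ReLU for V, P, and D, a fully connected layer for M and L, then concatenation — follows the text:

```python
import numpy as np

def conv_relu(x, kernel):
    """One Convolution + ReLU feature-extraction step (1-D sketch)."""
    return np.maximum(np.convolve(x, kernel, mode="valid"), 0.0)

def fully_connected(x, w, b):
    """Fully connected layer: affine transform of the scalar inputs."""
    return w @ x + b

rng = np.random.default_rng(1)
V = rng.uniform(0, 15, size=32)   # vehicle moving speeds (m/s, illustrative)
P = rng.uniform(0, 200, size=32)  # vehicle coordinate positions (m)
D = rng.uniform(0, 20, size=8)    # per-lane vehicle queue lengths
M = np.array([5.0])               # number of pedestrians at the intersection
L = np.array([2.0])               # current signal lamp stage index

k = rng.normal(size=3)            # shared toy convolution kernel
# Multiple Convolution+ReLU passes for speed V and position P ...
fV = conv_relu(conv_relu(V, k), k)
fP = conv_relu(conv_relu(P, k), k)
# ... at least one pass for the queue length D ...
fD = conv_relu(D, k)
# ... and a fully connected layer for the scalar inputs M and L.
w, b = rng.normal(size=(4, 2)), np.zeros(4)
fML = fully_connected(np.concatenate([M, L]), w, b)

# Feature fusion (Concatenation) yields the intersection feature vector.
feature_vector = np.concatenate([fV, fP, fD, fML])
```

The resulting `feature_vector` is what would be fed to the strategy model in place of the raw, heterogeneous intersection data.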
For this embodiment, when the intersection feature vector is input into the trained traffic signal lamp control model to obtain the first traffic control policy of the target traffic intersection, step 204 of the embodiment may specifically include: and inputting the intersection characteristic vector into the traffic signal lamp control model after training, and generating a first traffic control strategy of the target traffic intersection by using any strategy model in the traffic signal lamp control model.
205. And performing simulation on the first traffic control strategy by using a Monte Carlo tree search algorithm, and determining a second traffic control strategy of the target traffic intersection according to simulation results.
In a specific application scenario, after the first traffic control strategy of the target traffic intersection is determined by the trained traffic signal lamp control model, a real-time Monte Carlo tree search can be performed: the situation of each intersection is reconstructed in a simulation environment consisting mainly of vehicle states and signal lamp states, a reasonable feedback reward is obtained by simulating signal lamp decisions in that environment, and the final strategy of each intersection's traffic light agent is determined from the simulation results together with the decision of the trained model. The process is divided into four steps:
1) Action selection: selecting leaf nodes in the tree according to the UCB algorithm.
2) Tree expansion: randomly selecting an unvisited child node of the leaf node.
3) Simulation: carrying out a subsequent simulation according to the selected actions and collecting the reward data.
4) Backpropagation: propagating the simulation result back through the traversed tree nodes, updating the corresponding values and recording them as the basis of the final decision.
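The four steps above can be sketched as a generic UCB-based Monte Carlo tree search. The `step` and `rollout` callables stand in for the intersection simulation environment (vehicle states, signal lamp states, feedback reward), which the patent does not specify in detail; the exploration constant and rollout depth are assumptions:

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}       # action -> child Node
        self.visits = 0          # times this node lay on a backpropagation path
        self.total_reward = 0.0  # sum of simulation results through this node

    def ucb(self, c=1.4):
        """UCB score used in step 1: mean reward plus an exploration bonus."""
        if self.visits == 0:
            return float("inf")
        exploit = self.total_reward / self.visits
        explore = c * math.sqrt(math.log(self.parent.visits) / self.visits)
        return exploit + explore

def mcts_iteration(root, actions, step, rollout, depth=10):
    # 1) Action selection: descend by UCB until a node with untried actions.
    node = root
    while node.children and len(node.children) == len(actions):
        node = max(node.children.values(), key=Node.ucb)
    # 2) Tree expansion: randomly add one unvisited child of the leaf node.
    untried = [a for a in actions if a not in node.children]
    if untried:
        a = random.choice(untried)
        node.children[a] = Node(step(node.state, a), parent=node)
        node = node.children[a]
    # 3) Simulation: roll out random actions and collect the reward data.
    state, total = node.state, 0.0
    for _ in range(depth):
        state = step(state, random.choice(actions))
        total += rollout(state)
    # 4) Backpropagation: update values along the path back to the root.
    while node is not None:
        node.visits += 1
        node.total_reward += total
        node = node.parent
```

Each call performs one complete selection-expansion-simulation-backpropagation cycle; repeating it grows the tree and refines the per-action statistics at the root.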
Accordingly, for the present embodiment, step 205 may specifically include: converting the first traffic control strategy into a tree search and expanding the tree structure; selecting leaf nodes in the expanded tree structure according to the UCB algorithm, and randomly selecting unvisited child nodes of the leaf nodes; performing simulation from the unvisited child nodes, backpropagating through the tree nodes traversed during the simulation according to the simulation result, and updating the total simulated reward value and the visit count of those tree nodes; and determining a second traffic control strategy of the target traffic intersection according to the total simulated reward value and the visit count of the tree nodes.
The total simulated reward value is an attribute of a tree node; in its simplest form it is the sum of the simulation results passing through the tree node under consideration. The visit count is another attribute of the tree node, representing the number of times the node lies on a backpropagation path (and, at the same time, how often it contributes to the total simulated reward). Together, the total simulated reward value and the visit count of a tree node reflect both its potential value and the degree to which it has been explored. Tree nodes with a higher total simulated reward may be good candidates, but nodes with a low visit count may also be very valuable to visit (because they have not yet been well explored). Therefore, when determining the second traffic control strategy of the target traffic intersection according to the total simulated reward value and the visit count of the tree nodes, a specific strategy path may be determined based on tree nodes with a higher total simulated reward value and a lower visit count.
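One way to combine the two attributes into a final decision is sketched below. Dividing total reward by visit count naturally favors nodes whose high total reward was earned with few visits; this mean-value rule is one common convention (picking the most-visited child is another), and the action names are invented for illustration:

```python
def best_action(children):
    """Final decision from the root's children, given per-action statistics
    children[a] = (total_simulated_reward, visit_count). The mean reward
    (total / visits) rewards a high total achieved with a low visit count."""
    return max(children, key=lambda a: children[a][0] / max(children[a][1], 1))

# Hypothetical root statistics after a search:
stats = {"extend_green": (9.0, 3), "switch_stage": (4.0, 4)}
```

Here "extend_green" wins (mean 3.0) despite fewer visits than "switch_stage" (mean 1.0).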
In a specific application scenario, the process of simulating the first traffic control strategy with the Monte Carlo tree search algorithm can be repeated until the system needs to give a decision, i.e., before the end of the current green light (before the yellow light), at which point the Monte Carlo tree search ends its simulation and outputs the best strategy found according to the simulation results. Subject to the available hardware computing power, the more simulations the repeated search runs, the better the traffic light strategy that can theoretically be obtained; this has been verified against the operating results of the actual algorithm, although the improvement becomes insignificant once the number of Monte Carlo tree searches reaches a certain threshold, i.e., a saturation value is reached. Accordingly, before determining the second traffic control strategy of the target traffic intersection according to the total simulated reward value and the visit count of the tree nodes, the embodiment steps further include: repeatedly determining a first traffic control strategy and performing simulation on it until a preset simulation ending condition is judged to be satisfied, thereby obtaining the total simulated reward value and the visit count of the tree nodes, wherein the preset simulation ending condition is that the determination process of the second traffic control strategy reaches the maximum decision time length, and/or that the number of times the first traffic control strategy has been verified by the Monte Carlo tree search algorithm reaches the maximum number of search simulations. The determining process of the first traffic control strategy includes: updating the model parameters of the traffic signal lamp control model, and re-determining the first traffic control strategy of the target traffic intersection by using the traffic signal lamp control model with the updated model parameters.
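The two ending conditions — a maximum decision time length and a maximum search simulation count — can be sketched as a simple budget loop. Both threshold values below are illustrative, and `run_one_mcts_iteration` stands in for a single search iteration:

```python
import time

def plan(run_one_mcts_iteration, max_seconds=0.5, max_iterations=10_000):
    """Repeat MCTS iterations until either budget is exhausted: the time
    remaining before the decision must be issued (e.g. before the yellow
    light begins), or a simulation-count cap past which improvement is
    observed to saturate. Returns the number of iterations performed."""
    deadline = time.monotonic() + max_seconds
    iterations = 0
    while iterations < max_iterations and time.monotonic() < deadline:
        run_one_mcts_iteration()
        iterations += 1
    return iterations
```

Whichever limit fires first ends the simulation, after which the searched statistics are used to issue the second traffic control strategy.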
206. Generating a control instruction for the traffic signal lamp of the target traffic intersection at the next moment according to the second traffic control strategy, and executing control of the traffic signal lamp based on the control instruction.
In a specific application scenario, as a preferred manner, in order to realize city-level cooperative regulation of a large number of traffic lights, the embodiment steps may further include: receiving an updated traffic control strategy of any traffic intersection adjacent to the target traffic intersection in the target area, and updating the second traffic control strategy of the target traffic intersection according to the updated traffic control strategy.
In the existing scheme, the KS-DDPG algorithm can be adopted to realize cooperative regulation of urban traffic lights. This algorithm uses centralized training and distributed execution, so that at execution time each signal lamp strategy model can make decisions based only on the data of its own intersection without large-scale data transmission. The problem is that the algorithm introduces a knowledge container that every signal lamp strategy model must access and modify, which involves a large amount of information transmission as the number of signal lamps increases, and it cannot give good signal lamp decisions for extreme intersection conditions not encountered during training. In view of this, in the present application, with respect to the communication problem between signal lamp strategy models, global intersection information is needed only during training, to train the evaluation model Critic; at execution time, each signal lamp strategy model only needs to make action decisions according to the data of its own intersection, and the remaining computing operations can be placed in the cloud. For some extreme intersection situations not covered during model training, the current intersection situation can be simulated in real time through Monte Carlo tree search, and the strategy updated according to the simulation results can better handle various intersection situations. Due to the high efficiency of the algorithm, real-time strategy decisions can be made in an actual system.
By means of the traffic signal lamp control method based on deep reinforcement learning, intersection traffic data of a target traffic intersection can first be obtained, where the intersection traffic data includes at least one of the vehicle moving speed, vehicle coordinate position, vehicle queue length, number of pedestrians, and current signal lamp stage of the target traffic intersection; after the intersection feature vector is obtained, it is input into the trained traffic signal lamp control model to obtain a first traffic control strategy of the target traffic intersection; the first traffic control strategy is then simulated using a Monte Carlo tree search algorithm, and a second traffic control strategy of the target traffic intersection is determined according to the simulation results; finally, traffic signal lamp control is executed on the target traffic intersection according to the second traffic control strategy. Through the technical scheme of this application, Monte Carlo tree search can be applied to the deep reinforcement learning algorithm for signal lamp regulation. Through this combined computing scheme of deep learning and reinforcement learning, intelligent strategy regulation of traffic signal lamps can be realized dynamically according to real-time intersection traffic data, which can greatly improve the passing efficiency of vehicles at intersections, and can also save energy and reduce emissions to promote sustainable development while alleviating urban congestion. In addition, the scheme does not require mutual communication between signal lamps, which reduces the cost and complexity of the hardware architecture and the communication cost when the system operates.
For some extreme intersection situations not covered during model training, the current intersection situation can be simulated in real time through Monte Carlo tree search, and the strategy updated according to the simulation results can better handle various intersection situations. Due to the high efficiency of the algorithm, real-time strategy decisions can be made in an actual system.
Further, as a specific implementation of the method shown in fig. 1 and fig. 2, an embodiment of the present application provides a traffic signal lamp control device based on deep reinforcement learning, as shown in fig. 5, the device includes: the device comprises an acquisition module 31, an input module 32, a determination module 33 and a control module 34;
the acquiring module 31 is configured to acquire intersection traffic data of a target traffic intersection, where the intersection traffic data includes at least one of an intersection vehicle moving speed, an intersection vehicle coordinate position, an intersection vehicle queue length, an intersection pedestrian number, and a signal lamp current stage;
the input module 32 is configured to perform feature conversion processing on the intersection traffic data, obtain an intersection feature vector, and input the intersection feature vector into the trained traffic signal lamp control model to obtain a first traffic control policy of the target traffic intersection;
the determining module 33 may be configured to perform simulation on the first traffic control policy by using a monte carlo tree search algorithm, and determine a second traffic control policy of the target traffic intersection according to a simulation result;
the control module 34 may be configured to perform traffic light control on the target traffic intersection according to the second traffic control policy.
In a specific application scenario, the traffic signal lamp control model is a deep reinforcement learning model; the deep reinforcement learning model includes a plurality of strategy models and a plurality of evaluation models, where the evaluation models are used to determine evaluation values of the action strategies output by the strategy models, and the model parameters of the deep reinforcement learning model are updated with the evaluation values. To achieve pre-training of the traffic signal lamp control model, as shown in FIG. 6, the apparatus further includes: a training module 35;
the training module 35 is configured to obtain global intersection information in the target area, where the global intersection information includes historical intersection traffic data of a plurality of traffic intersections and historical traffic control policies of adjacent traffic intersections corresponding to the traffic intersections; inputting global intersection information into a deep reinforcement learning model, respectively determining predicted traffic control strategies of different traffic intersections by using a plurality of strategy models in the deep reinforcement learning model, and determining evaluation values of the predicted traffic control strategies by using an evaluation model; updating network parameters of a corresponding value function of the strategy model according to the evaluation value of the predicted traffic control strategy until reaching a convergence state; if all strategy models in the deep reinforcement learning model are judged to reach a convergence state, the deep reinforcement learning model is judged to be trained, and the trained deep reinforcement learning model is defined as a traffic signal lamp control model.
In a specific application scenario, when determining the first traffic control strategy of the target traffic intersection by using the trained traffic signal lamp control model, the input module 32 may be specifically configured to input the intersection feature vector into the trained traffic signal lamp control model, and generate the first traffic control strategy of the target traffic intersection by using any one of the traffic signal lamp control models.
In a specific application scenario, when the first traffic control strategy is simulated by using the Monte Carlo tree search algorithm and the second traffic control strategy of the target traffic intersection is determined according to the simulation result, the determining module 33 is specifically configured to convert the first traffic control strategy into a tree search and expand the tree structure; select leaf nodes in the expanded tree structure according to the UCB algorithm, and randomly select unvisited child nodes of the leaf nodes; perform simulation from the unvisited child nodes, backpropagate through the tree nodes traversed during the simulation according to the simulation result, and update the total simulated reward value and visit count of the tree nodes; and determine the second traffic control strategy of the target traffic intersection according to the total simulated reward value and visit count of the tree nodes.
In a specific application scenario, before the second traffic control strategy of the target traffic intersection is determined according to the total simulated reward value and visit count of the tree nodes, the first traffic control strategy is repeatedly determined and simulated until a preset simulation ending condition is judged to be satisfied, and the total simulated reward value and visit count of the tree nodes are obtained, so that the second traffic control strategy of the target traffic intersection is generated based on the finally determined total simulated reward value and visit count of the tree nodes. The preset simulation ending condition is that the determination process of the second traffic control strategy reaches the maximum decision time length, and/or that the number of times the first traffic control strategy has been verified by the Monte Carlo tree search algorithm reaches the maximum number of search simulations. The determining process of the first traffic control strategy includes: updating the model parameters of the traffic signal lamp control model, and re-determining the first traffic control strategy of the target traffic intersection by using the traffic signal lamp control model with the updated model parameters.
In a specific application scenario, when the traffic light control of the target traffic intersection is performed according to the second traffic control policy, the control module 34 is specifically configured to generate a control instruction of the target traffic intersection for the traffic light at the next moment according to the second traffic control policy, and perform the control of the traffic light based on the control instruction.
In a specific application scenario, in order to realize city-level cooperative regulation of a large number of traffic lights, as shown in FIG. 6, the device further includes: an update module 36;
the updating module 36 is configured to receive an updated traffic control strategy of any traffic intersection adjacent to the target traffic intersection in the target area, and to update the second traffic control strategy of the target traffic intersection according to the updated traffic control strategy.
It should be noted that, other corresponding descriptions of each functional unit related to the traffic signal lamp control device based on deep reinforcement learning provided in this embodiment may refer to corresponding descriptions of fig. 1 to 2, and are not repeated here.
Based on the above-described methods shown in fig. 1 to 2, correspondingly, the present embodiment further provides a nonvolatile storage medium, on which computer readable instructions are stored, which when executed by a processor, implement the above-described traffic signal control method based on deep reinforcement learning shown in fig. 1 to 2.
Based on such understanding, the technical solution of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.), and includes several instructions for causing a computer device (may be a personal computer, a server, or a network device, etc.) to perform the method of each implementation scenario of the present application.
Based on the method shown in fig. 1 to 2 and the virtual device embodiments shown in fig. 5 and 6, in order to achieve the above object, the present embodiment further provides a computer device, where the computer device includes a storage medium and a processor; a nonvolatile storage medium storing a computer program; a processor for executing a computer program to implement the above-described deep reinforcement learning-based traffic light control method as shown in fig. 1 to 2.
Optionally, the computer device may also include a user interface, a network interface, a camera, Radio Frequency (RF) circuitry, sensors, audio circuitry, Wi-Fi modules, and the like. The user interface may include a Display screen (Display), an input unit such as a Keyboard (Keyboard), etc., and the optional user interface may also include a USB interface, a card reader interface, etc. The network interface may optionally include a standard wired interface, a wireless interface (e.g., a Wi-Fi interface), etc.
It will be appreciated by those skilled in the art that the architecture of a computer device provided in this embodiment is not limited to this physical device, but may include more or fewer components, or may be combined with certain components, or may be arranged in a different arrangement of components.
The nonvolatile storage medium may also include an operating system, network communication modules. An operating system is a program that manages the computer device hardware and software resources described above, supporting the execution of information handling programs and other software and/or programs. The network communication module is used for realizing communication among all components in the nonvolatile storage medium and communication with other hardware and software in the information processing entity equipment.
From the above description of the embodiments, it will be apparent to those skilled in the art that the present application may be implemented by means of software plus necessary general hardware platforms, or may be implemented by hardware.
By applying the technical scheme, compared with the prior art, the method and the device can obtain intersection traffic data of a target traffic intersection, where the intersection traffic data includes at least one of the vehicle moving speed, vehicle coordinate position, vehicle queue length, number of pedestrians, and current signal lamp stage of the target traffic intersection; after the intersection feature vector is obtained, it is input into the trained traffic signal lamp control model to obtain a first traffic control strategy of the target traffic intersection; the first traffic control strategy is then simulated using a Monte Carlo tree search algorithm, and a second traffic control strategy of the target traffic intersection is determined according to the simulation results; finally, traffic signal lamp control is executed on the target traffic intersection according to the second traffic control strategy. Through the technical scheme of this application, Monte Carlo tree search can be applied to the deep reinforcement learning algorithm for signal lamp regulation. Through this combined computing scheme of deep learning and reinforcement learning, intelligent strategy regulation of traffic signal lamps can be realized dynamically according to real-time intersection traffic data, which can greatly improve the passing efficiency of vehicles at intersections, and can also save energy and reduce emissions to promote sustainable development while alleviating urban congestion. In addition, the scheme does not require mutual communication between signal lamps, which reduces the cost and complexity of the hardware architecture and the communication cost when the system operates.
For some extreme intersection situations not covered during model training, the current intersection situation can be simulated in real time through Monte Carlo tree search, and the strategy updated according to the simulation results can better handle various intersection situations. Due to the high efficiency of the algorithm, real-time strategy decisions can be made in an actual system.
Those skilled in the art will appreciate that the drawings are merely schematic illustrations of one preferred implementation scenario, and that the modules or flows in the drawings are not necessarily required to practice the present application. Those skilled in the art will appreciate that modules in an apparatus in an implementation scenario may be distributed in an apparatus in an implementation scenario according to an implementation scenario description, or that corresponding changes may be located in one or more apparatuses different from the implementation scenario. The modules of the implementation scenario may be combined into one module, or may be further split into a plurality of sub-modules.
The foregoing application serial numbers are merely for description, and do not represent advantages or disadvantages of the implementation scenario. The foregoing disclosure is merely a few specific implementations of the present application, but the present application is not limited thereto and any variations that can be considered by a person skilled in the art shall fall within the protection scope of the present application.

Claims (10)

1. The traffic signal lamp control method based on deep reinforcement learning is characterized by comprising the following steps of:
acquiring intersection traffic data of a target traffic intersection, wherein the intersection traffic data comprises at least one of intersection vehicle moving speed, intersection vehicle coordinate position, intersection vehicle queue length, intersection pedestrian number and signal lamp current stage of the target traffic intersection;
Performing feature conversion processing on the intersection traffic data to obtain an intersection feature vector, and inputting the intersection feature vector into a trained traffic signal lamp control model to obtain a first traffic control strategy of the target traffic intersection, wherein the traffic signal lamp control model is a deep reinforcement learning model, the deep reinforcement learning model comprises a plurality of strategy models and a plurality of evaluation models, the evaluation models are used for determining evaluation values of action strategies output by the strategy models, and model parameters of the deep reinforcement learning model are updated with the evaluation values;
performing simulation on the first traffic control strategy by using a Monte Carlo tree search algorithm, and determining a second traffic control strategy of the target traffic intersection according to simulation results;
and executing the traffic signal lamp control on the target traffic crossing according to the second traffic control strategy.
2. The method of claim 1, wherein the traffic light control model is a deep reinforcement learning model comprising a plurality of strategy models and a plurality of assessment models, the assessment models being used to determine an assessment value of an action strategy output by the strategy models, and to update model parameters of the deep reinforcement learning model with the assessment value;
Before inputting the intersection feature vector into the trained traffic signal lamp control model to obtain the first traffic control strategy of the target traffic intersection, the method further comprises the following steps:
acquiring global intersection information in a target area, wherein the global intersection information comprises historical intersection traffic data of a plurality of traffic intersections and historical traffic control strategies of the corresponding adjacent traffic intersections of each traffic intersection;
inputting the global intersection information into the deep reinforcement learning model, respectively determining predicted traffic control strategies of different traffic intersections by utilizing a plurality of strategy models in the deep reinforcement learning model, and determining evaluation values of the predicted traffic control strategies by utilizing the evaluation model;
updating network parameters of the corresponding value function of the strategy model according to the evaluation value of the predicted traffic control strategy until reaching a convergence state;
and if the strategy models in the deep reinforcement learning model are judged to be all in a convergence state, judging that the training of the deep reinforcement learning model is completed, and defining the trained deep reinforcement learning model as a traffic signal lamp control model.
3. The method of claim 2, wherein inputting the intersection feature vector into a trained traffic light control model to obtain a first traffic control strategy for the target traffic intersection comprises:
Inputting the intersection feature vector into a traffic signal lamp control model which is trained, and generating a first traffic control strategy of the target traffic intersection by using any strategy model in the traffic signal lamp control model.
4. The method of claim 1, wherein the performing a simulation of the first traffic control strategy using a monte carlo tree search algorithm, and determining the second traffic control strategy for the target traffic intersection based on the simulation result, comprises:
converting the first traffic control strategy into tree search and expanding a tree structure;
selecting leaf nodes in the expanded tree structure according to the UCB algorithm, and randomly selecting unvisited child nodes of the leaf nodes;
performing simulation from the unvisited child nodes, backpropagating through the tree nodes traversed during the simulation according to the simulation result, and updating the total simulated reward value and visit count of the tree nodes;
and determining a second traffic control strategy of the target traffic intersection according to the total simulated reward value and visit count of the tree nodes.
5. The method of claim 4, further comprising, prior to said determining a second traffic control strategy of said target traffic intersection according to the total simulated reward value and visit count of said tree nodes:
repeatedly determining the first traffic control strategy and performing simulation on the first traffic control strategy until a preset simulation ending condition is judged to be satisfied, and obtaining the total simulated reward value and visit count of the tree nodes, wherein the preset simulation ending condition is that the determination process of the second traffic control strategy reaches the maximum decision time length, and/or that the number of times the first traffic control strategy has been verified by the Monte Carlo tree search algorithm reaches the maximum number of search simulations;
the determining process of the first traffic control strategy comprises the following steps: and updating the model parameters of the traffic signal lamp control model, and re-determining the first traffic control strategy of the target traffic intersection by using the traffic signal lamp control model after updating the model parameters.
6. The method of claim 1, wherein the performing of traffic signal lamp control of the target traffic intersection according to the second traffic control strategy comprises:
generating, according to the second traffic control strategy, a control instruction for the traffic signal lamp of the target traffic intersection at the next moment, and executing control of the traffic signal lamp based on the control instruction.
7. The method of claim 1, further comprising:
receiving an updated traffic control strategy from any neighboring traffic intersection of the target traffic intersection in the target area, and updating the second traffic control strategy of the target traffic intersection according to the updated traffic control strategy.
8. A traffic signal lamp control device based on deep reinforcement learning, comprising:
an acquisition module, configured to acquire intersection traffic data of a target traffic intersection, the intersection traffic data comprising at least one of: vehicle moving speeds at the intersection, vehicle coordinate positions at the intersection, vehicle queue lengths at the intersection, the number of pedestrians at the intersection, and the current signal lamp phase of the target traffic intersection;
an input module, configured to perform feature conversion on the intersection traffic data to obtain an intersection feature vector, and to input the intersection feature vector into a trained traffic signal lamp control model to obtain a first traffic control strategy of the target traffic intersection, wherein the traffic signal lamp control model is a deep reinforcement learning model comprising a plurality of strategy models and a plurality of evaluation models, the evaluation models being configured to determine evaluation values of the action strategies output by the strategy models, the evaluation values being used to update the model parameters of the deep reinforcement learning model;
a determining module, configured to perform simulation on the first traffic control strategy using a Monte Carlo tree search algorithm, and to determine a second traffic control strategy of the target traffic intersection according to the simulation result; and
a control module, configured to execute traffic signal lamp control of the target traffic intersection according to the second traffic control strategy.
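The acquisition and input modules above turn raw intersection measurements into a fixed-length feature vector before the deep reinforcement learning model sees them. A minimal sketch of that conversion; the field names and normalization constants are assumptions for illustration, not values from the patent:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class IntersectionData:
    vehicle_speeds: List[float]  # m/s, one entry per detected vehicle
    queue_lengths: List[int]     # queued vehicles per approach lane
    pedestrian_count: int
    current_phase: int           # index of the active signal phase
    num_phases: int = 4

def to_feature_vector(d: IntersectionData) -> List[float]:
    """Aggregate raw measurements into a fixed-length, roughly
    unit-scaled vector suitable as model input."""
    mean_speed = (sum(d.vehicle_speeds) / len(d.vehicle_speeds)
                  if d.vehicle_speeds else 0.0)
    # One-hot encode the current phase so the model sees it categorically.
    phase_onehot = [1.0 if i == d.current_phase else 0.0
                    for i in range(d.num_phases)]
    return ([mean_speed / 20.0,            # assumed ~free-flow speed scale
             sum(d.queue_lengths) / 50.0,  # assumed queue-capacity scale
             d.pedestrian_count / 30.0]    # assumed crowd scale
            + phase_onehot)
```

Keeping the vector length fixed regardless of how many vehicles are detected is what lets the same trained model be applied at every decision step.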
9. A non-transitory readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the deep reinforcement learning-based traffic signal lamp control method of any one of claims 1 to 7.
10. A computer device comprising a non-volatile readable storage medium, a processor, and a computer program stored on the non-volatile readable storage medium and executable on the processor, wherein the processor, when executing the program, implements the deep reinforcement learning-based traffic signal lamp control method of any one of claims 1 to 7.
CN202210126521.6A 2022-02-10 2022-02-10 Traffic signal lamp control method, device and equipment based on deep reinforcement learning Active CN114613159B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210126521.6A CN114613159B (en) 2022-02-10 2022-02-10 Traffic signal lamp control method, device and equipment based on deep reinforcement learning


Publications (2)

Publication Number Publication Date
CN114613159A CN114613159A (en) 2022-06-10
CN114613159B true CN114613159B (en) 2023-07-28

Family

ID=81858335


Country Status (1)

Country Link
CN (1) CN114613159B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115512554B (en) * 2022-09-02 2023-07-28 北京百度网讯科技有限公司 Parameter model training and traffic signal control method, device, equipment and medium
CN115966089A (en) * 2022-12-19 2023-04-14 成都秦川物联网科技股份有限公司 Smart city traffic signal lamp time setting method and Internet of things system
CN118428695B (en) * 2024-07-01 2024-09-17 无棣建丰市政工程有限公司 Intelligent management system for intelligent municipal infrastructure

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111696348A (en) * 2020-06-05 2020-09-22 南京云创大数据科技股份有限公司 Multifunctional intelligent signal control system and method
CN112614343A (en) * 2020-12-11 2021-04-06 多伦科技股份有限公司 Traffic signal control method and system based on random strategy gradient and electronic equipment
WO2021232387A1 (en) * 2020-05-22 2021-11-25 南京云创大数据科技股份有限公司 Multifunctional intelligent signal control system
WO2022007199A1 (en) * 2020-07-06 2022-01-13 哈尔滨工业大学 Robot state planning method based on monte carlo tree search algorithm

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107705557B (en) * 2017-09-04 2020-02-21 清华大学 Road network signal control method and device based on depth-enhanced network
CA3060900A1 (en) * 2018-11-05 2020-05-05 Royal Bank Of Canada System and method for deep reinforcement learning
CN111932871B (en) * 2020-06-28 2021-06-29 银江股份有限公司 Regional real-time traffic control strategy recommendation system and method
CN112669629B (en) * 2020-12-17 2022-09-23 北京建筑大学 Real-time traffic signal control method and device based on deep reinforcement learning
CN112700663A (en) * 2020-12-23 2021-04-23 大连理工大学 Multi-agent intelligent signal lamp road network control method based on deep reinforcement learning strategy
CN113299085A (en) * 2021-06-11 2021-08-24 昭通亮风台信息科技有限公司 Traffic signal lamp control method, equipment and storage medium




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant