
CN113378306B - Traffic control method, traffic control device, electronic equipment and computer-readable storage medium

Info

Publication number: CN113378306B
Application number: CN202110927773.4A
Authority: CN
Inventors: 由长喜
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2021-08-13
Filing date: 2021-08-13
Publication date: 2021-12-03
Anticipated expiration: 2041-08-13
Also published as: CN113378306A

Abstract

The application provides a traffic control method, a traffic control device, electronic equipment and a computer-readable storage medium; the method is applied to the traffic field; the method comprises the following steps: performing sequence prediction processing according to the current lane state of the target traffic intersection and the traffic phase sequence to obtain sequence probabilities corresponding to the traffic phase sequences respectively; selecting a target traffic phase sequence which accords with the current target traffic intersection from the plurality of traffic phase sequences according to the sequence probability; performing phase prediction processing according to the current lane state, the traffic phase and the traffic phase sequence of the target traffic intersection to obtain phase probabilities respectively corresponding to the current traffic phase of the target traffic intersection and candidate traffic phases in the target traffic phase sequence; and selecting the target traffic phase which accords with the current target traffic intersection according to the phase probability. Through the method and the device, the flexibility of traffic control can be improved, and the congestion degree of the traffic intersection is effectively reduced.

Description

Traffic control method, traffic control device, electronic equipment and computer-readable storage medium

Technical Field

The present application relates to traffic technologies, and in particular, to a traffic control method, an apparatus, an electronic device, and a computer-readable storage medium.

Background

A traffic phase is associated with a traffic intersection and refers to an ordered combination of the color states of a plurality of signal lamps at that traffic intersection (such as a crossroads or another type of intersection). In everyday urban traffic, the switching of traffic phases affects the passing efficiency of vehicles at the corresponding traffic intersection.

In the solutions provided in the related art, the traffic phase switching rule of each traffic intersection is usually set in advance by the relevant personnel, so that the signal lamps switch colors according to a fixed traffic phase switching rule. For example, for a certain signal lamp, the rule may be that the red light is displayed continuously for 60 seconds, then the green light is displayed continuously for 20 seconds, then the red light is displayed continuously for 60 seconds again, and so on in a cycle. However, because urban traffic is highly complex, the driving conditions of vehicles may differ greatly between periods (such as peak and off-peak periods). The solutions provided by the related art therefore have poor flexibility, which easily results in a high degree of congestion at the traffic intersection.

Disclosure of Invention

The embodiment of the application provides a traffic control method, a traffic control device, electronic equipment and a computer-readable storage medium, which can improve the flexibility of traffic control and effectively reduce the congestion degree of a traffic intersection.

The technical scheme of the embodiment of the application is realized as follows:

the embodiment of the application provides a traffic control method, which comprises the following steps:

performing sequence prediction processing according to the current lane state of the target traffic intersection and the traffic phase sequence to obtain sequence probabilities corresponding to the traffic phase sequences respectively; wherein the traffic phase sequence comprises a plurality of traffic phases having a sequence;

taking the traffic phase sequence with the maximum sequence probability as a target traffic phase sequence;

performing phase prediction processing according to the current lane state, the traffic phase and the traffic phase sequence of the target traffic intersection to obtain phase probabilities respectively corresponding to the current traffic phase and the candidate traffic phase of the target traffic intersection; wherein the candidate traffic phase represents a traffic phase in the sequence of target traffic phases that is subsequent to the current traffic phase;

and taking the traffic phase with the maximum phase probability as a target traffic phase, and applying the target traffic phase at the target traffic intersection.

An embodiment of the present application provides a traffic control device, including:

the sequence selection module is used for carrying out sequence prediction processing according to the current lane state of the target traffic intersection and the traffic phase sequence to obtain sequence probabilities corresponding to the traffic phase sequences respectively; wherein the traffic phase sequence comprises a plurality of traffic phases having a sequence;

the sequence selection module is also used for taking the traffic phase sequence with the maximum sequence probability as a target traffic phase sequence;

the phase selection module is used for carrying out phase prediction processing according to the current lane state, the traffic phase and the traffic phase sequence of the target traffic intersection to obtain phase probabilities respectively corresponding to the current traffic phase and the candidate traffic phase of the target traffic intersection; wherein the candidate traffic phase represents a traffic phase in the sequence of target traffic phases that is subsequent to the current traffic phase;

the phase selection module is further configured to use the traffic phase with the largest phase probability as a target traffic phase, and apply the target traffic phase at the target traffic intersection.

An embodiment of the present application provides an electronic device, including:

a memory for storing executable instructions;

and the processor is used for realizing the traffic control method provided by the embodiment of the application when executing the executable instructions stored in the memory.

The embodiment of the application provides a computer-readable storage medium, which stores executable instructions that, when executed by a processor, implement the traffic control method provided by the embodiment of the application.

The embodiment of the application has the following beneficial effects:

performing sequence prediction processing according to the current lane state and the traffic phase sequence of the target traffic intersection to obtain sequence probabilities corresponding to the traffic phase sequences respectively, and performing macroscopic selection aiming at the traffic phase sequences according to the sequence probabilities to select the target traffic phase sequence which best meets the current condition of the target traffic intersection; and performing phase prediction processing according to the current lane state, the traffic phase and the traffic phase sequence of the target traffic intersection to obtain phase probabilities corresponding to the current traffic phase and the candidate traffic phase respectively, and judging whether to continuously maintain the current traffic phase or apply the candidate traffic phase according to the phase probabilities, so that the finally applied target traffic phase can be ensured to be most consistent with the current situation of the target traffic intersection. Therefore, the adaptability adjustment can be carried out according to the current condition of the target traffic intersection, the flexibility of traffic control is improved, meanwhile, the congestion degree of the traffic intersection can be effectively reduced, and the running efficiency of the vehicle at the traffic intersection is improved.

Drawings

FIG. 1 is a schematic architecture diagram of a traffic control system provided in an embodiment of the present application;

FIG. 2 is a schematic architecture diagram of a terminal device provided in an embodiment of the present application;

FIG. 3A is a schematic flow chart of a traffic control method according to an embodiment of the present application;

FIG. 3B is a schematic flow chart of reinforcement learning provided by an embodiment of the present application;

FIG. 3C is a schematic flow chart of reinforcement learning provided by an embodiment of the present application;

FIG. 3D is a schematic flow chart of reinforcement learning provided by an embodiment of the present application;

FIG. 3E is a schematic flow chart of a traffic control method according to an embodiment of the present application;

FIG. 4 is a schematic diagram of a target traffic intersection and an incoming lane provided by an embodiment of the present application;

FIG. 5 is a schematic diagram of reinforcement learning provided by an embodiment of the present application;

FIG. 6 is a schematic diagram of a network architecture provided by an embodiment of the present application;

FIG. 7 is a schematic diagram of a traffic phase provided by an embodiment of the present application;

FIG. 8 is a schematic diagram of a traffic phase sequence provided by an embodiment of the present application;

FIG. 9 is a schematic illustration of an adjacent traffic intersection provided by an embodiment of the present application;

FIG. 10 is a schematic diagram of a network architecture of a top-level controller provided by an embodiment of the present application;

FIG. 11 is a schematic diagram of a network architecture of an underlying controller provided in an embodiment of the present application;

FIG. 12 is a schematic flowchart of model training by a reinforcement learning principle according to an embodiment of the present application.

Detailed Description

In order to make the objectives, technical solutions and advantages of the present application clearer, the present application will be described in further detail with reference to the attached drawings, the described embodiments should not be considered as limiting the present application, and all other embodiments obtained by a person of ordinary skill in the art without creative efforts shall fall within the protection scope of the present application.

In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.

In the following description, the terms "first", "second" and "third" are used only to distinguish similar objects and do not denote a particular order. It is understood that "first", "second" and "third" may be interchanged in a specific order or sequence where permitted, so that the embodiments of the application described herein can be practiced in an order other than that shown or described herein. In the following description, the term "plurality" means at least two.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.

Before further detailed description of the embodiments of the present application, terms and expressions referred to in the embodiments of the present application will be described, and the terms and expressions referred to in the embodiments of the present application will be used for the following explanation.

1) Traffic phase: an ordered combination of the color states of a plurality of signal lamps (such as all signal lamps) at a traffic intersection; a traffic phase corresponds to the traffic intersection for which it is defined. The embodiment of the present application does not limit the type of the traffic intersection; for example, the traffic intersection may be a crossroads, a T-junction, or the like. It should be noted that applying a traffic phase at a traffic intersection means adjusting the plurality of signal lamps of the traffic intersection to be consistent with the traffic phase.

2) Traffic phase sequence: the sequence is obtained by arranging a plurality of traffic phases according to a certain sequence. For each traffic intersection, a corresponding plurality of traffic phase sequences may be preset.

3) Sequence control model: a model constructed based on Artificial Intelligence (AI) principles for performing sequence prediction processing. AI is a comprehensive technique of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Similarly, the phase control model is a model constructed based on AI principles for performing phase prediction processing.

For a traffic intersection, a sequence control model and a plurality of phase control models can be corresponded, wherein each phase control model corresponds to a traffic phase sequence.
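The correspondence described above can be sketched as a small data structure. This is only an illustration under stated assumptions: the class name `IntersectionControllers`, the tuple encoding of a traffic phase, and the placeholder `uniform_model` are all hypothetical and do not come from the embodiment, which does not prescribe any particular representation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical encoding: a traffic phase is one color per signal lamp, in a fixed lamp order.
Phase = Tuple[str, ...]
# A traffic phase sequence is several traffic phases arranged in order.
PhaseSequence = Tuple[Phase, ...]
# A model maps a lane state vector to a list of probabilities (assumed interface).
Model = Callable[[Sequence[float]], List[float]]


@dataclass
class IntersectionControllers:
    """One sequence control model plus one phase control model per phase sequence."""
    phase_sequences: List[PhaseSequence]
    sequence_control_model: Model
    phase_control_models: Dict[int, Model] = field(default_factory=dict)

    def has_model_for_every_sequence(self) -> bool:
        # Each traffic phase sequence (by index) should have its own phase control model.
        return set(self.phase_control_models) == set(range(len(self.phase_sequences)))


def uniform_model(n: int) -> Model:
    """Placeholder standing in for a trained model: returns n equal probabilities."""
    def model(lane_state: Sequence[float]) -> List[float]:
        return [1.0 / n] * n
    return model


ns_green: Phase = ("green", "red")   # north-south lamp green, east-west lamp red
ew_green: Phase = ("red", "green")
controllers = IntersectionControllers(
    phase_sequences=[(ns_green, ew_green), (ew_green, ns_green)],
    sequence_control_model=uniform_model(2),
    phase_control_models={0: uniform_model(2), 1: uniform_model(2)},
)
```

Keying the phase control models by sequence index is one possible design; any stable identifier of a traffic phase sequence would serve equally well.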

4) Reinforcement Learning (RL): also known as evaluative learning, one of the paradigms and methodologies of machine learning, used to describe and solve the problem of an agent learning a strategy while interacting with an environment so as to maximize return or achieve a specific goal. In the embodiment of the application, the traffic intersection can be used as an agent to train the model corresponding to the traffic intersection.

5) Simulation: in the embodiment of the application, the vehicle driving conditions at a plurality of traffic intersections are simulated; during the simulation, the traffic phase of a traffic intersection can be adjusted, thereby realizing reinforcement learning.

6) Loss value: represents the difference between the output result of the model and the desired result. Model training mainly involves two processes, forward propagation and back propagation. Taking a neural network model comprising an input layer, a hidden layer and an output layer as an example, forward propagation refers to processing sequentially through the input layer, the hidden layer and the output layer to finally obtain an output result; back propagation refers to propagating the calculated loss value sequentially through the output layer, the hidden layer and the input layer, thereby updating the weight parameters in each layer.
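For ease of understanding, forward propagation, the loss value and back propagation can be sketched on a deliberately tiny network with one input, one hidden unit and one output. The network shape, the tanh activation, the squared-error loss and the learning rate are assumptions chosen for illustration; the embodiment does not specify them.

```python
import math


def forward(x: float, w1: float, w2: float):
    """Forward propagation: input layer -> hidden layer -> output layer."""
    h = math.tanh(w1 * x)   # hidden layer activation
    y = w2 * h              # output layer result
    return h, y


def loss_value(y: float, target: float) -> float:
    """Loss value: difference between the output result and the desired result."""
    return 0.5 * (y - target) ** 2


def backward(x: float, h: float, y: float, target: float, w2: float):
    """Back propagation: gradients flow from the output layer back toward the input layer."""
    d_y = y - target                     # dLoss/dy at the output layer
    grad_w2 = d_y * h                    # gradient of the output-layer weight
    d_h = d_y * w2                       # gradient reaching the hidden layer
    grad_w1 = d_h * (1.0 - h * h) * x    # tanh'(z) = 1 - tanh(z)^2
    return grad_w1, grad_w2


def train_step(x: float, target: float, w1: float, w2: float, lr: float = 0.1):
    """One update of the weight parameters by gradient descent."""
    h, y = forward(x, w1, w2)
    grad_w1, grad_w2 = backward(x, h, y, target, w2)
    return w1 - lr * grad_w1, w2 - lr * grad_w2


w1, w2 = 0.5, 0.5
_, y_before = forward(1.0, w1, w2)
w1, w2 = train_step(1.0, 1.0, w1, w2)
_, y_after = forward(1.0, w1, w2)
```

In this sketch, one training step moves the output closer to the desired result, so the loss value computed after the step is smaller than the one computed before it.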

The embodiment of the application provides a traffic control method, a traffic control device, electronic equipment and a computer-readable storage medium, which can improve the flexibility of traffic control and effectively reduce the congestion degree of a traffic intersection. An exemplary application of the electronic device provided in the embodiment of the present application is described below, and the electronic device provided in the embodiment of the present application may be implemented as various types of terminal devices, and may also be implemented as a server.

Referring to fig. 1, fig. 1 is a schematic diagram of an architecture of a traffic control system 100 provided in an embodiment of the present application, and a terminal device 400 is connected to a server 200 through a network 300, where the network 300 may be a wide area network or a local area network, or a combination of the two.

In some embodiments, taking the electronic device as a terminal device as an example, the traffic control method provided in the embodiments of the present application may be implemented by the terminal device. For example, for a target traffic intersection needing traffic control in a Road Network (Road Network), the terminal device 400 performs sequence prediction processing according to a current lane state of the target traffic intersection and a traffic phase sequence to obtain sequence probabilities corresponding to a plurality of traffic phase sequences respectively; wherein the traffic phase sequence comprises a plurality of traffic phases having a sequence; taking the traffic phase sequence with the maximum sequence probability as a target traffic phase sequence; performing phase prediction processing according to the current lane state, the traffic phase and the traffic phase sequence of the target traffic intersection to obtain phase probabilities respectively corresponding to the current traffic phase and the candidate traffic phase of the target traffic intersection; wherein the candidate traffic phase represents a traffic phase in the sequence of target traffic phases that is subsequent to the current traffic phase; and taking the traffic phase with the maximum phase probability as a target traffic phase, and applying the target traffic phase at the target traffic intersection. The road network refers to a road system formed by interconnecting and interlacing various roads in a certain area and distributed in a net shape.

It is to be noted that the terminal device 400 may implement the sequence prediction processing through a sequence control model, and implement the phase prediction processing through a phase control model. Before that, the terminal device 400 may train the sequence control model and the phase control model based on the reinforcement learning principle, and store the trained sequence control model and the trained phase control model to the local for easy calling.

It is worth mentioning that when the terminal device 400 is a traffic phase control terminal (e.g., a terminal for controlling the color of a signal lamp) of the target traffic intersection, the terminal device 400 may directly apply the target traffic phase at the target traffic intersection so that the traffic phase of the target traffic intersection at the next moment coincides with the target traffic phase. When the terminal device 400 is not a traffic phase control terminal of the target traffic intersection, the terminal device 400 may transmit the calculated target traffic phase to a traffic phase control terminal of the target traffic intersection, so that the traffic phase control terminal applies the received target traffic phase at the target traffic intersection.

In some embodiments, taking the electronic device as a server as an example, the traffic control method provided in the embodiments of the present application may also be implemented by the server. For example, the server 200 performs a series of processing according to the current lane state, traffic phase and traffic phase sequence of the target traffic intersection to obtain the target traffic phase, and applies the target traffic phase to the target traffic intersection, for example, notifies the traffic phase control terminal of the target traffic intersection to apply the target traffic phase to the target traffic intersection.

Similarly, the server 200 may train the sequence control model and the phase control model based on the reinforcement learning principle, and store the trained sequence control model and the trained phase control model to the local (for example, in the distributed file system of the server 200), so as to subsequently call the trained sequence control model to realize the sequence prediction processing, and call the trained phase control model to realize the phase prediction processing.

In some embodiments, the traffic control method provided in the embodiments of the present application may also be implemented by the terminal device and the server in a cooperative manner. For example, the terminal device 400 may transmit the acquired current lane state, traffic phase, and traffic phase sequence of the target traffic intersection to the server 200, so that the server 200 calculates the target traffic phase. The server 200 may transmit the calculated target traffic phase to the terminal device 400 to cause the terminal device 400 to apply the target traffic phase at the target traffic intersection.

For another example, the server 200 may transmit the trained sequence control model and the trained phase control model to the terminal apparatus 400, so that the terminal apparatus 400 has the capabilities of the sequence prediction process and the phase prediction process.

In some embodiments, the Traffic control System 100 shown in fig. 1 may be implemented as or as part of an Intelligent Transportation System (ITS). The Intelligent Transportation System is a comprehensive Transportation System which effectively and comprehensively applies advanced scientific technologies (such as information technology, computer technology, data communication technology, sensor technology, electronic control technology, automatic control theory, operation research, artificial intelligence and the like) to Transportation, service control and vehicle manufacturing, and strengthens the relation among vehicles, roads and users, thereby forming the comprehensive Transportation System which ensures safety, improves efficiency, improves environment and saves energy.

In some embodiments, the terminal device 400 or the server 200 may implement the traffic control method provided by the embodiments of the present application by running a computer program. For example, the computer program may be a native program or a software module in an operating system; it may be a native application (APP), i.e. a program that needs to be installed in an operating system to run (such as the client 410 shown in fig. 1), for example an application for controlling signal lights at a traffic intersection; it may be an applet, i.e. a program that only needs to be downloaded into a browser environment to run; or it may be an applet that can be embedded into any APP. In general, the computer program described above may be any form of application, module or plug-in.

In some embodiments, the server 200 may be an independent physical server, may also be a server cluster or a distributed system formed by a plurality of physical servers, and may also be a cloud server that provides basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), and a big data and artificial intelligence platform, where the cloud service may be a traffic control service for the terminal device 400 to call. The terminal device 400 may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart television, a smart watch, and the like. The terminal device and the server may be directly or indirectly connected through wired or wireless communication, and the embodiment of the present application is not limited.

The following description takes the case where the electronic device provided in the embodiment of the present application is a terminal device as an example. It can be understood that, when the electronic device is a server, parts of the structure shown in fig. 2 (such as the user interface, the presentation module, and the input processing module) may be omitted. Referring to fig. 2, fig. 2 is a schematic structural diagram of a terminal device 400 provided in an embodiment of the present application, where the terminal device 400 shown in fig. 2 includes: at least one processor 410, memory 450, at least one network interface 420, and a user interface 430. The various components in the terminal device 400 are coupled together by a bus system 440. It is understood that the bus system 440 is used to enable communications among the components. The bus system 440 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 440 in fig. 2.

The Processor 410 may be an integrated circuit chip having Signal processing capabilities, such as a general purpose Processor, a Digital Signal Processor (DSP), or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like, wherein the general purpose Processor may be a microprocessor or any conventional Processor, or the like.

The user interface 430 includes one or more output devices 431, including one or more speakers and/or one or more visual displays, that enable the presentation of media content. The user interface 430 also includes one or more input devices 432, including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.

The memory 450 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard disk drives, optical disk drives, and the like. Memory 450 optionally includes one or more storage devices physically located remote from processor 410.

The memory 450 includes either volatile memory or nonvolatile memory, and may include both volatile and nonvolatile memory. The nonvolatile Memory may be a Read Only Memory (ROM), and the volatile Memory may be a Random Access Memory (RAM). The memory 450 described in embodiments herein is intended to comprise any suitable type of memory.

In some embodiments, memory 450 is capable of storing data, examples of which include programs, modules, and data structures, or a subset or superset thereof, to support various operations, as exemplified below.

An operating system 451, including system programs for handling various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and handling hardware-based tasks;

a network communication module 452 for communicating with other electronic devices via one or more (wired or wireless) network interfaces 420, exemplary network interfaces 420 including: Bluetooth, Wireless Fidelity (Wi-Fi), Universal Serial Bus (USB), etc.;

a presentation module 453 for enabling presentation of information (e.g., user interfaces for operating peripherals and displaying content and information) via one or more output devices 431 (e.g., display screens, speakers, etc.) associated with user interface 430;

an input processing module 454 for detecting one or more user inputs or interactions from one of the one or more input devices 432 and translating the detected inputs or interactions.

In some embodiments, the apparatus provided in the embodiments of the present application may be implemented in software, and fig. 2 shows a traffic control apparatus 455 stored in a memory 450, which may be software in the form of programs and plug-ins, and the like, and includes the following software modules: a sequence selection module 4551 and a phase selection module 4552, which are logical and thus may be arbitrarily combined or further split depending on the functions implemented. The functions of the respective modules will be explained below.

The traffic control method provided by the embodiment of the present application will be described in conjunction with exemplary applications and implementations of the electronic device provided by the embodiment of the present application.

Referring to fig. 3A, fig. 3A is a schematic flow chart of a traffic control method according to an embodiment of the present application, and will be described with reference to the steps shown in fig. 3A.

In step 101, sequence prediction processing is carried out according to the current lane state of the target traffic intersection and the traffic phase sequence to obtain sequence probabilities corresponding to the traffic phase sequences respectively; wherein the traffic phase sequence comprises a plurality of traffic phases having a sequence.

Here, the target traffic intersection may be any one of traffic intersections in any region (such as a certain county or a certain city). In the process of controlling the traffic of the target traffic intersection, the current lane state, the traffic phase and the traffic phase sequence of the target traffic intersection are firstly obtained, wherein the lane state is the vehicle driving state of at least one lane corresponding to the target traffic intersection. And then, performing sequence prediction processing according to the acquired current lane state of the target traffic intersection and the acquired traffic phase sequence to obtain sequence probabilities respectively corresponding to the traffic phase sequences, wherein the sequence probabilities represent the matching degree of the corresponding traffic phase sequences and the current situation of the target traffic intersection.

It should be noted that the embodiment of the present application does not limit the manner of acquiring the lane state and the traffic phase, and for example, the target traffic intersection may be photographed to obtain an image, and the image may be analyzed to obtain the lane state and the traffic phase; or the lane state and the traffic phase may also be obtained from an internet of things terminal corresponding to the target traffic intersection, such as a lane state sensing terminal (for example, a sensing terminal arranged on a road surface and used for sensing a fleet length, a vehicle speed, a vehicle waiting time and the like of the road surface), a traffic phase control terminal (for example, a terminal used for controlling a color of a signal lamp) and the like. In addition, the current traffic phase sequence of the target traffic intersection is the last determined target traffic phase sequence.

It is worth noting that the embodiment of the present application limits neither the number of traffic phases included in a traffic phase sequence nor the manner in which a traffic phase sequence is generated. For example, exhaustive combination processing may be performed on the plurality of traffic phases supported by the target traffic intersection to obtain a plurality of traffic phase sequences; alternatively, the relevant personnel may manually set a plurality of traffic phase sequences according to the vehicle driving characteristics of the target traffic intersection.
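As a minimal sketch of the exhaustive combination processing mentioned above, the following enumerates every ordering of a given length over the supported traffic phases. Treating "exhaustive combination" as ordered arrangements without repetition is an assumption, and the phase names and the function `enumerate_phase_sequences` are hypothetical.

```python
from itertools import permutations
from typing import List, Sequence, Tuple


def enumerate_phase_sequences(phases: Sequence[str], length: int) -> List[Tuple[str, ...]]:
    """Return every ordered arrangement of `length` distinct phases (assumed interpretation)."""
    return list(permutations(phases, length))


# Hypothetical traffic phases supported by a four-way intersection.
supported_phases = ["NS_straight", "NS_left", "EW_straight", "EW_left"]
all_sequences = enumerate_phase_sequences(supported_phases, length=4)
```

Because the number of arrangements grows factorially with the number of phases, manually set sequences, which the embodiment also allows, may be more practical for intersections with many supported phases.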

In some embodiments, before step 101, the method further includes: combining the current fleet length, vehicle speed and vehicle waiting duration of the incoming lanes of the target traffic intersection to obtain the current lane state of the target traffic intersection.

For example, the current lane state of the target traffic intersection includes, but is not limited to, the current fleet length, vehicle speed and vehicle waiting duration of the incoming lanes (some or all of the incoming lanes) of the target traffic intersection. The fleet length may be the total length of all waiting vehicles (for example, vehicles waiting at a red light); the vehicle speed may be a real-time vehicle speed or an average vehicle speed over a past period of time; and the vehicle waiting duration may be the waiting duration of the first vehicle, that is, the vehicle that will be the first to enter the target traffic intersection once the waiting ends. For ease of understanding, fig. 4 provides a schematic diagram of the target traffic intersection, taking a four-way intersection as an example and showing incoming lanes 1 to 4.

In this way, the current situation of the target traffic intersection can be comprehensively reflected on the basis of the fleet length, the vehicle speed and the vehicle waiting duration, which facilitates accurate sequence prediction processing and phase prediction processing.
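The combination of per-lane measurements into a lane state can be sketched as follows. This is only one possible encoding: the class name, field names, units and the flat concatenation order are assumptions made for illustration, not details given in the text.

```python
from dataclasses import dataclass

@dataclass
class LaneObservation:
    fleet_length_m: float   # total length of waiting vehicles on the lane (assumed metres)
    speed_mps: float        # real-time or averaged vehicle speed (assumed m/s)
    first_wait_s: float     # waiting duration of the first vehicle (assumed seconds)

def build_lane_state(observations):
    """Concatenate per-lane measurements into one flat state vector."""
    state = []
    for obs in observations:
        state.extend([obs.fleet_length_m, obs.speed_mps, obs.first_wait_s])
    return state

# Two hypothetical incoming lanes: one queued at a red light, one flowing.
lanes = [LaneObservation(35.0, 0.0, 42.0), LaneObservation(0.0, 11.5, 0.0)]
lane_state = build_lane_state(lanes)
```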

In step 102, the traffic phase sequence with the highest sequence probability is used as the target traffic phase sequence.

After the sequence probabilities corresponding to the traffic phase sequences are obtained, the traffic phase sequence with the maximum sequence probability is used as a target traffic phase sequence, so that the macroscopic selection of the traffic phase sequences is realized based on the sequence probabilities, and the selected target traffic phase sequence is ensured to be most consistent with the current situation of a target traffic intersection. Wherein, the target traffic phase sequence can be the same as or different from the current traffic phase sequence at the target traffic intersection.
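The selection in step 102 amounts to an argmax over the sequence probabilities. A minimal sketch follows; the function name and the tie-breaking rule (first maximum wins) are assumptions, since the text does not address ties.

```python
def select_target_sequence(sequence_probs):
    """Return the index of the traffic phase sequence with the highest probability."""
    # max() returns the first maximal index, so ties favour the earlier sequence.
    return max(range(len(sequence_probs)), key=lambda i: sequence_probs[i])

# Hypothetical probabilities for three candidate sequences.
probs = [0.1, 0.6, 0.3]
target_index = select_target_sequence(probs)
```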

In step 103, performing phase prediction processing according to the current lane state, the traffic phase and the traffic phase sequence of the target traffic intersection to obtain phase probabilities respectively corresponding to the current traffic phase and the candidate traffic phase of the target traffic intersection; wherein the candidate traffic phase represents a traffic phase in the target traffic phase sequence that is subsequent to the current traffic phase.

Here, a traffic phase (for example, a next traffic phase after the current traffic phase) in the target traffic phase sequence is used as a candidate traffic phase, and phase prediction processing is performed according to the current lane state of the target traffic intersection, the traffic phase and the traffic phase sequence to obtain a phase probability corresponding to the current traffic phase of the target traffic intersection and a phase probability corresponding to the candidate traffic phase, wherein the phase probability represents a matching degree of the corresponding traffic phase and the current situation of the target traffic intersection.

In some embodiments, the sequence prediction process according to the current lane state and traffic phase sequence of the target traffic intersection can be implemented in such a way that: when the sequence prediction period arrives, performing sequence prediction processing according to the current lane state of the target traffic intersection and the traffic phase sequence; the phase prediction processing according to the current lane state, traffic phase and traffic phase sequence of the target traffic intersection can be realized in such a way that: when the phase prediction period arrives, performing phase prediction processing according to the current lane state, traffic phase and traffic phase sequence of the target traffic intersection; wherein the sequence prediction period is greater than the phase prediction period.

In the present embodiment, the traffic control may be performed periodically. For example, when the sequence prediction period arrives, the sequence prediction processing is performed according to the current lane state of the target traffic intersection and the traffic phase sequence; and when the phase prediction period arrives, the phase prediction processing is performed according to the current lane state, the traffic phase and the traffic phase sequence of the target traffic intersection. Compared with a traffic phase, a traffic phase sequence is more macroscopic and usually remains applicable for longer, so the sequence prediction period can be set longer than the phase prediction period; for example, the sequence prediction period is 900 seconds and the phase prediction period is 15 seconds. The same traffic phase sequence (namely, the target traffic phase sequence) is then applied throughout one sequence prediction period, and the same traffic phase (namely, the target traffic phase) is applied throughout one phase prediction period. Because the sequence prediction processing is executed less often, the consumption of computing resources can be reduced while the traffic control effect is maintained.
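The two-rate schedule can be sketched as below, using the example periods from the text. The function name is hypothetical, and the sketch assumes that both periods start at time zero and that the sequence prediction period is a multiple of the phase prediction period.

```python
SEQUENCE_PERIOD_S = 900   # example value from the text
PHASE_PERIOD_S = 15       # example value from the text

def due_predictions(elapsed_s):
    """Return which predictions are due at a given elapsed time in seconds."""
    due = []
    if elapsed_s % SEQUENCE_PERIOD_S == 0:
        due.append("sequence")   # pick the target sequence first
    if elapsed_s % PHASE_PERIOD_S == 0:
        due.append("phase")      # then pick the phase within that sequence
    return due

# Count how often each prediction runs during one hour of control.
counts = {"sequence": 0, "phase": 0}
for t in range(0, 3600, PHASE_PERIOD_S):
    for kind in due_predictions(t):
        counts[kind] += 1
```

Under these assumptions, one hour of control runs the sequence prediction 4 times and the phase prediction 240 times, which shows where the saving in computation comes from.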

In some embodiments, the sequence prediction process according to the current lane state and traffic phase sequence of the target traffic intersection can be implemented in such a way that: performing sequence prediction processing on the current lane state and traffic phase sequence of the target traffic intersection through a sequence control model; the phase prediction processing according to the current lane state, traffic phase and traffic phase sequence of the target traffic intersection can be realized in such a way that: performing phase prediction processing on the current lane state, the traffic phase and the traffic phase sequence of the target traffic intersection through a phase control model corresponding to the target traffic phase sequence; the traffic phase sequences respectively correspond to a phase control model.

The embodiment of the present application may be implemented using artificial intelligence techniques. For example, the target traffic intersection corresponds to one sequence control model and a plurality of phase control models, where the plurality of phase control models respectively correspond to different traffic phase sequences. In this case, the sequence control model performs sequence prediction processing on the current lane state and the traffic phase sequence of the target traffic intersection, and once the target traffic phase sequence is determined, the phase control model corresponding to the target traffic phase sequence performs phase prediction processing on the current lane state, the traffic phase and the traffic phase sequence of the target traffic intersection. The accuracy of the sequence prediction processing and the phase prediction processing can thus be improved. Before this, the sequence control model and the plurality of phase control models may also be trained to further improve processing accuracy.

In step 104, the traffic phase with the highest phase probability is taken as the target traffic phase, and the target traffic phase is applied at the target traffic intersection.

The traffic phase with the maximum phase probability is used as the target traffic phase, and the target traffic phase is applied to the target traffic intersection, so that the bottom-layer selection of the traffic phase is realized on the basis of the macroscopically selected target traffic phase sequence, and the accuracy of traffic control can be improved.

It is worth noting that, when the target traffic phase is the current traffic phase of the target traffic intersection, applying the target traffic phase at the target traffic intersection means keeping the current traffic phase of the target traffic intersection unchanged; when the target traffic phase is the candidate traffic phase, applying the target traffic phase at the target traffic intersection means switching the current traffic phase of the target traffic intersection to the candidate traffic phase.
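The keep-or-switch decision of steps 103 and 104 can be sketched as follows. The function names and phase labels are hypothetical, and two behaviours are assumptions not stated in the text: the candidate is taken to be the immediately following phase with wrap-around at the end of the sequence, and ties keep the current phase.

```python
def next_phase(sequence, current_phase):
    """Return the phase after current_phase, wrapping at the end (assumed)."""
    i = sequence.index(current_phase)
    return sequence[(i + 1) % len(sequence)]

def choose_phase(sequence, current_phase, p_current, p_candidate):
    """Keep the current phase or switch to the candidate, whichever scores higher."""
    candidate = next_phase(sequence, current_phase)
    # Ties keep the current phase; the text does not specify tie-breaking.
    return current_phase if p_current >= p_candidate else candidate

target_sequence = ["A", "B", "C", "D"]   # hypothetical phase labels
kept = choose_phase(target_sequence, "B", 0.7, 0.3)       # stays on "B"
switched = choose_phase(target_sequence, "D", 0.2, 0.8)   # wraps to "A"
```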

As shown in fig. 3A, by determining the target traffic phase sequence and the target traffic phase, the embodiment of the present application can effectively reduce the congestion degree at the target traffic intersection, reduce the number of times of stopping vehicles, improve the passing rate and speed of the vehicles at the target traffic intersection, and achieve good traffic control.

In some embodiments, referring to fig. 3B, fig. 3B is a schematic flowchart of reinforcement learning provided by the embodiments of the present application, and will be described with reference to the steps shown in fig. 3B.

In step 201, an environment simulation process is performed on the target traffic intersection to obtain a simulated lane state, a simulated traffic phase, and a simulated traffic phase sequence.

Here, the sequence control model and the phase control model may be trained based on the reinforcement learning principle, thereby improving the accuracy of the subsequent processing.

For example, the environment simulation processing may be performed on the target traffic intersection to obtain a simulated lane state, a simulated traffic phase, and a simulated traffic phase sequence of the target traffic intersection in the simulation environment, where the simulated traffic phase is one of a plurality of traffic phases corresponding to the target traffic intersection, and similarly, the simulated traffic phase sequence is one of a plurality of traffic phase sequences corresponding to the target traffic intersection.

It is worth mentioning that the environment simulation processing simulates the target traffic intersection as it exists in a real environment, and can be implemented with an open-source simulation tool such as Simulation of Urban MObility (SUMO). To improve the accuracy of the environment simulation processing, the environment simulation processing may be performed on a plurality of traffic intersections (for example, all traffic intersections in a certain area) including the target traffic intersection.

In step 202, sequence prediction processing is performed on the current simulated lane state and the simulated traffic phase sequence of the target traffic intersection through a sequence control model, so as to obtain training sequence probabilities respectively corresponding to the plurality of traffic phase sequences.

The sequence control model is used for carrying out sequence prediction processing on the current simulated lane state and the simulated traffic phase sequence of the target traffic intersection to obtain sequence probabilities corresponding to the traffic phase sequences respectively, and for convenience of distinguishing, the obtained sequence probabilities are named as training sequence probabilities.

In step 203, the traffic phase sequence with the highest probability of the training sequence is used as the training traffic phase sequence.

Here, the maximum probability of the training sequence is selected, and the traffic phase sequence corresponding to the maximum probability of the training sequence is used as the training traffic phase sequence.

In step 204, phase prediction processing is performed on the current simulated lane state, the simulated traffic phase and the simulated traffic phase sequence of the target traffic intersection through a phase control model corresponding to the training traffic phase sequence, so as to obtain training phase probabilities respectively corresponding to the current simulated traffic phase and the training candidate traffic phase of the target traffic intersection; wherein the training candidate traffic phase represents a traffic phase in the training traffic phase sequence that is located after the current simulated traffic phase.

Here, a traffic phase (for example, a next traffic phase after the current simulated traffic phase) in the training traffic phase sequence after the current simulated traffic phase is taken as a training candidate traffic phase, and a phase control model corresponding to the training traffic phase sequence performs phase prediction processing on the current simulated lane state, the simulated traffic phase and the simulated traffic phase sequence of the target traffic intersection to obtain phase probabilities respectively corresponding to the current simulated traffic phase and the training candidate traffic phase of the target traffic intersection, where the obtained phase probabilities are named as training phase probabilities for convenience of distinction.

In step 205, the traffic phase with the maximum probability of the training phase is used as the training target traffic phase, and the training target traffic phase is applied to the simulation of the target traffic intersection.

Here, the maximum training phase probability is selected, and the traffic phase corresponding to the maximum training phase probability is taken as the training target traffic phase. Then, a training target traffic phase is simulated and applied in a simulation environment of the target traffic intersection, namely, the situation after the training target traffic phase is applied at the target traffic intersection in a real environment is simulated.

In step 206, a new simulated lane state obtained after the training target traffic phase is applied in simulation at the target traffic intersection is determined, and a control reward is determined according to the new simulated lane state.

Here, a new simulated lane state obtained after the training target traffic phase is applied to the target traffic intersection simulation is determined, where the new simulated lane state may be a simulated lane state at a certain time after the training target traffic phase is applied, or an average simulated lane state in a certain time period after the training target traffic phase is applied. And then determining a control reward according to the new simulated lane state, wherein the control reward is used for expressing the positive effect on traffic control after the training target traffic phase is applied, and the positive effect is stronger when the control reward is larger.

In some embodiments, the new simulated lane state includes first state data and second state data; the congestion degree of the target traffic intersection is negatively correlated with the first state data and positively correlated with the second state data; the above determination of the control reward according to the new simulated lane state can be implemented as follows: performing state data fusion processing on the first state data and the second state data to obtain the control reward; wherein the control reward is positively correlated with the first state data and negatively correlated with the second state data.

Here, the new simulated lane status may include first status data and second status data, wherein the congestion degree of the target traffic intersection is negatively correlated with the first status data and positively correlated with the second status data, for example, the first status data may include a vehicle speed, and the second status data may include a vehicle fleet length and a vehicle waiting time period. In this case, the first state data and the second state data may be subjected to state data fusion processing to obtain the control reward, where the control reward is positively correlated with the first state data and negatively correlated with the second state data, and the state data fusion processing mode includes, but is not limited to, weighted summation.

Of course, the new simulated lane state may also include only the first state data, or only the second state data. For example, where the new simulated lane state includes only the vehicle speed, the vehicle speed may be used directly as the control reward, or the control reward may be obtained by further processing (for example, multiplying the vehicle speed by a positive weight parameter), provided that the control reward remains positively correlated with the vehicle speed. For another example, where the new simulated lane state includes only the fleet length and the vehicle waiting duration, state data fusion processing may be performed on the fleet length and the vehicle waiting duration to obtain the control reward, where the control reward is negatively correlated with both the fleet length and the vehicle waiting duration. In this way, the control reward can be computed more accurately, which in turn improves the accuracy of the reinforcement learning.
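A weighted-sum form of the state data fusion can be sketched as follows. The function name and the weight values are hypothetical; the only property taken from the text is the sign of each term, with the vehicle speed counting positively and the fleet length and waiting duration counting negatively.

```python
def control_reward(speed, fleet_length, wait_time,
                   w_speed=1.0, w_length=0.5, w_wait=0.1):
    """Weighted sum: rises with speed, falls with fleet length and waiting time."""
    # All weights are positive; the signs below enforce the required correlations.
    return w_speed * speed - w_length * fleet_length - w_wait * wait_time

# Hypothetical lane states after a phase has been applied in simulation.
free_flow = control_reward(speed=12.0, fleet_length=0.0, wait_time=0.0)
congested = control_reward(speed=2.0, fleet_length=40.0, wait_time=60.0)
```

Setting a weight to zero recovers the single-sided cases described above, for example a reward that depends on the vehicle speed alone.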

In step 207, reinforcement learning is performed on the sequence control model and on the phase control model corresponding to the training traffic phase sequence according to the control reward.

Here, reinforcement learning, that is, model training, is performed on the sequence control model and on the phase control model corresponding to the training traffic phase sequence according to the control reward. The embodiment of the present application does not limit the reinforcement learning algorithm, which may be, for example, the Advantage Actor-Critic (A2C) algorithm, the Asynchronous Advantage Actor-Critic (A3C) algorithm, or the Deep Q-Network (DQN) algorithm.

In some embodiments, the above sequence prediction process of the current simulated lane state and the simulated traffic phase sequence at the target traffic intersection by the sequence control model can be implemented by: when the sequence prediction period is reached, performing sequence prediction processing on the current simulated lane state and the simulated traffic phase sequence of the target traffic intersection through a sequence control model; the phase prediction processing of the current simulated lane state, the simulated traffic phase and the simulated traffic phase sequence of the target traffic intersection by the phase control model corresponding to the training traffic phase sequence can be realized in such a way that: when the phase prediction period is reached, performing phase prediction processing on the current simulated lane state, the current simulated traffic phase and the current simulated traffic phase sequence of the target traffic intersection through a phase control model corresponding to the training traffic phase sequence; wherein the sequence prediction period is greater than the phase prediction period.

Similarly, the sequence prediction processing may be performed once per sequence prediction period, and the phase prediction processing may be performed once per phase prediction period, where the sequence prediction period is greater than the phase prediction period.

It should be noted that, in this case, for the phase control model corresponding to the training traffic phase sequence, a control reward may be determined once per phase prediction period according to the training target traffic phase calculated in that period, and this control reward is used to perform reinforcement learning on (that is, to train) the phase control model corresponding to the training traffic phase sequence; for the sequence control model, once per sequence prediction period, the control rewards corresponding to all phase prediction periods within that sequence prediction period may be subjected to reward fusion processing to obtain a fused control reward, which is used to perform reinforcement learning on (that is, to train) the sequence control model. The reward fusion processing includes, but is not limited to, summation and weighted summation.
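The reward fusion for the sequence control model can be sketched as below, covering both the summation and the weighted-summation options. The function name, the reward values and the weights are hypothetical and chosen only for illustration.

```python
def fuse_rewards(phase_rewards, weights=None):
    """Fuse per-phase-period control rewards into one sequence-level reward."""
    if weights is None:
        return sum(phase_rewards)   # plain summation
    if len(weights) != len(phase_rewards):
        raise ValueError("weights and rewards must have the same length")
    return sum(w * r for w, r in zip(weights, phase_rewards))   # weighted summation

# With a 900 s sequence period and a 15 s phase period there would be 60
# phase-level rewards per sequence period; four are shown here for brevity.
phase_rewards = [1.0, -2.0, 0.5, 3.0]
fused_sum = fuse_rewards(phase_rewards)
fused_weighted = fuse_rewards(phase_rewards, weights=[0.25, 0.25, 0.25, 0.25])
```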

As shown in fig. 3B, in the embodiment of the present application, a sequence control model and a plurality of phase control models corresponding to a target traffic intersection can be effectively trained in an environment simulation processing manner, so as to improve the subsequent traffic control effect.

In some embodiments, referring to fig. 3C, fig. 3C is a schematic flowchart of reinforcement learning provided by the embodiments of the present application. Based on fig. 3B, at the same time as step 202, in step 301, sequence value prediction processing may be performed on the current simulated lane state and the simulated traffic phase sequence of the target traffic intersection through the sequence control model to obtain a sequence value.

In the embodiment of the present application, the reinforcement learning may be performed by A2C, which involves an actor branch and a critic branch. For the sequence control model, the training sequence probabilities respectively corresponding to the plurality of traffic phase sequences are the output of the actor branch of the sequence control model.

When the sequence control model performs sequence prediction processing on the current simulated lane state and the simulated traffic phase sequence of the target traffic intersection, it may also perform sequence value prediction processing on the same inputs to obtain a value, which is named the sequence value for ease of distinction. The sequence value is the output of the critic branch of the sequence control model.

Wherein, the sequence prediction process and the sequence value prediction process can share at least part of the network layer in the sequence control model.

In some embodiments, the sequence control model includes a fully-connected network and a memory transfer network, and the fully-connected network includes fully-connected sub-networks respectively corresponding to the current simulated lane state and the simulated traffic phase sequence of the target traffic intersection. The above sequence prediction processing performed by the sequence control model on the current simulated lane state and the simulated traffic phase sequence of the target traffic intersection, which yields the training sequence probabilities respectively corresponding to the plurality of traffic phase sequences, can be implemented as follows: performing full-connection processing on the current simulated lane state of the target traffic intersection through the fully-connected sub-network corresponding to the current simulated lane state of the target traffic intersection to obtain a full-connection result corresponding to the current simulated lane state of the target traffic intersection; performing full-connection processing on the current simulated traffic phase sequence of the target traffic intersection through the fully-connected sub-network corresponding to the current simulated traffic phase sequence of the target traffic intersection to obtain a full-connection result corresponding to the current simulated traffic phase sequence of the target traffic intersection; performing memory transfer processing, through the memory transfer network, on the full-connection results respectively corresponding to the current simulated lane state and the simulated traffic phase sequence of the target traffic intersection to obtain a memory transfer result; and performing probability normalization processing on the memory transfer result to obtain the training sequence probabilities respectively corresponding to the plurality of traffic phase sequences.

Here, an example network architecture of the sequence control model is provided: the sequence control model includes a Fully Connected (FC) network and a memory transfer network. The fully-connected network includes a fully-connected sub-network (which may be a fully-connected layer) corresponding to the current simulated lane state of the target traffic intersection and a fully-connected sub-network corresponding to the current simulated traffic phase sequence of the target traffic intersection. Where the current simulated lane state includes a plurality of state data (such as the fleet length, the vehicle speed and the vehicle waiting duration), the sub-network for the lane state may be further subdivided into fully-connected sub-networks respectively corresponding to the plurality of state data. The memory transfer network supports memory transfer between its internal network layers; for example, it may be a Long Short-Term Memory (LSTM) network, but is not limited thereto.

In this way, full-connection processing can be performed on the current simulated lane state through the fully-connected sub-network corresponding to the current simulated lane state in the sequence control model to obtain a full-connection result corresponding to the current simulated lane state, and full-connection processing can be performed on the current simulated traffic phase sequence through the fully-connected sub-network corresponding to the current simulated traffic phase sequence in the sequence control model to obtain a full-connection result corresponding to the current simulated traffic phase sequence. Then, memory transfer processing is performed on the two full-connection results through the memory transfer network in the sequence control model to obtain a memory transfer result. Finally, probability normalization processing is performed on the memory transfer result to map it to the training sequence probabilities respectively corresponding to the plurality of traffic phase sequences. The probability normalization processing can be implemented by a Softmax function, but is not limited thereto, and can be regarded as the actor branch of the sequence control model. This network architecture can improve the accuracy of the sequence prediction processing.

In some embodiments, the above sequence value prediction processing performed by the sequence control model on the current simulated lane state and the simulated traffic phase sequence of the target traffic intersection to obtain the sequence value can be implemented as follows: performing linear regression processing on the memory transfer result to obtain the sequence value.

Here, linear regression processing may be performed on the memory transfer result output by the memory transfer network in the sequence control model to obtain the sequence value. The linear regression processing can be regarded as the critic branch of the sequence control model.
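The two output branches that share one memory transfer result can be sketched in plain Python as below. This is a simplified illustration, not the described network: the fully-connected sub-networks and the memory transfer network are omitted, the memory transfer result is a made-up vector, all parameters are hypothetical, and the linear layer placed before the Softmax (to map the feature dimension to the number of sequences) is an assumption.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear(weights, bias, features):
    """Apply a linear layer given as a list of weight rows and a bias list."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, bias)]

def actor_critic_heads(hidden, actor_w, actor_b, critic_w, critic_b):
    """Map a shared hidden vector to sequence probabilities and a sequence value."""
    probs = softmax(linear(actor_w, actor_b, hidden))   # actor branch
    value = linear([critic_w], [critic_b], hidden)[0]   # critic branch
    return probs, value

# Hypothetical memory transfer result and parameters: 3 features, 2 sequences.
hidden = [0.5, -1.0, 2.0]
actor_w = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
actor_b = [0.0, 0.0]
critic_w = [0.2, 0.2, 0.2]
critic_b = 0.1
probs, value = actor_critic_heads(hidden, actor_w, actor_b, critic_w, critic_b)
```

Sharing the hidden vector between the two heads mirrors the statement that sequence prediction and sequence value prediction can share at least part of the network layers.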

In fig. 3C, at the same time as step 204 shown in fig. 3B, in step 302, phase value prediction processing may be performed on the current simulated lane state, the simulated traffic phase and the simulated traffic phase sequence of the target traffic intersection through the phase control model corresponding to the training traffic phase sequence to obtain a phase value.

Similarly, the phase control model may include an actor branch and a critic branch. For example, for the phase control model corresponding to the training traffic phase sequence, the training phase probabilities respectively corresponding to the current simulated traffic phase and the training candidate traffic phase of the target traffic intersection are the output of the actor branch.

When the phase control model corresponding to the training traffic phase sequence performs phase prediction processing on the current simulated lane state, the simulated traffic phase and the simulated traffic phase sequence of the target traffic intersection, it may also perform phase value prediction processing on the same inputs to obtain the phase value, which is the output of the critic branch in the phase control model corresponding to the training traffic phase sequence.

The sequence value, the phase value and the control reward are jointly used to perform reinforcement learning on the sequence control model and on the phase control model corresponding to the training traffic phase sequence.

In some embodiments, the phase control model corresponding to the training traffic phase sequence includes a fully-connected network and a memory transfer network, and the fully-connected network includes fully-connected sub-networks respectively corresponding to the current simulated lane state, the simulated traffic phase and the simulated traffic phase sequence of the target traffic intersection. The above phase prediction processing performed by the phase control model corresponding to the training traffic phase sequence on the current simulated lane state, the simulated traffic phase and the simulated traffic phase sequence of the target traffic intersection, which yields the training phase probabilities respectively corresponding to the current simulated traffic phase and the training candidate traffic phase of the target traffic intersection, can be implemented as follows: performing full-connection processing on the current simulated lane state of the target traffic intersection through the fully-connected sub-network corresponding to the current simulated lane state of the target traffic intersection to obtain a full-connection result corresponding to the current simulated lane state of the target traffic intersection; performing full-connection processing on the current simulated traffic phase of the target traffic intersection through the fully-connected sub-network corresponding to the current simulated traffic phase of the target traffic intersection to obtain a full-connection result corresponding to the current simulated traffic phase of the target traffic intersection; performing full-connection processing on the current simulated traffic phase sequence of the target traffic intersection through the fully-connected sub-network corresponding to the current simulated traffic phase sequence of the target traffic intersection to obtain a full-connection result corresponding to the current simulated traffic phase sequence of the target traffic intersection; performing memory transfer processing, through the memory transfer network, on the full-connection results respectively corresponding to the current simulated lane state, the simulated traffic phase and the simulated traffic phase sequence of the target traffic intersection to obtain a memory transfer result; and performing probability normalization processing on the memory transfer result to obtain the training phase probabilities respectively corresponding to the current simulated traffic phase and the training candidate traffic phase of the target traffic intersection.

Similar to the sequence control model, the full-connection processing can be performed on the current simulated lane state of the target traffic intersection through a full-connection sub-network corresponding to the current simulated lane state in the phase control model corresponding to the training traffic phase sequence, so that a full-connection result corresponding to the current simulated lane state of the target traffic intersection is obtained; performing full-connection processing on the current simulated traffic phase of the target traffic intersection through a full-connection sub-network corresponding to the current simulated traffic phase in a phase control model corresponding to the training traffic phase sequence to obtain a full-connection result corresponding to the current simulated traffic phase of the target traffic intersection; and carrying out full connection processing on the current simulated traffic phase sequence of the target traffic intersection through a full connection sub-network corresponding to the current simulated traffic phase sequence in the phase control model corresponding to the training traffic phase sequence to obtain a full connection result corresponding to the current simulated traffic phase sequence of the target traffic intersection. And then, carrying out memory transfer processing on full connection results respectively corresponding to the current simulated lane state, the simulated traffic phase and the simulated traffic phase sequence of the target traffic intersection through a memory transfer network in the phase control model corresponding to the training traffic phase sequence to obtain a memory transfer result. And finally, carrying out probability normalization processing on the memory transfer result to obtain training phase probabilities respectively corresponding to the current simulated traffic phase and the training candidate traffic phase of the target traffic intersection. 
The probability normalization processing here can likewise be implemented by a Softmax function, but is not limited thereto; this processing can be regarded as the actor branch of the phase control model corresponding to the training traffic phase sequence.

In some embodiments, the above phase value prediction processing, performed on the current simulated lane state, the simulated traffic phase, and the simulated traffic phase sequence of the target traffic intersection through the phase control model corresponding to the training traffic phase sequence to obtain the phase value, can be implemented in the following manner: linear regression processing is performed on the memory transfer result to obtain the phase value.

Here, the linear regression processing may be performed on the memory transfer result output by the memory transfer network in the phase control model corresponding to the training traffic phase sequence, so as to obtain the phase value. Here, the process of the linear regression process can be regarded as a critic branch of the phase control model corresponding to the training traffic phase sequence.
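
The two branches described above can be illustrated with a minimal sketch. It assumes the memory transfer result is a plain feature vector; the weights, the feature size and the function names are hypothetical and are not taken from the embodiment. The actor branch applies a Softmax to two scores (current traffic phase and candidate traffic phase), and the critic branch is a single linear unit.

```python
import math

def softmax(logits):
    """Probability normalization: map raw scores to a distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear(weights, bias, features):
    """One linear unit: dot(weights, features) + bias."""
    return sum(w * x for w, x in zip(weights, features)) + bias

def actor_critic_heads(memory_result, actor_params, critic_params):
    """Return (phase probabilities, phase value) from one shared feature vector."""
    logits = [linear(w, b, memory_result) for w, b in actor_params]
    value = linear(critic_params[0], critic_params[1], memory_result)
    return softmax(logits), value

# Hypothetical 3-dimensional memory transfer result and hand-picked weights.
memory_result = [0.2, -0.1, 0.4]
actor_params = [([1.0, 0.0, 0.5], 0.0),   # score for the current traffic phase
                ([0.0, 1.0, -0.5], 0.1)]  # score for the candidate traffic phase
critic_params = ([0.3, 0.3, 0.3], 0.0)
probs, value = actor_critic_heads(memory_result, actor_params, critic_params)
```

Both heads read the same memory transfer result, which is why the two outputs can be trained jointly.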

In fig. 3C, step 207 shown in fig. 3B can be implemented by steps 303 to 306, and will be described with reference to each step.

In step 303, a first phase loss value is determined according to the control reward and the phase value, and a second phase loss value is determined according to the control reward, the phase value, and a training phase probability corresponding to the training target traffic phase.

Here, the A2C algorithm includes a first loss function and a second loss function. The parameters of the first loss function include the control reward and the output of the critic branch, and the parameters of the second loss function include the control reward, the output of the critic branch, and the output of the actor branch.

Therefore, for the phase control model corresponding to the training traffic phase sequence, the control reward and the phase value are substituted into the first loss function to obtain a first phase loss value; and substituting the control reward, the phase value and the training phase probability corresponding to the training target traffic phase into the second loss function to obtain a second phase loss value.
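
The text does not spell out the two loss functions, so the following is only a sketch under a common one-step actor-critic form: the first loss is the squared error between the control reward and the phase value, and the second loss is the negative log-probability of the chosen phase scaled by the advantage (reward minus value). Using the control reward directly as the target, without discounting or bootstrapping, is a simplifying assumption.

```python
import math

def first_loss(control_reward, value):
    """Critic-side loss: squared error between the reward and the predicted value."""
    return (control_reward - value) ** 2

def second_loss(control_reward, value, action_prob):
    """Actor-side loss: negative log-probability scaled by the advantage."""
    advantage = control_reward - value  # treated as a constant for the actor update
    return -math.log(action_prob) * advantage

# Hypothetical numbers: control reward, phase value, and the training phase
# probability of the training target traffic phase.
reward, value, prob = 1.0, 0.4, 0.5
l1 = first_loss(reward, value)
l2 = second_loss(reward, value, prob)
```

A positive advantage makes the second loss push the probability of the chosen phase up; a negative advantage pushes it down.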

In step 304, a phase control model corresponding to the training traffic phase sequence is trained according to the first phase loss value and the second phase loss value.

Here, the phase control model corresponding to the training traffic phase sequence is trained according to the first phase loss value and the second phase loss value, thereby realizing reinforcement learning of that phase control model. The model training can be realized by combining back propagation with gradient descent.

In some embodiments, the above training of the phase control model corresponding to the training traffic phase sequence according to the first phase loss value and the second phase loss value may be implemented by performing any one of the following processes: sequentially training the phase control model corresponding to the training traffic phase sequence according to the first phase loss value and the second phase loss value, where the training priority of the first phase loss value is greater than that of the second phase loss value, or the training priority of the second phase loss value is greater than that of the first phase loss value; or performing loss value fusion processing on the first phase loss value and the second phase loss value, and training the phase control model corresponding to the training traffic phase sequence according to the obtained fusion loss value.

Here, two training methods are provided for training the phase control model corresponding to the traffic phase sequence, which will be separately described.

1) The phase control model corresponding to the training traffic phase sequence is trained sequentially according to the first phase loss value and the second phase loss value. For example, when the training priority of the first phase loss value is greater than that of the second phase loss value, the phase control model is first trained according to the first phase loss value and then trained according to the second phase loss value. When the training priority of the second phase loss value is greater than that of the first phase loss value, the phase control model is first trained according to the second phase loss value and then trained according to the first phase loss value.

2) Loss value fusion processing is performed on the first phase loss value and the second phase loss value, and the phase control model corresponding to the training traffic phase sequence is trained according to the obtained fusion loss value. The loss value fusion processing includes, but is not limited to, summation, weighted summation, averaging, and weighted averaging.
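
The fusion options listed in the second manner can be sketched as a single helper. The function name, the mode strings and the default weights are illustrative assumptions only.

```python
def fuse_losses(first, second, mode="sum", weights=(0.5, 0.5)):
    """Combine two loss values into one fusion loss value."""
    w1, w2 = weights
    if mode == "sum":
        return first + second
    if mode == "weighted_sum":
        return w1 * first + w2 * second
    if mode == "average":
        return (first + second) / 2
    if mode == "weighted_average":
        return (w1 * first + w2 * second) / (w1 + w2)
    raise ValueError(f"unknown fusion mode: {mode}")

# Hypothetical first and second phase loss values.
fused = fuse_losses(1.0, 3.0, mode="weighted_average", weights=(1.0, 3.0))
```

Training on a single fused value needs only one backward pass per step, whereas the sequential manner applies two separate updates in priority order.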

In step 305, a first sequence loss value is determined according to the control reward and the sequence value, and a second sequence loss value is determined according to the control reward, the sequence value and the training sequence probability corresponding to the training traffic phase sequence.

Similarly, for the sequence control model, substituting the control reward and the sequence value into the first loss function to obtain a first sequence loss value; and substituting the control reward, the sequence value and the training sequence probability corresponding to the training traffic phase sequence into the second loss function to obtain a second sequence loss value.

It is noted that the control reward in step 305 may be replaced by a fusion control reward for the case of application to the sequence prediction period as well as the phase prediction period.

In step 306, a sequence control model is trained according to the first sequence loss value and the second sequence loss value.

Here, the sequence control model is trained according to the first sequence loss value and the second sequence loss value, thereby realizing reinforcement learning of the sequence control model.

In some embodiments, the above training of the sequence control model according to the first sequence loss value and the second sequence loss value may be implemented by performing any one of the following processes: sequentially training the sequence control model according to the first sequence loss value and the second sequence loss value, where the training priority of the first sequence loss value is greater than that of the second sequence loss value, or the training priority of the second sequence loss value is greater than that of the first sequence loss value; or performing loss value fusion processing on the first sequence loss value and the second sequence loss value, and training the sequence control model according to the obtained fusion loss value.

As shown in fig. 3C, the embodiment of the present application performs reinforcement learning in combination with the A2C algorithm, so as to further improve the effect of model training.

In some embodiments, referring to fig. 3D, fig. 3D is a schematic flow chart of reinforcement learning provided in the embodiments of the present application, step 202 shown in fig. 3B may be updated to step 401, and in step 401, sequence prediction processing is performed on current simulated lane states and simulated traffic phase sequences of a plurality of traffic intersections in a combined traffic intersection through a sequence control model, so as to obtain training sequence probabilities respectively corresponding to the plurality of traffic phase sequences; wherein the combined traffic intersection includes the target traffic intersection and an adjacent traffic intersection to the target traffic intersection.

In the embodiment of the present application, a traffic intersection is not independent but is related to other traffic intersections; for example, a vehicle exits from traffic intersection A and enters traffic intersection B. Therefore, in the process of reinforcement learning, in addition to the situation of the target traffic intersection, the situation of the adjacent traffic intersections of the target traffic intersection can also be referred to.

For example, the sequence control model may perform sequence prediction processing on current simulated lane states and simulated traffic phase sequences of a plurality of traffic intersections in the combined traffic intersection to obtain training sequence probabilities corresponding to the plurality of traffic phase sequences, where the combined traffic intersection includes a target traffic intersection and an adjacent traffic intersection of the target traffic intersection, that is, the combined traffic intersection may be understood as a set of the target traffic intersection and the adjacent traffic intersection.

In some embodiments, before step 401, the method further includes performing any one of the following processes for the target traffic intersection: taking a traffic intersection connected to the target traffic intersection as an adjacent traffic intersection, where no other traffic intersection lies between the target traffic intersection and the connected traffic intersection; taking a traffic intersection whose distance from the target traffic intersection is smaller than a distance threshold as an adjacent traffic intersection; or obtaining a plurality of vehicle driving records that include the target traffic intersection, and taking a traffic intersection that is different from the target traffic intersection and whose number of occurrences in the vehicle driving records is greater than a frequency threshold as an adjacent traffic intersection.

The embodiments of the present application provide three ways of determining an adjacent traffic intersection of a target traffic intersection, which will be described separately.

1) A traffic intersection connected to the target traffic intersection is taken as an adjacent traffic intersection, where no other traffic intersection lies between the target traffic intersection and the connected traffic intersection.

2) A traffic intersection whose distance from the target traffic intersection is smaller than a distance threshold is taken as an adjacent traffic intersection, where the distance may be a graph distance (Graph Distance) or another type of distance.

3) A plurality of vehicle driving records that include the target traffic intersection are obtained, and a traffic intersection that is different from the target traffic intersection and whose number of occurrences in the vehicle driving records is greater than a frequency threshold is taken as an adjacent traffic intersection. A vehicle driving record includes the traffic intersections passed by the vehicle during driving, and may be a vehicle driving record in a real environment or in a simulated environment.
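
The three ways above can be sketched as follows. The road network is modeled as a hypothetical adjacency dictionary, the graph distance is taken as the hop count found by breadth-first search, and the third way is read as counting how many driving records contain both the target and the other intersection; all three readings are assumptions made for illustration.

```python
from collections import Counter, deque

def directly_connected(graph, target):
    """Way 1: intersections linked to the target with no intersection in between."""
    return set(graph[target])

def within_graph_distance(graph, target, threshold):
    """Way 2: intersections whose hop count from the target is below the threshold."""
    dist = {target: 0}
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return {n for n, d in dist.items() if n != target and d < threshold}

def frequent_in_records(records, target, min_count):
    """Way 3: intersections seen with the target in more than min_count records."""
    counts = Counter()
    for record in records:
        if target in record:
            counts.update(set(record) - {target})
    return {n for n, c in counts.items() if c > min_count}

# Hypothetical road network and driving records.
graph = {"A": ["B", "C", "D", "E"], "B": ["A", "F"], "C": ["A"],
         "D": ["A"], "E": ["A"], "F": ["B"]}
records = [["A", "B", "F"], ["C", "A", "B"], ["A", "B"], ["D", "E"]]
neighbors_1 = directly_connected(graph, "A")
neighbors_2 = within_graph_distance(graph, "A", threshold=3)
neighbors_3 = frequent_in_records(records, "A", min_count=2)
```

The first two ways depend only on the road topology, while the third reflects how vehicles actually move and can therefore yield a different neighbor set for the same intersection.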

In fig. 3D, step 204 shown in fig. 3B may be updated to step 402, and in step 402, phase prediction processing is performed on the current simulated lane states, the simulated traffic phases and the simulated traffic phase sequences of the multiple traffic intersections in the combined traffic intersection through the phase control models corresponding to the training traffic phase sequences, so as to obtain training phase probabilities respectively corresponding to the current simulated traffic phase and the training candidate traffic phase of the target traffic intersection.

Similarly, during the phase prediction process, a plurality of traffic intersections in the combined traffic intersection may also be referred to.

In fig. 3D, step 206 shown in fig. 3B can be implemented by steps 403 to 405, and will be described with reference to each step.

In step 403, a new simulated lane state, obtained after the training target traffic phase is applied in simulation at the target traffic intersection, is determined.

In step 404, for any traffic intersection in the combined traffic intersection, a distance weight corresponding to that traffic intersection is determined according to the distance between that traffic intersection and the target traffic intersection; wherein the distance weight is negatively correlated with the distance.

For each traffic intersection in the combined traffic intersection, the distance weight corresponding to the traffic intersection can be determined according to the distance between the traffic intersection and the target traffic intersection. The distance weight and the distance are negatively correlated; that is, the greater the distance between a traffic intersection A and the target traffic intersection, the lower the importance of traffic intersection A, and the smaller the distance weight corresponding to traffic intersection A.

In step 405, the new simulated lane status is weighted according to the distance weights corresponding to the multiple traffic intersections in the combined traffic intersection, so as to obtain the control reward.

For example, the new simulated lane states are weighted and summed according to the distance weights corresponding to the plurality of traffic intersections in the combined traffic intersection, so as to obtain the control reward. The larger the distance weight corresponding to a traffic intersection in the combined traffic intersection, the larger the contribution of that traffic intersection to the control reward. In this way, all the traffic intersections in the combined traffic intersection are considered together, which improves the comprehensiveness of the calculated control reward.
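
Steps 404 and 405 can be sketched as below. The weight form `base ** distance` with a base between 0 and 1 is one assumed choice that satisfies the negative correlation; the per-intersection scores are hypothetical stand-ins for values derived from the new simulated lane state.

```python
def distance_weight(distance, base=0.5):
    """Weight that shrinks as the distance grows; base in (0, 1) is an assumed choice."""
    return base ** distance

def control_reward(scores, distances, base=0.5):
    """Weighted sum of per-intersection scores over the combined traffic intersection."""
    return sum(distance_weight(distances[name], base) * score
               for name, score in scores.items())

# Hypothetical per-intersection scores (negative queue length, so that less
# congestion gives a larger reward) and distances to the target intersection A.
scores = {"A": -4.0, "B": -2.0, "C": -6.0}
distances = {"A": 0, "B": 1, "C": 1}
reward = control_reward(scores, distances)
```

With these numbers the target intersection counts in full and each neighbor counts half, so congestion at the target dominates the reward.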

As shown in fig. 3D, in the embodiment of the present application, the effect of reinforcement learning can be further improved by comprehensively referring to each of the combined traffic intersections.

In some embodiments, referring to fig. 3E, fig. 3E is a schematic flow chart of the traffic control method provided in the embodiments of the present application, and step 101 shown in fig. 3A may be updated to step 501, in step 501, a sequence prediction process is performed according to current lane states and traffic phase sequences of multiple traffic intersections in a combined traffic intersection, so as to obtain sequence probabilities respectively corresponding to multiple traffic phase sequences; wherein the combined traffic intersection includes the target traffic intersection and an adjacent traffic intersection to the target traffic intersection.

Here, in the process of performing traffic control on the target traffic intersection, in addition to the situation of the target traffic intersection, the situation of the adjacent traffic intersections of the target traffic intersection may also be referred to. For example, sequence prediction processing may be performed according to the current lane states and traffic phase sequences of a plurality of traffic intersections in a combined traffic intersection, so as to obtain sequence probabilities respectively corresponding to the plurality of traffic phase sequences, where the combined traffic intersection includes the target traffic intersection and the adjacent traffic intersections of the target traffic intersection.

It should be noted that, here, the sequence control model may perform sequence prediction processing on the current lane states and the traffic phase sequences of a plurality of traffic intersections in the combined traffic intersection to obtain sequence probabilities respectively corresponding to the plurality of traffic phase sequences.

In fig. 3E, step 103 shown in fig. 3A can be updated to step 502, and in step 502, phase prediction processing is performed according to the current lane states, traffic phases and traffic phase sequences of a plurality of traffic intersections in the combined traffic intersection, so as to obtain phase probabilities corresponding to the current traffic phase and the candidate traffic phase of the target traffic intersection, respectively.

Similarly, the object of the phase prediction processing may also become the current lane states, traffic phases, and traffic phase sequences of a plurality of traffic intersections in the combined traffic intersection.

It should be noted that, here, the phase prediction processing may be performed on the current lane states, the traffic phases, and the traffic phase sequence of the multiple traffic intersections in the combined traffic intersection through the phase control model corresponding to the target traffic phase sequence, so as to obtain the phase probabilities respectively corresponding to the current traffic phase and the candidate traffic phase of the target traffic intersection.

In some embodiments, before step 501, the method further includes performing any one of the following processes for the target traffic intersection: taking a traffic intersection connected to the target traffic intersection as an adjacent traffic intersection, where no other traffic intersection lies between the target traffic intersection and the connected traffic intersection; taking a traffic intersection whose distance from the target traffic intersection is smaller than a distance threshold as an adjacent traffic intersection; or obtaining a plurality of vehicle driving records that include the target traffic intersection, and taking a traffic intersection that is different from the target traffic intersection and whose number of occurrences in the vehicle driving records is greater than a frequency threshold as an adjacent traffic intersection.

As shown in fig. 3E, in the embodiment of the present application, by comprehensively referring to each traffic intersection in the combined traffic intersection, cooperative control over the combined traffic intersection can be implemented, and the effect of traffic control is further improved.

Next, an exemplary application of the embodiments of the present application in an actual application scenario will be described. The embodiment of the application adopts a layered reinforcement learning mode to carry out the signal control collaborative optimization of the regional road network, and mainly comprises the following three aspects: 1) for each traffic intersection, a plurality of bottom controllers (corresponding to the above phase control model, named controllers herein) are set, each bottom controller is used for realizing a signal control scheme of a specific traffic phase sequence, namely each bottom controller corresponds to one traffic phase sequence, so that the diversity and comprehensiveness of the scheme can be ensured to a greater extent; 2) for each traffic intersection, a top-layer controller (corresponding to the above sequence control model, named as meta-controller) is set for periodically scheduling the bottom-layer controller, so as to ensure the orderliness of the controller scheduling of the traffic intersection; 3) the controller and the meta-controller share part of information with controllers of adjacent traffic intersections, and respectively realize phase synergy of the bottom layer and sequence synergy of the top layer, so that the traffic control effect is further improved.

The embodiment of the application can be applied to a signal lamp cooperative control scenario for the road network of any area (such as a certain city): the traffic phase and timing of each traffic intersection in the area are adjusted online according to real-time information such as road conditions, traffic density and fleet length in the area, so that the cooperative effect of multiple traffic intersections in the area is exerted, the overall traffic efficiency of the road network is effectively improved, and the congestion degree of vehicles is greatly reduced. A detailed description is given next.

First, the relevant principle of reinforcement learning is described. The embodiment of the present application provides a schematic diagram of reinforcement learning as shown in fig. 5. Reinforcement learning refers to a cyclic process in which an agent (Agent) takes an action according to the state (State) of an environment (Environment), thereby changing the state of the environment and obtaining a reward (Reward), and then adjusts subsequent actions according to the reward. The goal of reinforcement learning is to obtain the maximum reward, and in the process of reinforcement learning, both exploration (Exploration) and exploitation (Exploitation) can be considered comprehensively.

In the embodiment of the application, each traffic intersection in the road network can be regarded as an Agent, the current situation of the traffic intersection is regarded as the Environment, and the intersection state of the traffic intersection is regarded as the State. For each Agent, the embodiment of the present application sets controllers of two layers, namely, a meta-controller and controllers. The meta-controller serves as the top-layer controller: every $T_m$ seconds it makes a macro decision and outputs a macro-action (such as the serial number of the controller corresponding to the above target traffic phase sequence) to activate the controller corresponding to the macro-action, where $T_m$ corresponds to the sequence prediction period above. The controller is one of a plurality of bottom-layer controllers: once selected by the macro-action output by the top layer, it makes a decision every $T_c$ seconds to determine the target traffic phase applied at the traffic intersection in the next $T_c$ seconds, where $T_c$ corresponds to the phase prediction period above. In the embodiment of the present application, $T_m$ can be an integer multiple of $T_c$, e.g. $T_c = 15$ seconds and $T_m = 60\,T_c = 900$ seconds.
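
The two decision periods can be illustrated with a small scheduling loop that uses the example periods of 15 seconds and 900 seconds from the text. The function name and the two placeholder policies are hypothetical; in the embodiment the decisions come from the trained meta-controller and controllers.

```python
def run_schedule(total_seconds, phase_period, sequence_period,
                 pick_sequence, pick_phase):
    """Return a log of (time, kind, decision) for the two-level schedule."""
    if sequence_period % phase_period != 0:
        raise ValueError("sequence period must be a multiple of the phase period")
    log = []
    active = None
    for t in range(0, total_seconds, phase_period):
        if t % sequence_period == 0:
            active = pick_sequence(t)  # meta-controller outputs a macro-action
            log.append((t, "sequence", active))
        log.append((t, "phase", pick_phase(active, t)))  # activated controller decides
    return log

# Placeholder policies standing in for the trained models.
log = run_schedule(
    total_seconds=1800, phase_period=15, sequence_period=900,
    pick_sequence=lambda t: (t // 900) % 4,
    pick_phase=lambda seq, t: "keep",
)
```

Over 1800 simulated seconds this produces 2 macro decisions and 120 phase decisions, that is, 60 phase decisions per activated controller.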

For each Agent, the embodiment of the present application sets a network (here, an artificial neural network) architecture for the meta-controller and the bottom-layer controllers, and provides a schematic diagram of the network architecture shown in fig. 6 as an example. Both the meta-controller and the controller may use the A2C algorithm for model training, but the training algorithm is not limited thereto. Next, the meta-controller and the controller will be described separately.

1) The bottom-layer controller (controller). Here, the definitions of the state, the action set and the reward corresponding to the controller are mainly involved, where the action set is the set of executable actions.

For a traffic intersection such as a crossroad, the traffic intersection may include the 8 traffic phases shown in fig. 7, namely straight-going, left-turn and full-release phases in different directions, where a full-release phase is a traffic phase in which a single direction is released; the traffic phases shown in fig. 7 are numbered ① to ⑧. Specifically, traffic phase ① indicates east-west straight going, traffic phase ② indicates east-west left turn, traffic phase ③ indicates south-north straight going, traffic phase ④ indicates south-north left turn, traffic phase ⑤ indicates full release from west to east, traffic phase ⑥ indicates full release from east to west, traffic phase ⑦ indicates full release from south to north, and traffic phase ⑧ indicates full release from north to south. For other types of traffic intersections (e.g., T-junctions), the supported traffic phases can be a subset of the 8 traffic phases shown in fig. 7, and will not be described in detail herein.

The traffic phases shown in fig. 7 may be arranged in a certain order to form a traffic phase sequence. Four traffic phase sequences, numbered ① to ④, are provided in fig. 8; each is an ordered chain of traffic phases from fig. 7 of the form "traffic phase -> traffic phase -> ...", and traffic phase sequence ①, for example, is arranged so that left turn gives way to straight going. Of course, more traffic phase sequences can be defined according to the requirements of the practical application scenario, and they are not limited to the 4 shown in fig. 8. One bottom-layer controller is set for each traffic phase sequence, so for fig. 8, 4 bottom-layer controllers need to be set.

Next, the definitions of the state, action set, and reward corresponding to the controller will be described.

a) action set and output of the underlying controller.

The output of the bottom-layer controller is a probability distribution defined over the action set, i.e., a policy. For all the bottom-layer controllers, the embodiment of the present application defines the action set as $\{\text{keep}, \text{next}\}$, where keep means keeping the current traffic phase as the target traffic phase, and next means taking the next traffic phase after the current traffic phase in the traffic phase sequence (corresponding to the candidate traffic phase above) as the target traffic phase.
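
The effect of the two actions on a traffic phase sequence can be sketched as follows. The phase names are placeholders, the greedy choice of the most probable action is only one possible selection rule, and wrapping from the last phase back to the first is an assumption that the sequence is cyclic.

```python
def apply_action(sequence, current_index, action):
    """Return the index of the target traffic phase within the sequence."""
    if action == "keep":
        return current_index
    if action == "next":
        return (current_index + 1) % len(sequence)  # wrap around (assumed cyclic)
    raise ValueError(f"unknown action: {action}")

def choose_action(policy):
    """Greedy choice from a {'keep': p, 'next': q} probability distribution."""
    return max(policy, key=policy.get)

sequence = ["phase_a", "phase_b", "phase_c", "phase_d"]  # hypothetical phase names
index = 3
action = choose_action({"keep": 0.3, "next": 0.7})
index = apply_action(sequence, index, action)
```

Because the action set has only two elements, every bottom-layer controller shares the same output size regardless of how many phases its sequence contains.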

b) state/input of the underlying controller.

The state is a series of characteristics used for describing the current state of a target traffic intersection and the neighborhood thereof, and also directly serves as part of model input of the underlying controller.

For the target traffic intersection k, the fleet length, the vehicle speed (such as the average vehicle speed) and the vehicle waiting time (such as the first-vehicle waiting duration) of each entering lane of k can be combined to obtain the lane state of k. Here, $\mathcal{L}_k$ is used to represent the set of all entering lanes of k; the lane state of k is then $s_k = \{(q_l, w_l, v_l)\}_{l \in \mathcal{L}_k}$, where $q_l$, $w_l$ and $v_l$ respectively indicate the fleet length, the first-vehicle waiting duration and the average vehicle speed of lane $l$. In addition, the embodiment of the present application also uses the current traffic phase (serial number of the traffic phase) and the current traffic phase sequence (serial number of the traffic phase sequence) of the traffic intersection as necessary features for describing the intersection state: $p_k$ represents the serial number of the current traffic phase of the target traffic intersection k among the traffic phases shown in fig. 7, and $g_k$ represents the serial number of the traffic phase sequence currently adopted at the target traffic intersection k among the traffic phase sequences shown in fig. 8. The complete intersection state of the target traffic intersection k can be described as $o_k = (s_k, p_k, g_k)$.

In order to realize the cooperative control effect of the traffic intersections, each traffic intersection needs to observe its own intersection state and the intersection states of its adjacent traffic intersections before determining the target traffic phase. In the embodiment of the present application, an adjacent traffic intersection of the target traffic intersection may refer to a traffic intersection directly connected to the target traffic intersection, that is, there is no other traffic intersection between the target traffic intersection and the adjacent traffic intersection. For ease of understanding, a schematic diagram is provided in fig. 9: for traffic intersection A, its adjacent traffic intersections include traffic intersections B, C, D and E.

If $\mathcal{N}_k$ is used to represent the set consisting of the target traffic intersection k and its adjacent traffic intersections (corresponding to the combined traffic intersection above), the intersection state of the combined traffic intersection can be represented as $O_k = \{o_i\}_{i \in \mathcal{N}_k}$, where $o_i$ is the complete intersection state of traffic intersection $i$, and $O_k$ is the input of the bottom-layer controller corresponding to the target traffic intersection k.
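
Assembling the controller input from the target intersection and its neighbors can be sketched as follows. The class and field names are hypothetical, and feeding the serial numbers as raw numbers rather than one-hot vectors is a simplification made for brevity.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class IntersectionState:
    lanes: List[Tuple[float, float, float]]  # (fleet length, first-vehicle wait, avg speed)
    phase: int           # serial number of the current traffic phase
    phase_sequence: int  # serial number of the current traffic phase sequence

def controller_input(states: Dict[str, IntersectionState],
                     target: str, neighbors: List[str]) -> List[float]:
    """Flatten the target's state followed by the states of its neighbors."""
    features: List[float] = []
    for name in [target, *neighbors]:
        state = states[name]
        for lane in state.lanes:
            features.extend(lane)
        features.extend([float(state.phase), float(state.phase_sequence)])
    return features

# Hypothetical observations for a target intersection A and one neighbor B.
states = {
    "A": IntersectionState(lanes=[(3.0, 10.0, 5.0), (1.0, 0.0, 8.0)],
                           phase=1, phase_sequence=3),
    "B": IntersectionState(lanes=[(2.0, 4.0, 6.0)], phase=2, phase_sequence=1),
}
features = controller_input(states, "A", ["B"])
```

Placing the target intersection first keeps its features at fixed positions even when the number of neighbors differs between intersections.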

c) reward.

The embodiment of the application sets the reward function of the combined traffic intersection to realize cooperative control of the combined traffic intersection. For the target traffic intersection k, the reward can consider the fleet length and the first-vehicle waiting duration, so the reward of the target traffic intersection k can be $r_k = -\sum_{l \in \mathcal{L}_k} (q_l + \alpha\, w_l)$, where $\mathcal{L}_k$ is the set of entering lanes of k, $q_l$ and $w_l$ are the fleet length and the first-vehicle waiting duration of lane $l$, and $\alpha$ is the weight of the first-vehicle waiting duration; for example, the value of $\alpha$ can range from 0 to 0.5. It is worth noting that the value ranges and specific values in the embodiment of the present application are examples and do not constitute a limitation; that is, they can be adaptively adjusted according to the requirements of the actual application scenario. The reward of the combined traffic intersection can be represented as $R_k = \sum_{i \in \mathcal{N}_k} \beta^{\,d(i,k)}\, r_i$, where $\mathcal{N}_k$ is the set consisting of k and its adjacent traffic intersections, $\beta$ is the distance weight, whose value can range from 0 to 1 (for example, 0.5 can be preferred), and $d(i,k)$ indicates the distance between traffic intersection $i$ and traffic intersection k, which may be a graph distance (Graph Distance). Since the combined traffic intersection in the embodiment of the application consists only of the target traffic intersection k and its adjacent traffic intersections, $d(i,k)$ can be set as follows: $d(i,k) = 0$ if $i = k$, and $d(i,k) = 1$ otherwise.
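
The reward of a single traffic intersection can be illustrated with a short calculation. The negative sign (so that shorter queues and shorter waits give a larger reward) and the weight value 0.2, chosen from the example range of 0 to 0.5, are assumptions; the lane data are hypothetical.

```python
def intersection_reward(lanes, wait_weight=0.2):
    """Negative congestion score: fleet length plus weighted first-vehicle wait."""
    return -sum(fleet + wait_weight * wait for fleet, wait in lanes)

# Hypothetical entering lanes as (fleet length, first-vehicle waiting duration).
busy = [(3.0, 10.0), (1.0, 0.0)]
quiet = [(1.0, 0.0), (0.0, 0.0)]
reward_busy = intersection_reward(busy)
reward_quiet = intersection_reward(quiet)
```

A larger waiting weight makes the agent favor serving lanes whose first vehicle has waited long, even when their queues are short.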

2) The top-layer controller (meta-controller). Similarly, the state, the action set (named macro-action set) and the reward corresponding to the meta-controller are defined separately.

a) macro-action set and the output of the top level controller.

The output of the meta-controller is the probability distribution over macro-action sets, which may be referred to as macro-policy. The macro-action set is a set of all traffic phase sequences at the target traffic intersection, for example, 4 traffic phase sequences shown in fig. 8 can be added to one macro-action set.

b) Input to the top level controller.

The meta-controller is used for scheduling the bottom-layer controllers. When the meta-controller selects the bottom-layer controller to be activated, the current traffic phase of the target traffic intersection does not need to be considered; that is, the meta-controller only makes a macroscopic selection among the traffic phase sequences, independent of the current traffic phase of the target traffic intersection, and the current traffic phase is considered by the bottom-layer controller. Thus, for a meta-controller, the current intersection state of the target traffic intersection k is defined as $\tilde{o}_k = (\{(q_l, w_l, v_l)\}_{l \in \mathcal{L}_k},\ g_k)$, where $q_l$, $w_l$ and $v_l$ respectively indicate the fleet length, the first-vehicle waiting duration and the average vehicle speed of lane $l$, $\mathcal{L}_k$ represents the set of all entering lanes of the target traffic intersection k, and $g_k$ is the serial number of the current traffic phase sequence of k. For a meta-controller, the intersection state of the combined traffic intersection may be represented as $\tilde{O}_k = \{\tilde{o}_i\}_{i \in \mathcal{N}_k}$, where $\mathcal{N}_k$ is the set consisting of k and its adjacent traffic intersections, and $\tilde{O}_k$ is the input of the meta-controller corresponding to the target traffic intersection k.

c) reward.

Each meta-controller makes a decision once per sequence prediction period, while the bottom-level controller makes a decision once per phase prediction period and calculates a reward for that decision. Taking a sequence prediction period equal to M phase prediction periods as an example, the reward obtained by the meta-controller for one decision is defined from the rewards of the bottom-level controller over the coming sequence prediction period (only one bottom-level controller is activated during that period), wherein M may be an integer greater than 1.
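As an illustrative sketch only, the meta-controller reward can be computed from the M rewards of the activated bottom-level controller. The exact aggregation is not reproduced in the text above; a plain sum is assumed here as the default, and the function name and the `aggregate` parameter are hypothetical.

```python
from typing import Callable, Sequence

def meta_reward(bottom_rewards: Sequence[float], m: int,
                aggregate: Callable[[Sequence[float]], float] = sum) -> float:
    """Aggregate the M bottom-level rewards collected during one sequence prediction period.

    `aggregate` defaults to a sum (an assumption); a mean or a discounted sum
    could be passed in instead.
    """
    if m <= 1:
        raise ValueError("M is expected to be an integer greater than 1")
    if len(bottom_rewards) != m:
        raise ValueError("expected exactly M bottom-level rewards")
    return aggregate(bottom_rewards)

reward_sum = meta_reward([-1.0, -2.0, -0.5], m=3)
reward_mean = meta_reward([-1.0, -2.0, -0.5], m=3,
                          aggregate=lambda rs: sum(rs) / len(rs))
```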

3) Network architecture.

In the embodiment of the present application, the network architectures of the meta-controller and the bottom-level controller may be set to be similar; the difference is that the input of the meta-controller does not include the current traffic phase of the combined traffic intersection. As an example, fig. 10 shows a schematic diagram of the network architecture of the meta-controller. The inputs of the meta-controller include the fleet length, the first-vehicle waiting duration, and the average vehicle speed extracted at each traffic intersection of the combined traffic intersection in the vehicle driving direction (the direction of entering the traffic intersection), and further include the current traffic phase sequence of each traffic intersection of the combined traffic intersection. Each type of input of the meta-controller (for example, the fleet length is one type and the first-vehicle waiting duration is another) has a corresponding fully connected layer (corresponding to the fully connected sub-network above), followed by a stateful LSTM network, which performs memory transfer processing on the fully connected results output by the fully connected layers to obtain the memory transfer result. The fully connected layer corresponds to FC in fig. 10; FC (64) in fig. 10 indicates an FC layer with 64 neurons, and so on.

As an example, fig. 11 shows a schematic diagram of the network architecture of a bottom-level controller. The inputs of the controller include the fleet length, the first-vehicle waiting duration, and the average vehicle speed extracted at each traffic intersection of the combined traffic intersection in the vehicle driving direction (the direction of entering the traffic intersection), and further include the current traffic phase and the traffic phase sequence of each traffic intersection of the combined traffic intersection. Similar to the meta-controller, in the controller each type of input has a corresponding fully connected layer, followed by a stateful LSTM network.

In the embodiment of the present application, the respective controllers may be trained using A2C, and thus the output of each controller is set to include two branches, namely an actor branch and a critic branch. For the actor branch, probability normalization processing is performed on the output result of the LSTM network (namely the memory transfer result) using a softmax activation function to obtain the policy, namely the probability of each action in the action set (or macro-action set); for the critic branch, linear regression (Linear) processing is performed on the output result of the LSTM network to obtain the estimated value (value).
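As an illustrative sketch only, the two output branches can be written as a softmax head and a linear head applied to the same memory transfer result. The weights, the vector sizes and the function names are hypothetical; the fully connected layers and the LSTM network that produce the memory transfer result are not modelled here.

```python
import math
from typing import List, Sequence, Tuple

def linear(weights: Sequence[Sequence[float]], bias: Sequence[float],
           x: Sequence[float]) -> List[float]:
    """Apply one dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def softmax(logits: Sequence[float]) -> List[float]:
    """Probability normalization; the maximum is subtracted for numerical stability."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def actor_critic_heads(memory: Sequence[float],
                       actor_w: Sequence[Sequence[float]], actor_b: Sequence[float],
                       critic_w: Sequence[float], critic_b: float) -> Tuple[List[float], float]:
    """Return (policy, value) computed from one memory transfer result."""
    policy = softmax(linear(actor_w, actor_b, memory))      # actor branch
    value = linear([critic_w], [critic_b], memory)[0]       # critic branch
    return policy, value

policy, value = actor_critic_heads(
    memory=[1.0, 2.0],
    actor_w=[[1.0, 0.0], [0.0, 1.0]], actor_b=[0.0, 0.0],
    critic_w=[0.5, -0.25], critic_b=1.0,
)
```

Because only the actor branch is needed to choose an action, the critic head can be dropped after training without changing the policy output.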

4) Network training.

In the embodiment of the application, a traffic flow simulation platform can be constructed with an open-source simulation tool to simulate urban road vehicle behaviors. Taking the SUMO tool as an example, the signal light state of the simulation environment can be controlled and the simulation data of the traffic intersection can be accessed through the TraCI API provided by the SUMO tool, where the simulation data includes the simulated lane state, the simulated traffic phase, and the simulated traffic phase sequence. For ease of understanding, the phase prediction period of each bottom-level controller is set to 15 seconds. If a green-to-red transition is involved when switching the traffic phase, the signal light is kept yellow for 5 seconds, i.e. green light -> yellow light (lasting 5 seconds) -> red light (this approach may also be applied to real-world traffic control). The sequence prediction period of the meta-controller is set to 60 phase prediction periods, i.e. 60 × 15 = 900 seconds. Each episode in the reinforcement learning process has a duration of 3600 seconds, where an episode is the period after which the simulator is restarted.
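As an illustrative sketch only, the timing values given above can be collected and checked as follows. The constant and function names are hypothetical, and taking the 5 seconds of yellow from the start of the new phase prediction period is an assumption.

```python
from typing import List, Tuple

PHASE_PERIOD_S = 15                                          # phase prediction period
PERIODS_PER_SEQUENCE = 60                                    # phase periods per sequence period
SEQUENCE_PERIOD_S = PHASE_PERIOD_S * PERIODS_PER_SEQUENCE    # 900 seconds
YELLOW_S = 5                                                 # yellow duration on green-to-red
EPISODE_S = 3600                                             # simulator restart interval

def decisions_per_episode() -> Tuple[int, int]:
    """Return (phase decisions, sequence decisions) made in one episode."""
    return EPISODE_S // PHASE_PERIOD_S, EPISODE_S // SEQUENCE_PERIOD_S

def light_plan(was_green: bool, will_be_green: bool,
               period_s: int = PHASE_PERIOD_S) -> List[Tuple[str, int]]:
    """Return (colour, seconds) segments for one signal during one phase period.

    A green-to-red switch is preceded by a yellow interval (assumed to be
    taken from the start of the period); other cases need no transition.
    """
    if was_green and not will_be_green:
        return [("yellow", YELLOW_S), ("red", period_s - YELLOW_S)]
    return [("green" if will_be_green else "red", period_s)]
```

With these values one episode contains 240 phase decisions and 4 sequence decisions.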

Here, a schematic diagram of model training by the reinforcement learning principle as shown in fig. 12 is provided, and will be described with reference to fig. 12.

Still taking the target traffic intersection k as an example, the bottom-level controller is trained in a similar manner to the meta-controller. In the embodiment of the present application, the training data of the model training is in units of batches (batch). A batch contains the empirical data of a number of periods (top-level empirical data for the meta-controller or bottom-level empirical data for the bottom-level controller). A period is a phase prediction period or a sequence prediction period: for the bottom-level controller, 15 s is one period; for the top-level controller, 60 phase prediction periods are one period. For each period, the empirical data comprises the intersection state of the combined traffic intersection at the beginning of the period, the action performed at the beginning of the period, and the reward obtained by executing that action (the reward of the bottom-level controller or of the top-level controller). For the bottom-level controller, the number of periods in a batch may be set to 60; for the meta-controller, it may be set to 4.

The loss function during training yields two loss values. It is defined in terms of the outputs of the critic branch and the actor branch, the probability of performing the action of a period under the intersection state of that period (corresponding to the training phase probability above), the weight parameters of the network, and the output of the critic branch computed with the weight parameters as they were before the current optimization.

For the bottom-level controller, the two loss values obtained correspond to the first phase loss value and the second phase loss value above, respectively; for the meta-controller, the two loss values obtained correspond to the first sequence loss value and the second sequence loss value above, respectively.
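As an illustrative sketch only, two loss values of this kind can be computed over one batch in the textbook A2C manner. The exact loss of the embodiment is not reproduced here: the discount factor `gamma`, the bootstrapped multi-step return, the squared-error value loss and all function names are assumptions.

```python
import math
from typing import List, Sequence, Tuple

def n_step_returns(rewards: Sequence[float], bootstrap_value: float,
                   gamma: float) -> List[float]:
    """Discounted return for every period of the batch, computed backwards.

    `bootstrap_value` is the critic's estimate for the state after the last
    period, taken from the network as it was before this update.
    """
    returns: List[float] = []
    running = bootstrap_value
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns

def a2c_losses(rewards: Sequence[float], values: Sequence[float],
               action_probs: Sequence[float], bootstrap_value: float,
               gamma: float = 0.99) -> Tuple[float, float]:
    """Return (value loss, policy loss) averaged over the batch.

    values[i]       critic output for the state of period i
    action_probs[i] actor probability of the action executed in period i
    """
    returns = n_step_returns(rewards, bootstrap_value, gamma)
    advantages = [g - v for g, v in zip(returns, values)]
    n = len(rewards)
    value_loss = sum(a * a for a in advantages) / n
    # The advantage acts as a fixed weight on the log-probability.
    policy_loss = -sum(math.log(p) * a for p, a in zip(action_probs, advantages)) / n
    return value_loss, policy_loss

value_loss, policy_loss = a2c_losses(
    rewards=[1.0, 1.0], values=[1.5, 1.0],
    action_probs=[0.5, 0.5], bootstrap_value=0.0, gamma=1.0,
)
```

In this reading the value loss plays the role of the first loss value (reward and value only) and the policy loss plays the role of the second loss value (reward, value and action probability).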

The completion condition of the model training is not limited in the embodiment of the present application; for example, training may be completed when the number of completed episodes reaches a number threshold, such as 10⁵. After model training is completed, for each controller, the critic branch of the network architecture, which is used only for training, may be discarded. Thus, traffic control can be performed on the target traffic intersection k in the real environment according to the trained meta-controller and the trained controllers corresponding to the target traffic intersection k.

It should be noted that the network architecture, the intersection state, the reward and the like of the controller shown above do not constitute limitations on the embodiment of the present application, and can be adaptively adjusted according to requirements in an actual application scenario, so as to improve a traffic control effect in the actual application scenario. In addition, in the above example, the model training is performed by using A2C method, and according to the requirements in the actual application scenario, other training algorithms may also be used for performing model training, such as DQN algorithm, A3C algorithm, and the like.

The embodiment of the application can be applied to a traffic intersection cooperative control scene of a regional or urban road network, and can effectively reduce the congestion degree of the traffic intersection, reduce the number of vehicle stops, and improve the passing rate and the speed of vehicles at the traffic intersection.

Continuing with the exemplary structure in which the traffic control device 455 provided by the embodiments of the present application is implemented as software modules, in some embodiments, as shown in fig. 2, the software modules of the traffic control device 455 stored in the memory 450 may include: a sequence selection module 4551, configured to perform sequence prediction processing according to the current lane state of the target traffic intersection and the traffic phase sequence, so as to obtain sequence probabilities respectively corresponding to a plurality of traffic phase sequences; wherein the traffic phase sequence comprises a plurality of traffic phases having a sequence; the sequence selection module 4551 is further configured to use the traffic phase sequence with the largest sequence probability as the target traffic phase sequence; a phase selection module 4552, configured to perform phase prediction processing according to the current lane state, the traffic phase, and the traffic phase sequence of the target traffic intersection, so as to obtain phase probabilities respectively corresponding to the current traffic phase and the candidate traffic phase of the target traffic intersection; wherein the candidate traffic phase represents a traffic phase in the target traffic phase sequence that is subsequent to the current traffic phase; the phase selection module 4552 is further configured to use the traffic phase with the largest phase probability as the target traffic phase, and apply the target traffic phase at the target traffic intersection.
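As an illustrative sketch only, the phase selection step can be expressed as a choice between keeping the current traffic phase and switching to the candidate traffic phase. Treating the candidate as the immediately following phase, wrapping around at the end of the sequence, and the phase names and function names are assumptions.

```python
from typing import Dict, Sequence

def candidate_phase(sequence: Sequence[str], current_phase: str) -> str:
    """Return the phase that follows the current phase in the target sequence.

    The sequence is treated as cyclic (an assumption), so the phase after the
    last entry is the first entry.
    """
    position = sequence.index(current_phase)
    return sequence[(position + 1) % len(sequence)]

def select_phase(sequence: Sequence[str], current_phase: str,
                 phase_probs: Dict[str, float]) -> str:
    """Pick the option (keep current or switch to candidate) with the largest probability."""
    options = [current_phase, candidate_phase(sequence, current_phase)]
    return max(options, key=lambda phase: phase_probs[phase])

target_sequence = ("A", "B", "C", "D")
switch = select_phase(target_sequence, "B", {"B": 0.3, "C": 0.7})
keep = select_phase(target_sequence, "B", {"B": 0.6, "C": 0.4})
```

Restricting the choice to these two options keeps the phases in the order fixed by the selected traffic phase sequence.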

In some embodiments, the sequence selection module 4551 is further configured to: performing sequence prediction processing on the current lane state and traffic phase sequence of the target traffic intersection through a sequence control model; the phase selection module 4552 is further configured to: performing phase prediction processing on the current lane state, the traffic phase and the traffic phase sequence of the target traffic intersection through a phase control model corresponding to the target traffic phase sequence; the traffic phase sequences respectively correspond to a phase control model.

In some embodiments, the traffic control device 455 further includes a reinforcement learning module configured to: carry out environment simulation processing on the target traffic intersection to obtain a simulated lane state, a simulated traffic phase and a simulated traffic phase sequence; perform sequence prediction processing on the current simulated lane state and the simulated traffic phase sequence of the target traffic intersection through a sequence control model to obtain training sequence probabilities respectively corresponding to a plurality of traffic phase sequences; take the traffic phase sequence with the maximum training sequence probability as a training traffic phase sequence; perform phase prediction processing on the current simulated lane state, the simulated traffic phase and the simulated traffic phase sequence of the target traffic intersection through a phase control model corresponding to the training traffic phase sequence to obtain training phase probabilities respectively corresponding to the current simulated traffic phase and the training candidate traffic phase of the target traffic intersection; wherein the training candidate traffic phase represents a traffic phase in the training traffic phase sequence that is located after the current simulated traffic phase; take the traffic phase with the maximum training phase probability as a training target traffic phase, and apply the training target traffic phase in simulation at the target traffic intersection; determine a new simulated lane state obtained after the training target traffic phase is applied in simulation at the target traffic intersection, and determine a control reward according to the new simulated lane state; and perform reinforcement learning on the sequence control model and the phase control model corresponding to the training traffic phase sequence according to the control reward.

In some embodiments, the reinforcement learning module is further to: performing sequence value prediction processing on the current simulated lane state and the simulated traffic phase sequence of the target traffic intersection through a sequence control model to obtain a sequence value; performing phase value prediction processing on the current simulated lane state, the simulated traffic phase and the simulated traffic phase sequence of the target traffic intersection through a phase control model corresponding to the training traffic phase sequence to obtain a phase value; the sequence value, the phase value and the control reward are commonly used for carrying out reinforcement learning on the sequence control model and the phase control model corresponding to the training traffic phase sequence.

In some embodiments, the reinforcement learning module is further to: determining a first phase loss value according to the control reward and the phase value, and determining a second phase loss value according to the control reward, the phase value and the training phase probability corresponding to the training target traffic phase; training a phase control model corresponding to the training traffic phase sequence according to the first phase loss value and the second phase loss value; determining a first sequence loss value according to the control reward and the sequence value, and determining a second sequence loss value according to the control reward, the sequence value and the training sequence probability corresponding to the training traffic phase sequence; and training the sequence control model according to the first sequence loss value and the second sequence loss value.

In some embodiments, the reinforcement learning module is further configured to perform any one of: sequentially training phase control models corresponding to the training traffic phase sequence according to the first phase loss value and the second phase loss value; the training priority of the first phase loss value is greater than that of the second phase loss value, or the training priority of the second phase loss value is greater than that of the first phase loss value; and performing loss value fusion processing on the first phase loss value and the second phase loss value, and training a phase control model corresponding to the training traffic phase sequence according to the obtained fusion loss value.

In some embodiments, the sequence control model includes a fully-connected network and a memory transfer network, the fully-connected network including fully-connected sub-networks corresponding to a current simulated lane state and a simulated traffic phase sequence of the target traffic intersection, respectively; the reinforcement learning module is further configured to: perform full-connection processing on the current simulated lane state of the target traffic intersection through the fully-connected sub-network corresponding to the current simulated lane state of the target traffic intersection to obtain a full-connection result corresponding to the current simulated lane state of the target traffic intersection; perform full-connection processing on the current simulated traffic phase sequence of the target traffic intersection through the fully-connected sub-network corresponding to the current simulated traffic phase sequence of the target traffic intersection to obtain a full-connection result corresponding to the current simulated traffic phase sequence of the target traffic intersection; perform memory transfer processing, through the memory transfer network, on the full-connection results respectively corresponding to the current simulated lane state and the simulated traffic phase sequence of the target traffic intersection to obtain a memory transfer result; perform probability normalization processing on the memory transfer result to obtain training sequence probabilities respectively corresponding to the plurality of traffic phase sequences; and perform linear regression processing on the memory transfer result to obtain a sequence value.

In some embodiments, the reinforcement learning module is further to: performing sequence prediction processing on the current simulated lane states and simulated traffic phase sequences of a plurality of traffic intersections in the combined traffic intersection through a sequence control model; wherein the combined traffic intersection comprises a target traffic intersection and an adjacent traffic intersection of the target traffic intersection; and performing phase prediction processing on the current simulated lane states, simulated traffic phases and simulated traffic phase sequences of a plurality of traffic intersections in the combined traffic intersection through a phase control model corresponding to the training traffic phase sequence.

In some embodiments, the reinforcement learning module is further to: determining the distance weight corresponding to any one traffic intersection according to the distance between any one traffic intersection in the combined traffic intersections and the target traffic intersection; wherein the distance weight is inversely related to the distance; and weighting the new simulated lane state according to the distance weights respectively corresponding to the plurality of traffic intersections in the combined traffic intersection to obtain the control reward.
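As an illustrative sketch only, a distance weight that decreases with distance can be used to combine per-intersection scores into one control reward. The weight formula 1 / (1 + distance), the normalization, the per-intersection scores and the function names are assumptions.

```python
from typing import Dict

def distance_weights(distances: Dict[str, float]) -> Dict[str, float]:
    """Map each intersection to a weight that shrinks as its distance grows.

    The target intersection itself has distance 0 and therefore the largest
    weight; the weights are normalized to sum to 1.
    """
    raw = {node: 1.0 / (1.0 + d) for node, d in distances.items()}
    total = sum(raw.values())
    return {node: w / total for node, w in raw.items()}

def weighted_control_reward(local_scores: Dict[str, float],
                            distances: Dict[str, float]) -> float:
    """Weight the score derived from each intersection's new lane state."""
    weights = distance_weights(distances)
    return sum(weights[node] * local_scores[node] for node in distances)

weights = distance_weights({"k": 0.0, "j": 1.0})
reward = weighted_control_reward({"k": -3.0, "j": -6.0}, {"k": 0.0, "j": 1.0})
```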

In some embodiments, the reinforcement learning module is further configured to perform any one of the following processes for the target traffic intersection: taking a traffic intersection communicated with the target traffic intersection as an adjacent traffic intersection, wherein no other traffic intersection is located between the target traffic intersection and the communicated traffic intersection; taking a traffic intersection whose distance from the target traffic intersection is smaller than a distance threshold as an adjacent traffic intersection; obtaining a plurality of vehicle driving records that include the target traffic intersection, and taking a traffic intersection that is different from the target traffic intersection and whose occurrence frequency in the vehicle driving records is greater than a frequency threshold as an adjacent traffic intersection.
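As an illustrative sketch only, the distance-based and record-based ways of determining adjacent traffic intersections can be written as follows. The planar coordinates, the straight-line distance, counting each intersection at most once per driving record and the function names are assumptions.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple

def neighbours_by_distance(target: str, positions: Dict[str, Tuple[float, float]],
                           threshold: float) -> List[str]:
    """Intersections closer to the target than the distance threshold."""
    tx, ty = positions[target]
    return sorted(
        node for node, (x, y) in positions.items()
        if node != target and math.hypot(x - tx, y - ty) < threshold
    )

def neighbours_by_records(target: str, records: Sequence[Sequence[str]],
                          frequency_threshold: int) -> List[str]:
    """Intersections that co-occur with the target in more records than the threshold."""
    counts: Counter = Counter()
    for record in records:
        if target in record:                      # only records that include the target
            counts.update(set(record) - {target})
    return sorted(node for node, c in counts.items() if c > frequency_threshold)

positions = {"k": (0.0, 0.0), "a": (3.0, 4.0), "b": (30.0, 40.0)}
records = [["a", "k", "b"], ["a", "k"], ["c", "k"], ["a", "b"]]
near = neighbours_by_distance("k", positions, threshold=10.0)
frequent = neighbours_by_records("k", records, frequency_threshold=1)
```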

In some embodiments, the new simulated lane state includes first state data and second state data; the congestion degree of the target traffic intersection is negatively correlated with the first state data and positively correlated with the second state data; the reinforcement learning module is further configured to: perform state data fusion processing on the first state data and the second state data to obtain the control reward; wherein the control reward is positively correlated with the first state data and negatively correlated with the second state data.
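As an illustrative sketch only, one possible state data fusion is a weighted difference. Treating the average vehicle speed as first state data and the fleet length and waiting duration as second state data, as well as the linear form, the weights and the function name, are assumptions.

```python
from typing import Sequence

def control_reward(first_state: Sequence[float], second_state: Sequence[float],
                   first_weight: float = 1.0, second_weight: float = 1.0) -> float:
    """Fuse state data so the reward rises with the first kind and falls with the second.

    first_state:  data that is larger when congestion is lower (e.g. average speed)
    second_state: data that is larger when congestion is higher (e.g. fleet length, waiting time)
    """
    if first_weight <= 0 or second_weight <= 0:
        raise ValueError("weights must be positive to keep the stated correlations")
    return first_weight * sum(first_state) - second_weight * sum(second_state)

baseline = control_reward([10.0], [4.0, 2.0])
faster = control_reward([12.0], [4.0, 2.0])
longer_queue = control_reward([10.0], [6.0, 2.0])
```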

In some embodiments, the sequence selection module 4551 is further configured to perform sequence prediction processing according to a current lane state of the target traffic intersection and the traffic phase sequence when the sequence prediction period arrives; the phase selection module 4552 is further configured to perform phase prediction processing according to the current lane state, the traffic phase, and the traffic phase sequence of the target traffic intersection when the phase prediction period arrives; wherein the sequence prediction period is greater than the phase prediction period.
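As an illustrative sketch only, the two prediction periods can be driven from a single loop in which a sequence decision is made whenever the sequence prediction period arrives and a phase decision is made at every phase prediction period. The callbacks, the log format and the requirement that the sequence period be a whole multiple of the phase period are assumptions.

```python
from typing import Any, Callable, List, Tuple

def run_schedule(total_s: int, phase_period_s: int, sequence_period_s: int,
                 choose_sequence: Callable[[int], Any],
                 choose_phase: Callable[[int, Any], Any]) -> List[Tuple[int, str, Any]]:
    """Return a log of (time, kind, decision) entries for one run."""
    if sequence_period_s <= phase_period_s or sequence_period_s % phase_period_s:
        raise ValueError("sequence period must be a larger multiple of the phase period")
    log: List[Tuple[int, str, Any]] = []
    sequence = None
    for t in range(0, total_s, phase_period_s):
        if t % sequence_period_s == 0:            # sequence prediction period arrives
            sequence = choose_sequence(t)
            log.append((t, "sequence", sequence))
        phase = choose_phase(t, sequence)         # phase prediction period arrives
        log.append((t, "phase", phase))
    return log

log = run_schedule(
    total_s=60, phase_period_s=15, sequence_period_s=30,
    choose_sequence=lambda t: t // 30,
    choose_phase=lambda t, seq: (seq, t),
)
```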

In some embodiments, the traffic control device 455 further includes a lane status determining module for combining the current fleet length, the vehicle speed, and the vehicle waiting time of the incoming lane at the target traffic intersection to obtain the current lane status at the target traffic intersection.

In some embodiments, the sequence selection module 4551 is further configured to perform a sequence prediction process according to current lane states and traffic phase sequences of a plurality of traffic intersections in the combined traffic intersection; wherein the combined traffic intersection comprises a target traffic intersection and an adjacent traffic intersection of the target traffic intersection; the phase selection module 4552 is further configured to perform phase prediction processing according to current lane states, traffic phases and traffic phase sequences of multiple traffic intersections in the combined traffic intersection.

Embodiments of the present application provide a computer program product or computer program comprising computer instructions (i.e., executable instructions) stored in a computer readable storage medium. The processor of the electronic device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the electronic device executes the traffic control method according to the embodiment of the present application.

Embodiments of the present application provide a computer-readable storage medium storing executable instructions which, when executed by a processor, cause the processor to execute the traffic control method provided by the embodiments of the present application.

In some embodiments, the computer-readable storage medium may be memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash, magnetic surface memory, optical disk, or CD-ROM; or may be various devices including one or any combination of the above memories.

In some embodiments, executable instructions may be written in any form of programming language (including compiled or interpreted languages), in the form of programs, software modules, scripts or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.

By way of example, executable instructions may correspond, but do not necessarily have to correspond, to files in a file system, and may be stored in a portion of a file that holds other programs or data, such as in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).

As an example, executable instructions may be deployed to be executed on one electronic device or on multiple electronic devices located at one site or distributed across multiple sites and interconnected by a communication network.

The above description is only an example of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims

1. A traffic control method, characterized in that the method comprises:

when the sequence prediction period arrives, performing sequence prediction processing according to the current lane state of the target traffic intersection and the traffic phase sequence to obtain sequence probabilities corresponding to the traffic phase sequences respectively; wherein the traffic phase sequence comprises a plurality of traffic phases having a sequence; the sequence probability represents the matching degree of the corresponding traffic phase sequence and the current situation of the target traffic intersection;

when the phase prediction period arrives, performing phase prediction processing according to the current lane state, the traffic phase and the traffic phase sequence of the target traffic intersection to obtain phase probabilities respectively corresponding to the current traffic phase and the candidate traffic phase of the target traffic intersection; wherein the candidate traffic phase represents a traffic phase in the sequence of target traffic phases that is subsequent to the current traffic phase; the phase probability represents the matching degree of the corresponding traffic phase and the current situation of the target traffic intersection; the sequence prediction period is greater than the phase prediction period;

2. The method of claim 1, wherein the performing the sequence prediction process according to the current lane status and the traffic phase sequence at the target traffic intersection comprises:

performing sequence prediction processing on the current lane state and traffic phase sequence of the target traffic intersection through a sequence control model;

the phase prediction processing according to the current lane state, the traffic phase and the traffic phase sequence of the target traffic intersection comprises the following steps:

performing phase prediction processing on the current lane state, the traffic phase and the traffic phase sequence of the target traffic intersection through a phase control model corresponding to the target traffic phase sequence;

the plurality of traffic phase sequences respectively correspond to one phase control model.

3. The method of claim 2, wherein before performing the sequence prediction processing on the current lane status and traffic phase sequence of the target traffic intersection by the sequence control model, the method further comprises:

carrying out environment simulation processing on the target traffic intersection to obtain a simulated lane state, a simulated traffic phase and a simulated traffic phase sequence;

when the sequence prediction period is reached, performing sequence prediction processing on the current simulated lane state and the simulated traffic phase sequence of the target traffic intersection through the sequence control model to obtain training sequence probabilities respectively corresponding to the plurality of traffic phase sequences;

taking the traffic phase sequence with the maximum probability of the training sequence as a training traffic phase sequence;

when the phase prediction period is reached, performing phase prediction processing on the current simulated lane state, the current simulated traffic phase and the current simulated traffic phase sequence of the target traffic intersection through a phase control model corresponding to the training traffic phase sequence to obtain training phase probabilities respectively corresponding to the current simulated traffic phase and the training candidate traffic phase of the target traffic intersection;

wherein the training candidate traffic phase represents a traffic phase in the training traffic phase sequence that is subsequent to the current simulated traffic phase;

taking the traffic phase with the maximum probability of the training phase as a training target traffic phase, and simulating and applying the training target traffic phase at the target traffic intersection;

determining a new simulated lane state obtained after the training target traffic phase is applied to the target traffic intersection in a simulated manner, and determining control reward according to the new simulated lane state;

and performing reinforcement learning on the sequence control model and the phase control model corresponding to the training traffic phase sequence according to the control reward.

4. The method of claim 3, wherein when the sequence prediction processing is performed on the current simulated lane states and simulated traffic phase sequence at the target traffic intersection by the sequence control model, the method further comprises:

performing sequence value prediction processing on the current simulated lane state and the simulated traffic phase sequence of the target traffic intersection through the sequence control model to obtain a sequence value;

when the phase prediction processing is performed on the current simulated lane state, the current simulated traffic phase and the current simulated traffic phase sequence of the target traffic intersection through the phase control model corresponding to the training traffic phase sequence, the method further comprises the following steps:

performing phase value prediction processing on the current simulated lane state, the current simulated traffic phase and the current simulated traffic phase sequence of the target traffic intersection through a phase control model corresponding to the training traffic phase sequence to obtain a phase value;

5. The method of claim 4, wherein the reinforcement learning of the sequence control model and the phase control model corresponding to the training traffic phase sequence according to the control reward comprises:

determining a first phase loss value according to the control reward and the phase value, and determining a second phase loss value according to the control reward, the phase value and a training phase probability corresponding to the training target traffic phase;

training a phase control model corresponding to the training traffic phase sequence according to the first phase loss value and the second phase loss value;

determining a first sequence loss value according to the control reward and the sequence value, and determining a second sequence loss value according to the control reward, the sequence value and the training sequence probability corresponding to the training traffic phase sequence;

and training the sequence control model according to the first sequence loss value and the second sequence loss value.

6. The method of claim 5, wherein the training the phase control model corresponding to the training traffic phase sequence according to the first phase loss value and the second phase loss value comprises:

any one of the following processes is performed:

sequentially training phase control models corresponding to the training traffic phase sequence according to the first phase loss value and the second phase loss value;

wherein the training priority of the first phase loss value is greater than the training priority of the second phase loss value, or the training priority of the second phase loss value is greater than the training priority of the first phase loss value;

and performing loss value fusion processing on the first phase loss value and the second phase loss value, and training a phase control model corresponding to the training traffic phase sequence according to the obtained fusion loss value.

7. The method of claim 4, wherein the sequence control model comprises a fully-connected network and a memory transfer network, the fully-connected network comprising fully-connected sub-networks corresponding to a current simulated lane state and a simulated traffic phase sequence at the target traffic intersection, respectively;

the sequence prediction processing is performed on the current simulated lane state and the simulated traffic phase sequence of the target traffic intersection through the sequence control model to obtain training sequence probabilities respectively corresponding to the plurality of traffic phase sequences, and the method comprises the following steps:

performing full-connection processing on the current simulated lane state of the target traffic intersection through a full-connection sub-network corresponding to the current simulated lane state of the target traffic intersection to obtain a full-connection result corresponding to the current simulated lane state of the target traffic intersection;

performing full-connection processing on the current simulated traffic phase sequence of the target traffic intersection through a full-connection sub-network corresponding to the current simulated traffic phase sequence of the target traffic intersection to obtain a full-connection result corresponding to the current simulated traffic phase sequence of the target traffic intersection;

carrying out memory transfer processing on the current simulated lane state of the target traffic intersection and the full-connection result respectively corresponding to the simulated traffic phase sequence through the memory transfer network to obtain a memory transfer result;

carrying out probability normalization processing on the memory transfer result to obtain training sequence probabilities respectively corresponding to the plurality of traffic phase sequences;

wherein performing sequence value prediction processing on the current simulated lane state and the simulated traffic phase sequence of the target traffic intersection through the sequence control model, to obtain a sequence value, comprises:

performing linear regression processing on the memory transfer result to obtain the sequence value.
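The forward pass recited in this claim can be sketched as follows. This is a minimal pure-Python illustration, not the claimed model: it assumes a simple tanh recurrent cell as a stand-in for the memory transfer network, a one-hot encoding of the traffic phase sequence, and randomly initialized weights. The class name `SequenceControlModel` and all dimensions are hypothetical.

```python
import math
import random


def linear(x, w, b):
    """Affine map y = W x + b on plain Python lists."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(w, b)]


def dense(x, w, b):
    """Fully-connected layer followed by ReLU."""
    return [max(0.0, v) for v in linear(x, w, b)]


def softmax(z):
    """Probability normalization, shifted by max(z) for stability."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def init(rows, cols, rng):
    """Random weight matrix and zero bias (illustrative only)."""
    w = [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    return w, [0.0] * rows


class SequenceControlModel:
    def __init__(self, lane_dim, seq_dim, hidden, n_sequences, seed=0):
        rng = random.Random(seed)
        self.fc_lane = init(hidden, lane_dim, rng)   # sub-network for lane state
        self.fc_seq = init(hidden, seq_dim, rng)     # sub-network for phase sequence
        # Assumed recurrent cell: input is [lane_fc, seq_fc, previous hidden].
        self.rec = init(hidden, 3 * hidden, rng)
        self.policy = init(n_sequences, hidden, rng)  # head for sequence probabilities
        self.value = init(1, hidden, rng)             # linear regression head
        self.h = [0.0] * hidden                       # memory carried across calls

    def forward(self, lane_state, phase_sequence):
        a = dense(lane_state, *self.fc_lane)
        b = dense(phase_sequence, *self.fc_seq)
        self.h = [math.tanh(v) for v in linear(a + b + self.h, *self.rec)]
        probs = softmax(linear(self.h, *self.policy))
        value = linear(self.h, *self.value)[0]
        return probs, value


model = SequenceControlModel(lane_dim=4, seq_dim=3, hidden=8, n_sequences=3)
probs, value = model.forward([0.2, 0.5, 0.1, 0.9], [1.0, 0.0, 0.0])
```

Both heads read the same memory transfer result, so the sequence probabilities and the sequence value come from a single pass. The hidden state `self.h` persists between calls, which is how this sketch models the memory carried from one prediction to the next.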

8. The method of claim 3, wherein the performing, by the sequence control model, of sequence prediction processing on the current simulated lane state and the simulated traffic phase sequence of the target traffic intersection comprises:

performing sequence prediction processing on the current simulated lane states and simulated traffic phase sequences of a plurality of traffic intersections in the combined traffic intersection through the sequence control model;

wherein the combined traffic intersection comprises the target traffic intersection and a traffic intersection adjacent to the target traffic intersection;

wherein performing phase prediction processing on the current simulated lane state, the simulated traffic phase and the simulated traffic phase sequence of the target traffic intersection through the phase control model corresponding to the training traffic phase sequence comprises:

and performing phase prediction processing on the current simulated lane states, simulated traffic phases and simulated traffic phase sequences of a plurality of traffic intersections in the combined traffic intersection through the phase control model corresponding to the training traffic phase sequence.

9. The method of claim 8, wherein determining a control reward according to the new simulated lane state comprises:

determining, for any traffic intersection in the combined traffic intersection, a corresponding distance weight according to the distance between that traffic intersection and the target traffic intersection; wherein the distance weight is negatively correlated with the distance;

and weighting the new simulated lane state according to the distance weights respectively corresponding to the plurality of traffic intersections in the combined traffic intersection to obtain the control reward.
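The distance weighting above can be sketched as follows. The claim only requires that the weight decrease with distance; the inverse form `1 / (1 + d)`, the normalization, and the use of queue length as the lane state are assumptions made for illustration, and the function names are hypothetical.

```python
def distance_weights(distances):
    """Weights that decrease with distance and sum to 1.

    Assumption: w_i is proportional to 1 / (1 + d_i); any other
    decreasing function would also satisfy the negative correlation.
    """
    raw = [1.0 / (1.0 + d) for d in distances]
    total = sum(raw)
    return [r / total for r in raw]


def control_reward(queue_lengths, distances):
    """Distance-weighted reward over the combined traffic intersection.

    Assumption: the lane state of each intersection is summarized by its
    queue length, so a longer weighted queue gives a lower reward.
    """
    weights = distance_weights(distances)
    return -sum(w * q for w, q in zip(weights, queue_lengths))


# Target intersection at distance 0, two neighbors at distances 1 and 3.
reward = control_reward([4.0, 2.0, 8.0], [0.0, 1.0, 3.0])
```

With these inputs the target intersection carries the largest weight, so congestion there dominates the reward while distant neighbors contribute less.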

10. The method of claim 3, wherein the new simulated lane state comprises first state data and second state data; wherein the degree of congestion at the target traffic intersection is negatively correlated with the first state data and positively correlated with the second state data;

the determining a control reward according to the new simulated lane state comprises:

performing state data fusion processing on the first state data and the second state data to obtain the control reward;

wherein the control reward is positively correlated with the first state data and negatively correlated with the second state data.
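One way to satisfy the stated correlations is a signed linear combination, sketched below. The linear form, the coefficients `alpha` and `beta`, and the example quantities (vehicles passed as first state data, queue length as second state data) are assumptions; the claim does not name a specific fusion function.

```python
def fused_reward(first_state, second_state, alpha=1.0, beta=1.0):
    """Control reward = alpha * first_state - beta * second_state.

    Assumption: positive coefficients, so the reward rises with the first
    state data and falls with the second state data.
    """
    if alpha <= 0 or beta <= 0:
        raise ValueError("alpha and beta must be positive")
    return alpha * first_state - beta * second_state


# Example: 12 vehicles passed, 5 vehicles queued.
r = fused_reward(12.0, 5.0)
```

Requiring `alpha` and `beta` to be positive keeps the sign of each correlation fixed regardless of how the two terms are scaled against each other.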

11. The method according to any one of claims 1 to 10, wherein the performing of sequence prediction processing according to the current lane state and the traffic phase sequence of the target traffic intersection comprises:

performing sequence prediction processing according to the current lane states and traffic phase sequences of a plurality of traffic intersections in the combined traffic intersection;

and performing phase prediction processing according to the current lane states, traffic phases and traffic phase sequences of a plurality of traffic intersections in the combined traffic intersection.

12. A traffic control device, characterized in that the device comprises:

the sequence selection module is used for carrying out sequence prediction processing according to the current lane state of the target traffic intersection and the traffic phase sequence when a sequence prediction period arrives to obtain sequence probabilities respectively corresponding to the traffic phase sequences; wherein the traffic phase sequence comprises a plurality of traffic phases having a sequence; the sequence probability represents the matching degree of the corresponding traffic phase sequence and the current situation of the target traffic intersection;

the phase selection module is used for carrying out phase prediction processing according to the current lane state, the traffic phase and the traffic phase sequence of the target traffic intersection when a phase prediction period arrives to obtain phase probabilities respectively corresponding to the current traffic phase and the candidate traffic phase of the target traffic intersection; wherein the candidate traffic phase represents a traffic phase in the sequence of target traffic phases that is subsequent to the current traffic phase; the phase probability represents the matching degree of the corresponding traffic phase and the current situation of the target traffic intersection; the sequence prediction period is greater than the phase prediction period;
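The interplay of the two prediction periods can be sketched as a simple control loop. This is an illustrative scheduler only: time is assumed to advance in integer steps, and `pick_sequence` and `pick_phase` are hypothetical callbacks standing in for the sequence selection module and the phase selection module.

```python
def run_control(total_steps, sequence_period, phase_period,
                pick_sequence, pick_phase):
    """Invoke the two selection callbacks on their respective periods.

    The sequence prediction period must be greater than the phase
    prediction period, as the claim requires.
    """
    if sequence_period <= phase_period:
        raise ValueError("sequence period must exceed phase period")
    log = []
    sequence = None
    phase = None
    for t in range(total_steps):
        if t % sequence_period == 0:   # sequence prediction period arrives
            sequence = pick_sequence(t)
            log.append(("sequence", t, sequence))
        if t % phase_period == 0:      # phase prediction period arrives
            phase = pick_phase(t, sequence, phase)
            log.append(("phase", t, phase))
    return log


# Stub callbacks that only label their decisions.
events = run_control(
    total_steps=6, sequence_period=4, phase_period=2,
    pick_sequence=lambda t: f"seq@{t}",
    pick_phase=lambda t, seq, cur: f"{seq}/phase@{t}",
)
```

Because the sequence period is the longer one, several phase decisions are made under each selected traffic phase sequence before the sequence itself is reconsidered.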

13. An electronic device, comprising:

a memory for storing executable instructions;

a processor for implementing the traffic control method of any of claims 1 to 11 when executing executable instructions stored in the memory.

14. A computer-readable storage medium storing executable instructions for implementing the traffic control method of any one of claims 1 to 11 when executed by a processor.