CN112991750B - Local traffic optimization method based on reinforcement learning and generative adversarial network - Google Patents
Local traffic optimization method based on reinforcement learning and generative adversarial network
- Publication number
- CN112991750B (application CN202110526842.0A)
- Authority
- CN
- China
- Prior art keywords
- traffic
- training
- learning
- local traffic
- network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G08—SIGNALLING
- G08G—TRAFFIC CONTROL SYSTEMS
- G08G1/00—Traffic control systems for road vehicles
- G08G1/01—Detecting movement of traffic to be counted or controlled
- G08G1/0104—Measuring and analyzing of parameters relative to traffic conditions
- G08G1/0125—Traffic data processing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
A local traffic optimization method based on reinforcement learning and a generative adversarial network. A training model is established, a generative adversarial network is used to automatically improve the model's accuracy, and traffic flow data at a specified time is predicted by training on real traffic flow data detected at an intersection. Q-learning is used to train on the real and virtual traffic flow data and output actions, forming a Q-value table; a reward function is used to obtain the optimal local traffic optimization strategy. Exploiting the interactive nature of reinforcement learning greatly improves the efficiency of traffic-signal cycle adjustment: the method checks whether congestion is relieved by adjusting the signal timing ratio against the current congestion level at the intersection, and iterates until the optimal green-time ratio is found. Inspired by the self-play idea of generative adversarial networks, Q-learning can be optimally trained in limited time, realizing local traffic optimization and finally yielding the best adjustment scheme, thereby improving local traffic optimization capability.
Description
Technical Field
The invention belongs to the field of traffic optimization, and in particular relates to a local traffic optimization method based on reinforcement learning and a generative adversarial network.
Background
Traditional local traffic optimization methods include several typical control systems such as TRANSYT and SCOOT: signal timing is optimized mainly from real-time data obtained by vehicle-detection equipment, and control is realized through various communication and signal-control devices.
At present, various artificial-intelligence methods are applied to traffic control and optimization; however, they have limitations for local traffic optimization. A local traffic system is large and complex: the extensive empirical knowledge and knowledge bases required by expert systems are difficult to build, and traffic parameters are not easily described through qualitative knowledge and relations. Traditional artificial neural networks tend to fall into local optima because of the way learning samples are traversed, so they must be combined with other methods to improve generalization. Existing methods work well for traffic optimization at a single intersection, but their capacity is clearly insufficient for complex road sections and local traffic control. Designing an optimization scheme that can efficiently solve the local traffic problem is therefore of great significance.
Disclosure of Invention
The invention aims to provide a local traffic optimization method based on reinforcement learning and a generative adversarial network.
In order to solve the technical problem, the invention adopts the following technical scheme. A local traffic optimization method based on reinforcement learning and a generative adversarial network comprises the following steps:

S1, establishing a training model and using a generative adversarial network to speed up model training: the real traffic-flow state set S detected at an intersection is taken as input, and virtual traffic flow data is output.

S2, training on the real and virtual traffic flow data with Q-learning and outputting an action set to form a Q-value table, with the formula

\( Q(s, a) = f(s; \theta) \)

where \( \theta \) are the parameters of a neural network that takes the state set s as input and outputs the action-value function Q corresponding to each action, yielding a local traffic optimization scheme. The scheme is trained with a reward function, and the return value of the previous action is computed by the reward formula

\( R = c \sum_i \frac{f_i}{F} \, (\bar{v}_i - v_0) \)

where c is a constant, \( \bar{v}_i \) is the average vehicle speed of the lane with lane number i, \( f_i \) is the traffic flow of lane i, F is the total flow of all lanes in the local traffic network, and \( v_0 \) is a set standard average speed: a calculated speed above it gives a positive return and one below it a negative return. The optimal strategy, i.e. the set of all optimal actions, is found with the learning-algorithm update

\( Q(s, a) \leftarrow Q(s, a) + \alpha \left[ R + \gamma \max_{a'} Q(s', a') - Q(s, a) \right] \)

where \( \alpha \) is the learning rate (the larger \( \alpha \), the more the update is influenced by the next state), R is the reward value, \( \max_{a'} Q(s', a') \) represents the greedy selection over the next state set, and \( \gamma \) is the discount rate (the smaller \( \gamma \), the more the update is influenced by the immediate reward), thereby obtaining the best local traffic optimization scheme.
In some embodiments, the specific steps of step S1 are: establish a generative adversarial network model and initialize the generator and the discriminator; during training, fix one party while updating the parameters of the other network, alternately iterating so as to maximize the other party's error, until a virtual data distribution the same as the real data distribution is generated.
In some embodiments, the fixed party during generative adversarial network training is the generator.
In some embodiments, the state set S of the intersection is the set of all states \( s_t \) at time t; the state \( s_t \) is the traffic flow of all lanes at the one-way exits of the intersection at time t. An action, i.e. a Q value, is a cycle adjustment, where one cycle is one traffic-light switch; the action set is the set of all Q values, and the action return value R is the vehicle speed on the road.
In some embodiments, there are four intersection states: north-south straight, north-south left turn, east-west straight, and east-west left turn. A 1 means the green light allows passage and a 0 means the red light forbids it, so the four states have four actions, represented by one-hot binary arrays: [1,0,0,0], [0,1,0,0], [0,0,1,0], and [0,0,0,1]. Time control of the traffic signal is simulated by changing the input array, with one second as the unit.
The scope of the present invention is not limited to the specific combinations of the above-described features, and other embodiments in which the above-described features or their equivalents are arbitrarily combined are also intended to be encompassed. For example, the above features and the technical features (but not limited to) having similar functions disclosed in the present application are mutually replaced to form the technical solution.
Due to the application of the technical scheme, compared with the prior art, the invention has the following advantages:
the invention utilizes the advantages of reinforcement learning interactive learning to set period adjustment as action and set traffic flow and local traffic operation condition as state and return, greatly improves the efficiency of traffic signal lamp period adjustment, trains a model through basic data, obtains corresponding reward by state and action, namely checks whether the congestion condition is relieved or not by adjusting the current congestion level and the time ratio of a traffic light at a certain intersection, obtains the optimal time ratio of the traffic light through reciprocating adjustment, utilizes the inspiring self-game thought of a generative confrontation network, can train and generate the confrontation network by utilizing limited basic data, then utilizes new data generated by the generated confrontation network to form virtual data and combines the basic data to improve the reinforcement learning speed, creatively uses the generated confrontation network to realize the optimal training of Q learning, the two are combined with each other, local traffic optimization is realized in the aspect of traffic signal lamp period, the best adjustment scheme is finally obtained, and the local traffic optimization efficiency can be greatly improved.
Drawings
FIG. 1 is a flow diagram of the present invention;
FIG. 2 is a diagram of the generative adversarial network architecture;
FIG. 3 is a diagram of the generative adversarial network training process;
FIG. 4 is a schematic view of a local traffic network;
fig. 5 is a schematic view of the traffic optimization principle.
Detailed Description
The invention is described below with reference to the accompanying drawings:
(1) data set and feature selection
The traffic flow of an intersection is taken as the data set; the invention studies a typical intersection, as shown in fig. 4. The state space of the intersection is the traffic flow of all roads; the action is setting a red or green light; and the action quality is judged using the road speed as the reward. One traffic-light switch is regarded as one cycle, and an action adjustment, i.e. an adjustment of the signal timing ratio, is made every three cycles. An optimal Q-value table is found through extensive training and applied to the specific intersection, so that the signal timing ratio can be adjusted in time to optimize traffic.
(2) Detailed description of the invention
The method introduces a generative adversarial network to improve the model's training effect on normal data while suppressing its generalization to abnormal data. As shown in fig. 2, the generative adversarial network comprises a generator G and a discriminator D: G tries to generate traffic flow sample data ever closer to reality, while D tries to perfectly distinguish the real data from the generated data, so that the desired data is eventually produced. The network structure is shown in fig. 2.
The objective function of the generative adversarial network model is:

\( \min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \)

where \( p_{data}(x) \) is the distribution of the real data, \( p_z(z) \) is the noise distribution, D is the discriminant function, x is real data, D(x) is the probability that the discriminator judges real data as real, and D(G(z)) is the probability it assigns to generated data. D is trained to maximize \( \log D(x) \) and \( \log(1 - D(G(z))) \); G is trained to minimize \( \log(1 - D(G(z))) \), i.e. to maximize the loss of D. One side is fixed during training while the parameters of the other network are updated; the two alternate so as to maximize each other's error, and finally G can estimate the distribution of the sample data, i.e. the generated samples become more realistic.
In this embodiment, the idea of the generative adversarial network algorithm is: first initialize G and D; then, in each iteration, fix G and train D. Select m sample points from the data set and m vectors from a prior distribution (uniform, normal, etc.); feed each vector z of the m vectors into the network to obtain m generated data points; train D to maximize \( \log D(x) \) and \( \log(1 - D(G(z))) \); train G to minimize \( \log(1 - D(G(z))) \). G hopes that D(G(z)) approaches 1, i.e. the positive class, so that G's loss is minimal, while D expects its output on real data to approach 1 and its output D(G(z)) on generated data to approach 0.
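The alternating procedure above can be sketched on a toy 1-D problem. This is a minimal illustration under stated assumptions, not the patent's implementation: the "generator" is a two-parameter affine map G(z) = μ + σz, the "discriminator" is logistic regression, and the analytic gradients follow the objective described above (with the common non-saturating generator objective log D(G(z))).

```python
import numpy as np

# Toy 1-D GAN: generator G(z) = mu + s*z, discriminator D(x) = sigmoid(w*x + b).
# Real data ~ N(3, 1); the generator starts at N(0, 1) and should drift toward it.
rng = np.random.default_rng(0)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

w, b = 0.1, 0.0          # discriminator parameters
mu, s = 0.0, 1.0         # generator parameters
lr_d, lr_g = 0.1, 0.01   # D adapts faster than G, a common stabilizing choice
real_mu = 3.0

for step in range(3000):
    # Fix G, update D: gradient ascent on  E[log D(x)] + E[log(1 - D(G(z)))]
    x = rng.normal(real_mu, 1.0, size=32)   # m real sample points
    z = rng.normal(0.0, 1.0, size=32)       # m noise vectors
    xf = mu + s * z                         # m generated data points
    d_real, d_fake = sigmoid(w * x + b), sigmoid(w * xf + b)
    w += lr_d * (np.mean((1 - d_real) * x) - np.mean(d_fake * xf))
    b += lr_d * (np.mean(1 - d_real) - np.mean(d_fake))
    # Fix D, update G: gradient ascent on the non-saturating  E[log D(G(z))]
    z = rng.normal(0.0, 1.0, size=32)
    xf = mu + s * z
    d_fake = sigmoid(w * xf + b)
    mu += lr_g * w * np.mean(1 - d_fake)
    s += lr_g * w * np.mean((1 - d_fake) * z)

# After training, mu should sit near the real mean (3.0).
```

The alternation is visible directly in the loop body: D's parameters move while G is held fixed, then G's parameters move against the updated D.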
The generative adversarial network training process is shown in fig. 3, where the light dotted line is the discriminator's response to the generated data, the dark dotted line is the distribution of the real data, and the solid line is the distribution of the generated data. Fig. 3 (a): at the start of training, the classification capability of the system is limited. Fig. 3 (b): D is better trained and can clearly distinguish generated data. Fig. 3 (c): the solid line deviates from the dark dotted line and the light dotted line drops, indicating that the probability assigned to generated data drops; the solid line moves toward the light dotted line; G improves during training, and G in turn affects the distribution of D. If G is fixed and D is trained to the optimum, the formula is:
\( D^{*}(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)} \)

where \( p_{data}(x) \) is the distribution of the real data x and \( p_g(x) \) is the distribution of the generated data x. As \( p_g \) increasingly approaches \( p_{data} \), \( D^{*}(x) \) approaches 0.5, i.e. the state of fig. 3 (d), and the training result is finally obtained with the two distributions the same; agent training for reinforcement learning is then performed on the generated data and the real data simultaneously.
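As a quick numerical check of the optimal-discriminator formula (an illustrative sketch, not part of the patent), evaluating \( D^{*}(x) = p_{data}(x) / (p_{data}(x) + p_g(x)) \) with Gaussian densities shows it collapses to 0.5 once the generator's distribution matches the data:

```python
import math

def normal_pdf(x, mu, sigma):
    # Density of N(mu, sigma^2) at x.
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def d_star(x, p_data, p_g):
    # Optimal discriminator for a fixed generator: D*(x) = p_data(x) / (p_data(x) + p_g(x)).
    return p_data(x) / (p_data(x) + p_g(x))

p_data = lambda x: normal_pdf(x, 3.0, 1.0)
p_same = lambda x: normal_pdf(x, 3.0, 1.0)   # generator has matched the data
p_far  = lambda x: normal_pdf(x, 0.0, 1.0)   # generator still far off

print(d_star(3.0, p_data, p_same))  # 0.5: D cannot tell real from fake
print(d_star(3.0, p_data, p_far))   # close to 1: the sample looks real to D
```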
As shown in fig. 4, the basic principle is a cyclic process: the signal controller issues an action by controlling the next-second state of the signal lamp, thereby changing the vehicle-speed state of the lane measured by the roadside detector, and then obtains a reward from the interaction with the environment. The Markov decision process is thus simply expressed as: M = <S, A, P_{s,a}, R>.
Specifically, as shown in fig. 5, an intersection has N lanes at a one-way exit, a detector is provided on each road in each direction to detect vehicles and obtain the vehicle speed V, and the road of length L is divided into M sections, so that the size of the state space of the exit at time t can be obtained; it is defined as:
In this embodiment, right turns are not controlled by the signal lamp, so the intersection has four states: north-south straight, north-south left turn, east-west straight, and east-west left turn. A 1 means the green light allows passage and a 0 means the red light forbids it, so the four states have four actions, represented by one-hot binary arrays: [1,0,0,0], [0,1,0,0], [0,0,1,0], and [0,0,0,1]. Time control of the signal is simulated by changing the input array; for example, inputting [1,0,0,0], [1,0,0,0] represents two seconds of north-south straight green.
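The one-hot phase encoding above can be written down directly. The helper below is a hypothetical illustration (the names `PHASES` and `phase_plan` are not from the patent); it expands a schedule into the per-second input arrays described above:

```python
# Signal phases as one-hot arrays, one entry per second of simulated time.
PHASES = {
    "NS_straight": [1, 0, 0, 0],
    "NS_left":     [0, 1, 0, 0],
    "EW_straight": [0, 0, 1, 0],
    "EW_left":     [0, 0, 0, 1],
}

def phase_plan(schedule):
    """Expand (phase_name, seconds) pairs into a per-second list of one-hot arrays."""
    plan = []
    for name, seconds in schedule:
        plan.extend([PHASES[name]] * seconds)
    return plan

# Two seconds of north-south straight green, then one second of east-west straight:
plan = phase_plan([("NS_straight", 2), ("EW_straight", 1)])
# plan == [[1,0,0,0], [1,0,0,0], [0,0,1,0]]
```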
The reward function must reflect how congested or free-flowing the local traffic network is. Under normal conditions the traffic state can be judged well from lane speeds: the faster the average speed, the better the traffic. Because lane traffic volumes differ, the average speed cannot be computed directly over all lanes in the area; a lane with large traffic flow contributes more to the average speed of the whole local network and is given a larger weight. The reward function formula is:

\( R = c \sum_i \frac{f_i}{F} \, (\bar{v}_i - v_0) \)

where c is a constant, \( \bar{v}_i \) is the average vehicle speed of the lane with lane number i, \( f_i \) is the traffic flow of lane i, F is the total flow of all lanes in the local traffic network, and \( v_0 \) is a set standard average speed, above which the calculated speed gives a positive return and below which a negative return.
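Under the reconstruction of the reward formula above, the flow-weighted reward can be sketched as follows. The function name and the linear form are assumptions based on the surrounding description: lanes are weighted by their share of the total flow, and each lane's speed is compared with the standard speed.

```python
def reward(avg_speeds, flows, v_std, c=1.0):
    """Flow-weighted speed reward: R = c * sum_i (f_i / F) * (v_i - v_std).

    Lanes carrying more traffic contribute proportionally more; average speeds
    above the standard speed v_std yield positive reward, below it negative.
    """
    total_flow = sum(flows)  # F: total flow of all lanes in the local network
    return c * sum(f / total_flow * (v - v_std) for v, f in zip(avg_speeds, flows))

# Busy lane above standard speed, light lane below it:
# (300/400)*(40-30) + (100/400)*(20-30) = 7.5 - 2.5 = 5.0
print(reward(avg_speeds=[40, 20], flows=[300, 100], v_std=30))  # 5.0
```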
For the storage of Q values, the input is each state and the output is the actions, i.e. the Q values:

\( Q(s, a) = f(s; \theta) \)

where \( \theta \) are the parameters of the neural network, the input is the state set S, and the output is the action-value function Q corresponding to each action.
The virtual traffic data and the real traffic data are used to train the neural network, thereby approximating the true action-value function and finding the optimal strategy, i.e. the set of all optimal actions.
The learning-algorithm update formula is:

\( Q(s, a) \leftarrow Q(s, a) + \alpha \left[ R + \gamma \max_{a'} Q(s', a') - Q(s, a) \right] \)

where \( \alpha \) is the learning rate: the larger \( \alpha \), the more the update is influenced by the next state. R is the reward value, \( \max_{a'} Q(s', a') \) represents the greedy selection over the next state set, and \( \gamma \) is the discount rate: the smaller \( \gamma \), the more the update is influenced by the immediate reward.
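The tabular update can be sketched in a few lines. The state labels and action indices below are hypothetical; only the update rule itself follows the formula above.

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Tabular update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)  # greedy value of next state
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

Q = defaultdict(float)      # Q-value table, default 0.0 for unseen (state, action)
actions = [0, 1, 2, 3]      # indices of the four one-hot signal phases

# One update from an all-zero table with reward 5.0:
# Q = 0 + 0.1 * (5.0 + 0.9 * 0 - 0) = 0.5
q_update(Q, s="low_flow", a=0, r=5.0, s_next="low_flow", actions=actions)
print(Q[("low_flow", 0)])   # 0.5
```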
The above embodiments are merely illustrative of the technical ideas and features of the present invention, and the purpose thereof is to enable those skilled in the art to understand the contents of the present invention and implement the present invention, and not to limit the protection scope of the present invention. All equivalent changes and modifications made according to the spirit of the present invention should be covered within the protection scope of the present invention.
Claims (5)
1. A local traffic optimization method based on reinforcement learning and a generative adversarial network, characterized by comprising the following steps: S1, establishing a training model, using a generative adversarial network to speed up model training, taking the real traffic-flow state set S detected at an intersection as input, and outputting virtual traffic flow data; S2, training on the real and virtual traffic flow data with Q-learning and outputting an action set to form a Q-value table, with the formula \( Q(s, a) = f(s; \theta) \), where \( \theta \) are the parameters of a neural network that takes the state set s as input and outputs the action-value function Q corresponding to each action, obtaining a local traffic optimization scheme; training the scheme with a reward function and computing the return value of the previous action by the reward formula \( R = c \sum_i \frac{f_i}{F} (\bar{v}_i - v_0) \), where c is a constant, \( \bar{v}_i \) is the average vehicle speed of the lane with lane number i, \( f_i \) is the traffic flow of lane i, F is the total flow of all lanes in the local traffic network, and \( v_0 \) is a set standard average speed above which the calculated speed gives a positive return and below which a negative return; and finding the optimal strategy, i.e. the set of all optimal actions, with the learning-algorithm update \( Q(s, a) \leftarrow Q(s, a) + \alpha [ R + \gamma \max_{a'} Q(s', a') - Q(s, a) ] \), where \( \alpha \) is the learning rate (the larger \( \alpha \), the more the update is influenced by the next state), R is the reward value, \( \max_{a'} Q(s', a') \) represents the greedy selection over the next state set, and \( \gamma \) is the discount rate (the smaller \( \gamma \), the more the update is influenced by the immediate reward), thereby obtaining the best local traffic optimization scheme.
2. The local traffic optimization method based on reinforcement learning and a generative adversarial network according to claim 1, wherein using the generative adversarial network to speed up model training in step S1 comprises: establishing a generative adversarial network model, initializing the generator and the discriminator in the generative adversarial network, fixing one party during training while updating the parameters of the other network, alternately iterating so as to maximize the other party's error, and finally generating a virtual data distribution the same as the real data distribution.
3. The local traffic optimization method based on reinforcement learning and a generative adversarial network according to claim 2, wherein: the fixed party during generative adversarial network training is the generator.
4. The local traffic optimization method based on reinforcement learning and a generative adversarial network according to claim 1, wherein: the state set S of the intersection is the set of all states \( s_t \) at time t, the state \( s_t \) being the traffic flow of all lanes at the one-way exits of the intersection at time t; an action, i.e. a Q value, is a cycle adjustment, where one cycle is one traffic-light switch; the action set is the set of all Q values; and the action return value R is the vehicle speed on the road.
5. The local traffic optimization method based on reinforcement learning and a generative adversarial network according to claim 1, wherein there are four intersection states: north-south straight, north-south left turn, east-west straight, and east-west left turn; a 1 means the green light allows passage and a 0 means the red light forbids it, so the four states have four actions, represented by one-hot binary arrays: [1,0,0,0], [0,1,0,0], [0,0,1,0], and [0,0,0,1]; and time control of the traffic signal is simulated by changing the input array, with one second as the unit.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110526842.0A CN112991750B (en) | 2021-05-14 | 2021-05-14 | Local traffic optimization method based on reinforcement learning and generation type countermeasure network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112991750A CN112991750A (en) | 2021-06-18 |
CN112991750B true CN112991750B (en) | 2021-11-30 |
Family
ID=76336522
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110526842.0A Active CN112991750B (en) | 2021-05-14 | 2021-05-14 | Local traffic optimization method based on reinforcement learning and generation type countermeasure network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112991750B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113506450B (en) * | 2021-07-28 | 2022-05-17 | 浙江海康智联科技有限公司 | Qspare-based single-point signal timing scheme selection method |
CN114613170B (en) * | 2022-03-10 | 2023-02-17 | 湖南大学 | Traffic signal lamp intersection coordination control method based on reinforcement learning |
CN115662152B (en) * | 2022-09-27 | 2023-07-25 | 哈尔滨理工大学 | Urban traffic management self-adaptive system based on deep learning driving |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107194612A (en) * | 2017-06-20 | 2017-09-22 | 清华大学 | A kind of train operation dispatching method learnt based on deeply and system |
CN111191654A (en) * | 2019-12-30 | 2020-05-22 | 重庆紫光华山智安科技有限公司 | Road data generation method and device, electronic equipment and storage medium |
CN111311577A (en) * | 2020-02-14 | 2020-06-19 | 迈拓仪表股份有限公司 | Intelligent water seepage detection method based on generation of confrontation network and reinforcement learning |
CN112700664A (en) * | 2020-12-19 | 2021-04-23 | 北京工业大学 | Traffic signal timing optimization method based on deep reinforcement learning |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10832140B2 (en) * | 2019-01-30 | 2020-11-10 | StradVision, Inc. | Method and device for providing information for evaluating driving habits of driver by detecting driving scenarios occurring during driving |
- 2021-05-14 (CN): application CN202110526842.0A, granted as CN112991750B, status active
Also Published As
Publication number | Publication date |
---|---|
CN112991750A (en) | 2021-06-18 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||