CN113506450A - Q-learning-based single-point signal timing scheme selection method - Google Patents

Q-learning-based single-point signal timing scheme selection method

Info

Publication number
CN113506450A
Authority
CN
China
Prior art keywords
scheme
action
state
value
epsilon
Prior art date
Legal status
Granted
Application number
CN202110856591.2A
Other languages
Chinese (zh)
Other versions
CN113506450B (en)
Inventor
朱海峰
郭敏
温熙华
陈鹏飞
Current Assignee
Zhejiang Haikang Zhilian Technology Co ltd
Original Assignee
Zhejiang Haikang Zhilian Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Zhejiang Haikang Zhilian Technology Co ltd
Priority to CN202110856591.2A
Publication of CN113506450A
Application granted
Publication of CN113506450B
Active legal status
Anticipated expiration

Classifications

    • G - PHYSICS
    • G08 - SIGNALLING
    • G08G - TRAFFIC CONTROL SYSTEMS
    • G08G 1/00 - Traffic control systems for road vehicles
    • G08G 1/07 - Controlling traffic signals
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G08 - SIGNALLING
    • G08G - TRAFFIC CONTROL SYSTEMS
    • G08G 1/00 - Traffic control systems for road vehicles
    • G08G 1/01 - Detecting movement of traffic to be counted or controlled
    • G08G 1/0104 - Measuring and analyzing of parameters relative to traffic conditions
    • G08G 1/0125 - Traffic data processing
    • G - PHYSICS
    • G08 - SIGNALLING
    • G08G - TRAFFIC CONTROL SYSTEMS
    • G08G 1/00 - Traffic control systems for road vehicles
    • G08G 1/07 - Controlling traffic signals
    • G08G 1/08 - Controlling traffic signals according to detected number or speed of vehicles
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 - Road transport of goods or passengers
    • Y02T 10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 - Engine management systems

Abstract

A Q-learning-based single-point signal timing scheme selection method balances the stability and flexibility of signal timing optimization. Starting from the original fixed timing scheme for an intersection time period, it explores and selects schemes within upper and lower safe search areas around that scheme, achieving stable control, while also responding promptly to relatively long-term slow or abnormal changes within the period, achieving flexible control. Through continuous training, a signal timing scheme matching the current traffic environment state is selected according to the finally obtained Q-value table.

Description

Q-learning-based single-point signal timing scheme selection method
Technical Field
The invention relates to the field of traffic signal control, in particular to a Q-learning-based single-point signal timing scheme selection method.
Background
At present, intersection signal control often adopts a multi-period fixed timing scheme. Such simply designed schemes cannot adapt to long-term or short-term changes in the traffic environment, causing unnecessary delay and even congestion in some periods. Real-time optimization of the scheme within each period is therefore necessary; however, common real-time optimization methods either do not learn from feedback and involve complex calculation, or change too flexibly to be safe, which hinders deployment and routine operation, so they cannot fully meet the demands of dynamic traffic signal timing.
There are related patent cases in the prior art, such as:
The patent "single-point signal control optimization method based on intersection traffic records" (patent application number: 201610971018.5) analyzes a green-light utilization index from the traffic flow and queuing conditions and shortens the green time when green time is left over. However, the method is only suitable for small traffic flows at the intersection and cannot generate a suitable signal control scheme when a sudden large flow occurs.
In the patent of 'intersection signal timing optimization method for reducing motor vehicle exhaust emission' (patent application number: 201510628335.2), a signal timing optimization model for minimizing the motor vehicle emission is constructed according to traffic flow theory and operation research, but the scheme needs to calibrate an emission factor and solve the model by adopting quadratic programming. The method is complex in implementation and calculation process and is not beneficial to practical application.
Disclosure of Invention
In view of the problems described in the background, the invention provides a Q-learning-based single-point signal timing scheme selection method. It distinguishes normal and abnormal traffic environment states within a time period, selects and executes a timing scheme action in the corresponding state, applies it to the current traffic environment, analyzes the intersection state, and gives corresponding reward or punishment feedback according to that state; the reward or punishment reinforces the mapping between environment states and optimal scheme selection. By repeating this mapping process, the learning model acquires the ability to select the best scheme under both normal and abnormal environmental states within the period. The invention is further elucidated below.
S1, state space definition;
To be able to describe both the normal state and the abnormal state, the state space is defined as S = (C, F), where C represents the state set and F represents the state switch.
To let the invention converge quickly and respond quickly to changes in the traffic environment, the state set is kept simple. The fixed timing scheme running in a given time period is selected as the reference scheme, and l-1 schemes are expanded in each of the upward and downward directions of the reference scheme, with l chosen according to the actual application. The state set C thus contains 2l-1 schemes in total, C = (P1, P2, ..., Pl, ..., P(2l-1)), where Pl is the original reference scheme, P1 is downward expansion scheme 1, P(l-1) is downward expansion scheme l-1, and P(2l-1) is upward expansion scheme 2l-1.
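The expansion described above can be sketched in a few lines; the per-level scaling step (5% here) and the rounding rule are illustrative assumptions, since the invention leaves the exact expansion rule to the practitioner:

```python
def build_state_set(baseline, l, step=0.05):
    """Expand a baseline timing scheme into 2l-1 schemes:
    l-1 downward expansions, the baseline itself, l-1 upward expansions.
    `baseline` is a list of phase green times in seconds; `step` is a
    hypothetical per-level scaling factor."""
    schemes = []
    for level in range(-(l - 1), l):       # levels -(l-1) .. +(l-1)
        factor = 1.0 + level * step
        schemes.append([round(t * factor) for t in baseline])
    return schemes                          # schemes[l-1] is the baseline Pl

# with l = 3 this yields a 5-scheme state set C = (P1, ..., P5)
C = build_state_set([54, 34, 44], l=3)
```

With l = 3 the list has 2l-1 = 5 entries and the middle entry equals the reference scheme.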
In order to distinguish whether the traffic state is abnormal or not, a switching value F for normal and abnormal states is set:
F = 1 (abnormal state) when (y_now - y_last)/y_last > e; F = 0 (normal state) otherwise
In the formula, y is the key flow ratio of the intersection. Assuming the intersection has j phases,

y = Σ_{i=1..j} y_i, with y_i = q_i / s_i,

where q_i is the flow of the critical traffic stream of phase i and s_i is the saturation flow of the lane. y_now is the current key flow ratio and y_last is the normal key flow ratio for the period, obtained from data analysis; when the relative increase over y_last exceeds e, the state can be judged abnormal, and e can be set according to the actual intersection conditions.
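The key flow ratio and the switch F above can be sketched as follows; the function name and the example figures (critical flows, saturation flow) are illustrative assumptions:

```python
def state_switch(q, s, y_last, e):
    """y = sum of q_i / s_i over the j phases (key flow ratio);
    F = 1 (abnormal) when y rises more than fraction e above the
    period's normal ratio y_last, else F = 0 (normal)."""
    y_now = sum(qi / si for qi, si in zip(q, s))
    F = 1 if (y_now - y_last) / y_last > e else 0
    return y_now, F

# three-phase example: critical flows 300/800/400 veh/h, saturation 1800 veh/h
y, F = state_switch([300, 800, 400], [1800, 1800, 1800], y_last=0.6, e=0.2)
```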
S2, defining an action space;
In single-point signal timing optimization, a complete action space contains all possible actions of the intersection within a time step, i.e. all possible signal timing schemes. Since too large an action space slows the algorithm's convergence, the actions are simplified to the selection of w schemes. The action space is defined as A = (a1, a2, ..., am, ..., aw), where am is the m-th signal timing scheme in the action space. The cycle differs between schemes; within each scheme, the phase durations can be allocated and adjusted according to the flow ratios of the critical traffic streams of each phase.
In order to simplify the algorithm, the action space in the abnormal state and the action space in the normal state are set to be the same, and the action space needs to be simultaneously covered to the timing scheme space in the normal state and the abnormal state; in practical application, the action spaces can be set according to the normal state and the abnormal state respectively, and the action spaces of the normal state and the abnormal state can be set to be different.
S3, a reward function;
The return function can be computed from index values such as delay time, number of stops, and queue length; these values can be obtained directly from simulation software. Here, the average vehicle delay at the intersection is selected as the evaluation index.
Firstly, the upper limit values d of the delay variation ranges of the different categories at the intersection within the period are obtained by a clustering algorithm. As shown in fig. 3, the ordinate is the cluster category and the abscissa is the average vehicle delay value in seconds; on category 0, the markers denote the cluster center of the normal-delay category and the upper limit of the normal delay value at the 80% quantile; on category 1, the markers denote the cluster center of the abnormal-delay category and the upper limit of the abnormal delay value at the 80% quantile.
The reward and punishment function is:
[reward-function equation image: r_t(s, a) takes the value +1 or -1 according to whether the post-action delay d_tk stays within the clustered upper limit d]
in the formula: d_t0 is the delay before the action is executed, d_tk is the delay after the action is executed.
To prevent sudden delay changes caused by traffic fluctuation from making the reward function oscillate, a consecutive-same-action flag b is set: if the same action is performed twice in a row, b = 2; three times in a row, b = 3; and so on, b increasing by 1 for each additional consecutive repeat; if the run of the same action is interrupted, b = 1.
The feedback r_t(s, a) is adjusted according to b and dif, with the following rules:
r_t(s, a) = 0, dif < k; r_t(s, a) = -1, dif ≥ k (for b = 2 and r_t(s, a) = -1)

dif = d_tk - d_t0
When b = 2 and r_t(s, a) = -1, the selected scheme has been selected for the second time; since the action selection strategy of the invention adopts a greedy algorithm, the scheme is known to be a relatively good one, and the delay rise may be due to traffic volatility. When the delay rise is small, i.e. dif < k, r_t(s, a) can be corrected to 0; when the rise is large, i.e. dif ≥ k, r_t(s, a) = -1 can be kept. The value of k can be set empirically.
r_t(s, a) = -1, dif < k; r_t(s, a) = -b + 1, dif ≥ k (for b > 2 and r_t(s, a) = -1)
When b > 2 and r_t(s, a) = -1, the selected scheme has been selected three or more times in succession and is known to be a good scheme; the delay rise may be due to traffic volatility or to a change in the traffic environment. When the rise is small, i.e. dif < k, r_t(s, a) = -1 can be kept; when it is large, i.e. dif ≥ k, r_t(s, a) can be corrected to -b + 1, strengthening the feedback value of the environmental change.
r_t(s, a) = 2 when r_t(s, a) = 1 and b = 2;
When r_t(s, a) = 2 and b = 2, b is reset to 1; otherwise b would keep growing while the same good action repeats, and a later correction r_t(s, a) = -b + 1, or even a small negative value, would cause strong oscillation and non-convergence.
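The reward-and-correction logic of S3 can be collected into one function. The base ±1 reward here compares the post-action delay with the clustered upper limit d, which is one plausible reading of the patent's reward-function image; the b-flag corrections follow the rules stated above:

```python
def adjusted_reward(d_t0, d_tk, d, b, k=10):
    """Return the adjusted feedback r_t(s, a).
    d_t0: delay before the action; d_tk: delay after the action;
    d: clustered upper limit of the delay range; b: consecutive-same-action
    flag; k: empirical threshold on the delay rise dif = d_tk - d_t0."""
    dif = d_tk - d_t0
    r = 1 if d_tk <= d else -1            # assumed base reward (see lead-in)
    if r == -1 and b == 2:
        r = 0 if dif < k else -1          # 2nd repeat: forgive a small rise
    elif r == -1 and b > 2:
        r = -1 if dif < k else -b + 1     # 3+ repeats: amplify the penalty
    elif r == 1 and b == 2:
        r = 2                             # confirmed good action (then reset b)
    return r
```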
S4, updating the Q value table;
The Q value is updated using the Bellman optimality equation:
Q_{t+1}(s_t, a_t) = (1 - α_t) Q_t(s_t, a_t) + α_t (r_{t+1} + γ max Q_t(s_{t+1}, a_{t+1}));
In the invention, two Q-value tables are required, one recording normal traffic and one recording abnormal traffic; the other parameters can be set uniformly. α is the learning rate and γ is the discount factor: the larger the learning rate, the less of the previous training is retained; the larger the discount factor γ, the more weight future rewards carry. α and γ can be determined according to the characteristics of the specific intersection.
The action selection strategy adopts a greedy algorithm, namely the ε-greedy exploration strategy: a self-increasing value ε and a randomly generated number r ∈ [0, 1] are compared to select the learning action. Selection rule: when r < ε, the action with the maximum Q value in the current state is selected; when r ≥ ε, an action is selected at random. ε ∈ [ε1, ε2], with the self-increment rule: when the iteration count n ≤ N1, ε = ε1 + (ε2 - ε1)/N1 × n; when n > N1, ε = ε2.
According to the Q-learning rule, the Q-value table is a (2l-1) × w matrix, and the Q values of the different actions in each state are updated iteratively according to the Bellman equation. The aim is for the Q value of the optimal action in each state to become the largest, so that the optimal action is selected with ever higher probability and non-optimal actions with ever lower probability; once the Q-value matrix converges, the optimal action can be selected with high probability in every state.
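A minimal sketch of the ε-greedy selection and the Bellman update, using the self-increment rule above; α = 0.1 and γ = 0.9 are placeholder values, since the patent leaves them to the characteristics of the specific intersection:

```python
import random

def epsilon(n, n1=500, eps1=0.7, eps2=0.9):
    """Self-increasing exploration threshold: linear ramp up to n1, then flat."""
    return eps1 + (eps2 - eps1) / n1 * n if n <= n1 else eps2

def select_action(Q, s, n, rng=random):
    """When r < eps exploit (argmax over Q[s]); otherwise explore at random."""
    if rng.random() < epsilon(n):
        return max(range(len(Q[s])), key=lambda a: Q[s][a])
    return rng.randrange(len(Q[s]))

def update_q(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Q(s,a) <- (1-alpha)*Q(s,a) + alpha*(r + gamma * max_a' Q(s',a'))."""
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * (r + gamma * max(Q[s_next]))

# two (2l-1) x w tables, one for normal and one for abnormal traffic
Q_normal = [[0.0] * 5 for _ in range(5)]
Q_abnormal = [[0.0] * 5 for _ in range(5)]
```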
Has the advantages that: compared with the prior art, the method 1) explores and selects schemes within upper and lower safe search areas around the original fixed timing scheme of the intersection period, achieving stable control; 2) at the same time responds promptly to relatively long-term slow or abnormal changes within the period, reflecting flexible control. Stability and flexibility of signal timing optimization are thus both achieved, stably and flexibly improving the operation of the intersection.
Drawings
FIG. 1: schematic diagram of the timing scheme selection method of the invention;
FIG. 2: schematic diagram of scheme selection in the invention;
FIG. 3: clustering result of vehicle delays at the intersection;
FIG. 4: initial values of the Q-value table;
FIG. 5: Q-value table after training convergence;
FIG. 6: delay comparison curve between the algorithm and fixed timing.
Detailed Description
A specific embodiment of the present invention will be described in detail with reference to the accompanying drawings.
A Q-learning-based single-point signal timing scheme selection method balances the stability and flexibility of signal timing optimization. Starting from the original fixed timing scheme for an intersection time period, it explores and selects schemes within upper and lower safe search areas around that scheme, achieving stable control, while also responding promptly to relatively long-term slow or abnormal changes within the period, achieving flexible control. Through continuous training, a signal timing scheme matching the current traffic environment state is selected according to the finally obtained Q-value table.
The method comprises the following steps:
step 1, determining a state space;
The state space is defined as S = (C, F), where C represents the state set and F represents the state switch.
For ease of understanding, take a three-phase intersection on an urban arterial road as an example. The fixed timing scheme running in a given period at the intersection is taken as the reference scheme, and two schemes are expanded in each of the upward and downward directions of the reference scheme. The state set C then has 5 schemes, C = (P1, P2, P3, P4, P5), where P3 is the original reference scheme, P1 and P2 are downward expansion schemes 1 and 2, and P4 and P5 are upward expansion schemes 4 and 5.
The 5 schemes in the state set are set as follows: scheme P3 has phase durations t1(3), t2(3), t3(3) and cycle Cycle3; expansion scheme P1 has phase durations t1(1), t2(1), t3(1) and cycle Cycle1; expansion scheme P2 has phase durations t1(2), t2(2), t3(2) and cycle Cycle2; expansion scheme P4 has phase durations t1(4), t2(4), t3(4) and cycle Cycle4; expansion scheme P5 has phase durations t1(5), t2(5), t3(5) and cycle Cycle5.
In order to distinguish whether the traffic state is abnormal or not, the present embodiment sets a switching amount F for the normal and abnormal states.
F = 1 (abnormal state) when (y_now - y_last)/y_last > e; F = 0 (normal state) otherwise
In the formula, y is the key flow ratio of the intersection. Taking the three-phase intersection as the example, y = (q1 + q2 + q3)/s, where q1, q2, q3 are the flows of the critical traffic streams of phase 1, phase 2 and phase 3 respectively, and s is the saturation flow of the lane (assumed here to be the same for all lanes). y_now is the current key flow ratio and y_last is the normal key flow ratio for the period, obtained from data analysis; when the relative increase over y_last exceeds e the state can be judged abnormal, and e can be set according to the actual intersection conditions.
Step 2, determining an action space;
Since too large an action space slows the algorithm's convergence, the actions are simplified to the selection of 5 schemes. The action space is defined as A = (P1, P2, P3, P4, P5), where P1, P2, P3, P4, P5 are identical to P1, P2, P3, P4, P5 in the state space. The action space in the abnormal state is set to be the same as in the normal state, and must cover the timing scheme space of both the normal and abnormal states.
Step 3, determining a return function;
selecting average delay of vehicles at the intersection as an evaluation index;
Firstly, the upper limit values d of the delay variation ranges of the different categories at the intersection within the period are obtained by a clustering algorithm; as shown in fig. 3, the normal delay upper limit is d = 44 s and the abnormal delay upper limit is d = 66 s.
the reward and punishment function is:
[reward-function equation image: r_t(s, a) takes the value +1 or -1 according to whether the post-action delay d_tk stays within the clustered upper limit d]
in the formula: d_t0 is the delay before the action is executed, d_tk is the delay after the action is executed.
To prevent sudden delay changes caused by traffic fluctuation from making the reward function oscillate, a consecutive-same-action flag b is set: if the same action is performed twice in a row, b = 2; three times in a row, b = 3; and so on, b increasing by 1 for each additional consecutive repeat; if the run of the same action is interrupted, b = 1;
The feedback r_t(s, a) is adjusted according to b and dif, with the following rules (here k is set to 10):
r_t(s, a) = 0, dif < 10; r_t(s, a) = -1, dif ≥ 10 (for b = 2 and r_t(s, a) = -1)

dif = d_tk - d_t0
When b = 2 and r_t(s, a) = -1, the selected scheme has been selected for the second time; since the action selection strategy of the algorithm adopts a greedy algorithm, the scheme is known to be a relatively good one, and the delay rise may be due to traffic volatility. When the delay rise is small, i.e. dif < 10, r_t(s, a) can be corrected to 0; when the rise is large, i.e. dif ≥ 10, r_t(s, a) = -1 can be kept.
r_t(s, a) = -1, dif < 10; r_t(s, a) = -b + 1, dif ≥ 10 (for b > 2 and r_t(s, a) = -1)
When b > 2 and r_t(s, a) = -1, the selected scheme has been selected three or more times in succession and is known to be a good scheme; the delay rise may be due to traffic volatility or to a change in the traffic environment. When the rise is small, i.e. dif < 10, r_t(s, a) = -1 can be kept; when it is large, i.e. dif ≥ 10, r_t(s, a) can be corrected to -b + 1, strengthening the feedback value of the environmental change.
r_t(s, a) = 2 when r_t(s, a) = 1 and b = 2;
When r_t(s, a) = 2 and b = 2, b is reset to 1; otherwise b would keep growing while the same good action repeats, and a later correction r_t(s, a) = -b + 1, or even a small negative value, would cause strong oscillation and non-convergence.
Step 4, determining the updating of the Q value table;
The Q value is updated using the Bellman optimality equation:
Q_{t+1}(s_t, a_t) = (1 - α_t) Q_t(s_t, a_t) + α_t (r_{t+1} + γ max Q_t(s_{t+1}, a_{t+1}));
N1 is set to 500 and ε ∈ [0.7, 0.9], with the self-increment rule: when the iteration count n ≤ 500, ε = 0.7 + 0.2/500 × n; when n > 500, ε = 0.9.
The Q values are expressed as a 5 × 5 matrix. As shown in fig. 4, the rows s1, s2, s3, s4, s5 denote the 5 states (for ease of understanding, the conventional reinforcement-learning letter s is used; the corresponding states here are P1 to P5), and the columns a1, a2, a3, a4, a5 denote the 5 actions (likewise, the conventional letter a is used; the corresponding actions here are P1 to P5).
The single point scheme selection method of the present invention is further illustrated below with reference to examples:
example 1: take an intersection in a city of a province as an example.
The 5 schemes in the state set are set as follows: the phase durations of scheme P3 are 54 s, 34 s and 44 s, with a cycle of 132 s; P1 is expansion scheme 1, with phase durations 30 s, 24 s and 32 s and a cycle of 86 s; P2 is expansion scheme 2, with phase durations 43 s, 29 s and 40 s and a cycle of 112 s; P4 is expansion scheme 4, with phase durations 56 s, 35 s and 46 s and a cycle of 137 s; P5 is expansion scheme 5, with phase durations 59 s, 36 s and 47 s and a cycle of 142 s.
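The five embodiment schemes can be written down and sanity-checked (a hypothetical encoding): each stated cycle equals the sum of its phase durations, e.g. 54 + 34 + 44 = 132.

```python
# (phase durations in seconds, cycle in seconds) for each scheme
schemes = {
    "P1": ([30, 24, 32], 86),
    "P2": ([43, 29, 40], 112),
    "P3": ([54, 34, 44], 132),   # original reference scheme
    "P4": ([56, 35, 46], 137),
    "P5": ([59, 36, 47], 142),
}
for phases, cycle in schemes.values():
    assert sum(phases) == cycle  # cycles are consistent with the phase splits
```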
The flow statistics and lane counts for the period are shown in the following table:

Entrance    East    South    West    North
Flow        300     800      300     400
Lanes       2       4        2       4
The initial Q-value table of the algorithm at the single-point intersection is set as shown in fig. 4; running the algorithm code updates the Q-value table, and the learning result is likewise a 5 × 5 matrix, as shown in fig. 5.
During iteration, the intersection scheme is switched continuously according to the learning rule, and Q-value tables with different degrees of convergence are obtained by setting different learning counts N. The larger the value of N, the better the Q-value table converges, but the longer the run takes. In the experiment N = 540, and the total average delay D_i is accumulated over every 30 iterations, giving 18 statistics in total (540/30 = 18):

D_i = Σ_{m=30(i-1)+1}^{30i} d_m

where m is the iteration index (m ∈ [1, 540]), i is the statistic index (i ∈ [1, 18]), d_m is the average vehicle delay of iteration m, and D_i is the sum of the vehicle delays over the i-th batch of 30 iterations.
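The batching of per-iteration delays into the 18 statistics can be sketched as follows, reading D_i as the sum of the average vehicle delays over the i-th batch of 30 iterations:

```python
def batch_total_delays(delays, batch=30):
    """Split the per-iteration average vehicle delays into consecutive
    batches of `batch` iterations and return the sum of each batch (D_i)."""
    return [sum(delays[i:i + batch]) for i in range(0, len(delays), batch)]

# 540 iterations -> 540 / 30 = 18 statistics
D = batch_total_delays([40.0] * 540)
```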
Fig. 6 compares the delay of the invention with fixed timing, with the statistic index i on the abscissa and the total vehicle average delay D_i on the ordinate. As shown in the figure,
compared with other methods, after convergence the normal or abnormal Q-value table of the invention can respond quickly to long-term slow changes of the traffic environment, owing to the return function and selection strategy formulated here. Because the near-optimal action after convergence is selected repeatedly, an action that no longer matches the traffic environment keeps being selected and keeps receiving an ever-growing punishment, so its selection probability drops rapidly until a new converged state is reached.
The invention balances the stability and flexibility of signal timing optimization: based on the original fixed timing scheme of the intersection period, schemes are explored and selected within the upper and lower safe search areas of the intersection, achieving stable control; at the same time, a timely response is made to relatively long-term slow or abnormal changes within the period, reflecting flexible control. The aim of stably and flexibly improving the operation of the intersection is thus finally achieved.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (5)

1. A Q-learning-based single-point signal timing scheme selection method, characterized by comprising the following steps:
s1, state space definition;
defining a state space as S = (C, F), where C represents the state set and F represents the state switch;
simplifying the state set: a fixed timing scheme running in a given time period is selected as the reference scheme Pl, and l-1 schemes are expanded in each of the upward and downward directions of the reference scheme Pl, the value of l being chosen according to the actual application; the state set C then has 2l-1 schemes in total, C = (P1, P2, ..., Pl, ..., P(2l-1)), where Pl is the original reference scheme, P1 is downward expansion scheme 1, P(l-1) is downward expansion scheme l-1, and P(2l-1) is upward expansion scheme 2l-1;
setting a switching value F to distinguish whether the traffic state is abnormal or not:
F = 1 (abnormal state) when (y_now - y_last)/y_last > e; F = 0 (normal state) otherwise;

y = Σ_{i=1..j} q_i / s_i;
in the formula, y is the key flow ratio of the intersection with j phases, q_i is the flow of the critical traffic stream of phase i, s_i is the saturation flow of the lane, y_now is the current key flow ratio, and y_last is the normal key flow ratio for the period, obtained from historical data analysis; when the relative increase over y_last exceeds e, the state can be judged abnormal, and e can be set according to the actual intersection conditions;
s2, defining an action space;
a complete action space comprises all possible signal timing schemes of the intersection within a time step, and is defined as: A = (a1, a2, ..., am, ..., aw);
wherein am is the m-th signal timing scheme in the action space; the cycle differs between schemes, and the phase durations within each scheme can be allocated and adjusted according to the flow ratios of the critical traffic streams of each phase;
s3, a reward function;
the return function is calculated from the delay time, which is obtained directly from simulation software or calculated in practical application;
firstly, analyzing and obtaining the upper limit value d of different types of delay variation ranges in the time interval of the intersection through a clustering algorithm;
next, define the reward and penalty function as:
[reward-function equation image: r_t(s, a) takes the value +1 or -1 according to whether the post-action delay d_tk stays within the clustered upper limit d]
in the formula: d_t0 is the delay before the action is executed, d_tk is the delay after the action is executed;
setting a consecutive-same-action flag b: if the same action is performed twice in a row, b = 2; three times in a row, b = 3; and so on, b increasing by 1 for each additional consecutive repeat; if the run of the same action is interrupted, b = 1;
adjusting the feedback r_t(s, a) according to b and dif, with the following rules:
r_t(s, a) = 0, dif < k; r_t(s, a) = -1, dif ≥ k (for b = 2 and r_t(s, a) = -1);

dif = d_tk - d_t0
when b = 2 and r_t(s, a) = -1, the selected scheme has been selected for the second time, and since the action selection strategy adopts a greedy algorithm, the scheme is known to be a relatively good one; when the delay rise is small, i.e. dif < k, r_t(s, a) is corrected to 0; when the rise is large, i.e. dif ≥ k, r_t(s, a) = -1 is kept, the value of k being set empirically;
r_t(s, a) = -1, dif < k; r_t(s, a) = -b + 1, dif ≥ k (for b > 2 and r_t(s, a) = -1);
when b > 2 and r_t(s, a) = -1, the selected scheme has been selected three or more times in succession and is known to be a relatively good scheme; when the delay rise is small, i.e. dif < k, r_t(s, a) = -1 is kept; when the rise is large, i.e. dif ≥ k, r_t(s, a) is corrected to -b + 1, strengthening the feedback value of the environmental change;
r_t(s, a) = 2 when r_t(s, a) = 1 and b = 2;
when r_t(s, a) = 2 and b = 2, b is reset to 1; otherwise b would keep growing while the same good action repeats, and a later correction r_t(s, a) = -b + 1, or even a small negative value, would cause strong oscillation and non-convergence.
S4, updating the Q value table;
establishing two Q-value tables, one recording normal traffic and one recording abnormal traffic, with the other parameters set uniformly; the Q value is updated using the Bellman optimality equation:
Q_{t+1}(s_t, a_t) = (1 - α_t) Q_t(s_t, a_t) + α_t (r_{t+1} + γ max Q_t(s_{t+1}, a_{t+1}));
wherein α is the learning rate and γ is the discount factor, both determined according to the characteristics of the specific intersection;
the action selection strategy is an ε-greedy exploration strategy; according to the Q-learning rule, the Q-value table is a (2l-1) × w matrix, and the Q values of the different actions in each state are updated iteratively according to the Bellman equation.
2. The timing scheme selection method according to claim 1, wherein the ε-greedy exploration strategy selects the learning action by comparing a self-incrementing ε value with a randomly generated number r ∈ [0, 1], with the following selection rule:
when r < ε, the action with the largest Q value in the current state is selected;
when r >= ε, an action is selected at random and executed in the current state;
ε ∈ [ε_1, ε_2], and ε increments as follows:
when the iteration number n <= N_1, ε = ε_1 + (ε_2 - ε_1)/N_1 · n; when n > N_1, ε = ε_2.
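The schedule and selection rule of claim 2 can be sketched as below. The values ε_1 = 0.1, ε_2 = 0.9, and N_1 = 1000 are illustrative assumptions, not values from the claims. Note that the claim inverts the usual ε-greedy convention: here ε is the probability of exploitation, so the ramp makes the agent greedier over time.

```python
import random

def epsilon_at(n, eps1=0.1, eps2=0.9, n1=1000):
    # Linear self-increment from eps1 to eps2 over the first n1
    # iterations, then held at eps2 (the schedule in claim 2).
    return eps1 + (eps2 - eps1) / n1 * n if n <= n1 else eps2

def select_action(q_row, n):
    # Claim 2's rule: r < eps exploits (argmax of the Q row for the
    # current state), r >= eps explores with a random action.
    if random.random() < epsilon_at(n):
        return max(range(len(q_row)), key=lambda a: q_row[a])
    return random.randrange(len(q_row))
```

At n = 0 the agent exploits only 10% of the time; by n = N_1 that has risen to 90% and stays there.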
3. The timing scheme selection method according to claim 2, wherein an action is the selection of one of 5 candidate schemes, and the action space is defined as A = (P1, P2, P3, P4, P5), where P1, P2, P3, P4, P5 are identical to P1, P2, P3, P4, P5 in the state space.
4. The timing scheme selection method of claim 3, wherein the action space in the abnormal state is set to be the same as that in the normal state, so that a single action space covers the timing-scheme spaces of both the normal and the abnormal state.
5. The timing scheme selection method of claim 4, wherein the value of k is empirically set to 10.
CN202110856591.2A 2021-07-28 2021-07-28 Qspare-based single-point signal timing scheme selection method Active CN113506450B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110856591.2A CN113506450B (en) 2021-07-28 2021-07-28 Qspare-based single-point signal timing scheme selection method


Publications (2)

Publication Number Publication Date
CN113506450A true CN113506450A (en) 2021-10-15
CN113506450B CN113506450B (en) 2022-05-17

Family

ID=78014271

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110856591.2A Active CN113506450B (en) 2021-07-28 2021-07-28 Qspare-based single-point signal timing scheme selection method

Country Status (1)

Country Link
CN (1) CN113506450B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115078798A (en) * 2022-07-26 2022-09-20 武汉格蓝若智能技术有限公司 Current range switching method and current collecting device

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105654744A (en) * 2016-03-10 2016-06-08 同济大学 Improved traffic signal control method based on Q learning
CN108335497A (en) * 2018-02-08 2018-07-27 南京邮电大学 A kind of traffic signals adaptive control system and method
CN108510764A (en) * 2018-04-24 2018-09-07 南京邮电大学 A kind of adaptive phase difference coordinated control system of Multiple Intersections and method based on Q study
US20180261085A1 (en) * 2017-03-08 2018-09-13 Fujitsu Limited Adjustment of a learning rate of q-learning used to control traffic signals
CN109035812A (en) * 2018-09-05 2018-12-18 平安科技(深圳)有限公司 Control method, device, computer equipment and the storage medium of traffic lights
CN109215355A (en) * 2018-08-09 2019-01-15 北京航空航天大学 Single-point intersection signal timing optimization method based on deep reinforcement learning
CN111081035A (en) * 2019-12-17 2020-04-28 扬州市鑫通智能信息技术有限公司 Traffic signal control method based on Q learning
CN111243271A (en) * 2020-01-11 2020-06-05 多伦科技股份有限公司 Single-point intersection signal control method based on deep cycle Q learning
CN112700664A (en) * 2020-12-19 2021-04-23 北京工业大学 Traffic signal timing optimization method based on deep reinforcement learning
CN112950963A (en) * 2021-01-25 2021-06-11 武汉工程大学 Self-adaptive signal control optimization method for main branch intersection of city
CN112991750A (en) * 2021-05-14 2021-06-18 苏州博宇鑫交通科技有限公司 Local traffic optimization method based on reinforcement learning and generation type countermeasure network


Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
PÉTER PÁLOS: "Comparison of Q-Learning based Traffic Light Control Methods and Objective Functions", 《2020 INTERNATIONAL CONFERENCE ON SOFTWARE, TELECOMMUNICATIONS AND COMPUTER NETWORKS (SOFTCOM)》 *
YING LIU: "Intelligent traffic light control using distributed multi-agent Q learning", 《2017 IEEE 20TH INTERNATIONAL CONFERENCE ON INTELLIGENT TRANSPORTATION SYSTEMS (ITSC)》 *
张轮 等: "基于监督机制的城市交通信号多智能强化学习控制方法", 《交通与运输》 *
王祉祈 等: "基于Q-learning算法的单点信号控制研究", 《物流工程与管理》 *
胡宇 等: "基于Q学习的单路口交通信号协调控制", 《计算机与现代化》 *
郭梦杰 等: "基于强化学习的单路口信号控制算法", 《电子测量技术》 *


Also Published As

Publication number Publication date
CN113506450B (en) 2022-05-17

Similar Documents

Publication Publication Date Title
CN109215355A (en) Single-point intersection signal timing optimization method based on deep reinforcement learning
CN113506450B (en) Qspare-based single-point signal timing scheme selection method
CN112216129B (en) Self-adaptive traffic signal control method based on multi-agent reinforcement learning
CN113094875B (en) Method and device for calibrating microscopic traffic simulation system in urban expressway interweaving area
CN106910337A (en) Traffic flow forecasting method based on the glowworm swarm algorithm and RBF neural network
CN113223305B (en) Multi-intersection traffic light control method and system based on reinforcement learning and storage medium
JP2010134863A (en) Control input determination means of control object
CN114170789B (en) Intelligent network link lane change decision modeling method based on space-time diagram neural network
CN107506865A (en) A kind of load forecasting method and system based on LSSVM optimizations
CN113780624A (en) City road network signal coordination control method based on game equilibrium theory
CN106874555A (en) Power-consumption and area optimization method for Reed-Muller logic circuits
CN109858559B (en) Self-adaptive traffic analysis road network simplification method based on traffic flow macroscopic basic graph
CN113657433B (en) Multi-mode prediction method for vehicle track
Hu et al. Lane-level navigation based eco-approach
CN111258314A (en) Collaborative evolution-based decision-making emergence method for automatic driving vehicle
CN114186709A (en) Energy prediction method for optimizing key parameters of gray model based on emperor butterfly algorithm
Meyer Convergence control in ACO
CN109752952A (en) Method and device for acquiring multi-dimensional random distribution and strengthening controller
WO2023178581A1 (en) Quantum-walk-based multi-scale feature parsing method for flow of online hailed vehicles
CN113188243B (en) Comprehensive prediction method and system for air conditioner energy consumption
CN113359449B (en) Aeroengine double-parameter index degradation maintenance method based on reinforcement learning
CN111581887B (en) Unmanned vehicle intelligent training method based on simulation learning in virtual environment
CN105677936A (en) Self-adaptive recursive multi-step prediction method of demand torque for mechanical-electrical compound drive system
CN115146499A (en) Cage radiator optimization design method based on GWO-SVM model
Jin et al. A multi-objective multi-agent framework for traffic light control

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP02 Change in the address of a patent holder

Address after: 311100 Room 108, Building 5, Pinggao Entrepreneurship City, Liangzhu Street, Yuhang District, Hangzhou City, Zhejiang Province

Patentee after: Zhejiang Haikang Zhilian Technology Co.,Ltd.

Address before: 314500 room 116, 1 / F, building 2, No.87 Hexi, Changfeng street, Wuzhen Town, Tongxiang City, Jiaxing City, Zhejiang Province

Patentee before: Zhejiang Haikang Zhilian Technology Co.,Ltd.