CN113506450A - Q-learning-based single-point signal timing scheme selection method - Google Patents

Q-learning-based single-point signal timing scheme selection method

Info

Publication number
CN113506450A
Authority
CN
China
Prior art keywords
scheme
action
state
value
epsilon
Prior art date
Legal status
Granted
Application number
CN202110856591.2A
Other languages
Chinese (zh)
Other versions
CN113506450B (en)
Inventor
朱海峰
郭敏
温熙华
陈鹏飞
Current Assignee
Zhejiang Haikang Zhilian Technology Co ltd
Original Assignee
Zhejiang Haikang Zhilian Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Zhejiang Haikang Zhilian Technology Co ltd
Priority to CN202110856591.2A
Publication of CN113506450A
Application granted
Publication of CN113506450B
Active legal status
Anticipated expiration

Classifications

    • G - PHYSICS
    • G08 - SIGNALLING
    • G08G - TRAFFIC CONTROL SYSTEMS
    • G08G 1/00 - Traffic control systems for road vehicles
    • G08G 1/07 - Controlling traffic signals
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G08 - SIGNALLING
    • G08G - TRAFFIC CONTROL SYSTEMS
    • G08G 1/00 - Traffic control systems for road vehicles
    • G08G 1/01 - Detecting movement of traffic to be counted or controlled
    • G08G 1/0104 - Measuring and analyzing of parameters relative to traffic conditions
    • G08G 1/0125 - Traffic data processing
    • G - PHYSICS
    • G08 - SIGNALLING
    • G08G - TRAFFIC CONTROL SYSTEMS
    • G08G 1/00 - Traffic control systems for road vehicles
    • G08G 1/07 - Controlling traffic signals
    • G08G 1/08 - Controlling traffic signals according to detected number or speed of vehicles
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 - Road transport of goods or passengers
    • Y02T 10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 - Engine management systems

Abstract

A Q-learning-based single-point signal timing scheme selection method balances the stability and flexibility of signal timing optimization. Starting from the original fixed timing scheme for an intersection time period, it explores and selects schemes within upper and lower safe search areas around that scheme, achieving stable control, while also responding promptly to relatively long-term slow or abnormal changes within the period, achieving flexible control. Through continuous training, a signal timing scheme matching the current traffic environment state is selected according to the finally obtained Q-value table.

Description

Q-learning-based single-point signal timing scheme selection method
Technical Field
The invention relates to the field of traffic signal control, in particular to a Q-learning-based single-point signal timing scheme selection method.
Background
At present, intersection signal control often adopts a multi-period fixed timing scheme. Such simply designed schemes cannot adapt to long-term or short-term changes in the traffic environment, causing unnecessary delay and even congestion in some periods. Real-time optimization of the scheme within each period is therefore necessary; however, common real-time optimization methods either do not learn from feedback and involve complex calculation, or change too flexibly to be safe, which hinders deployment and routine operation, so they cannot fully meet the demands of dynamic traffic signal timing.
There are related patent cases in the prior art, such as:
The patent "single-point signal control optimization method based on intersection traffic records" (patent application number: 201610971018.5) analyzes a green-light utilization index from the traffic flow and queuing conditions and shortens the green time when green time is left over. However, the method is only suitable for small traffic flows at the intersection and cannot generate a suitable signal control scheme when a sudden large flow occurs.
In the patent of 'intersection signal timing optimization method for reducing motor vehicle exhaust emission' (patent application number: 201510628335.2), a signal timing optimization model for minimizing the motor vehicle emission is constructed according to traffic flow theory and operation research, but the scheme needs to calibrate an emission factor and solve the model by adopting quadratic programming. The method is complex in implementation and calculation process and is not beneficial to practical application.
Disclosure of Invention
In view of the problems described in the background, the invention provides a Q-learning-based single-point signal timing scheme selection method. It distinguishes normal and abnormal traffic environment states within a time period, selects and executes a timing scheme action in the corresponding state, applies it to the current traffic environment, analyzes the intersection state, and gives corresponding reward or punishment feedback according to that state; the reward or punishment reinforces the mapping between environment states and optimal scheme selection. By repeating this mapping process, the learning model acquires the ability to select the best scheme under both normal and abnormal environmental states within the period. The invention is further elucidated below.
S1, state space definition;
To be able to describe both the normal state and the abnormal state, the state space is defined as S = (C, F), where C represents the state set and F represents the state switch.
To let the invention converge quickly and respond quickly to changes in the traffic environment, the state set is kept simple. The fixed timing scheme running in a given time period is selected as the reference scheme, and l-1 schemes are expanded in each of the upward and downward directions of the reference scheme, with l chosen according to the actual application. The state set C thus contains 2l-1 schemes in total, C = (P1, P2, ..., Pl, ..., P(2l-1)), where Pl is the original reference scheme, P1 is downward expansion scheme 1, P(l-1) is downward expansion scheme l-1, and P(2l-1) is upward expansion scheme 2l-1.
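The expansion described above can be sketched in a few lines; the per-level scaling step (5% here) and the rounding rule are illustrative assumptions, since the invention leaves the exact expansion rule to the practitioner:

```python
def build_state_set(baseline, l, step=0.05):
    """Expand a baseline timing scheme into 2l-1 schemes:
    l-1 downward expansions, the baseline itself, l-1 upward expansions.
    `baseline` is a list of phase green times in seconds; `step` is a
    hypothetical per-level scaling factor."""
    schemes = []
    for level in range(-(l - 1), l):       # levels -(l-1) .. +(l-1)
        factor = 1.0 + level * step
        schemes.append([round(t * factor) for t in baseline])
    return schemes                          # schemes[l-1] is the baseline Pl

# with l = 3 this yields a 5-scheme state set C = (P1, ..., P5)
C = build_state_set([54, 34, 44], l=3)
```

With l = 3 the list has 2l-1 = 5 entries and the middle entry equals the reference scheme.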
In order to distinguish whether the traffic state is abnormal or not, a switching value F for normal and abnormal states is set:
F = 1 (abnormal state) when (y_now - y_last)/y_last > e; F = 0 (normal state) otherwise
In the formula, y is the key flow ratio of the intersection. Assuming the intersection has j phases,

y = Σ_{i=1..j} y_i, with y_i = q_i / s_i,

where q_i is the flow of the critical traffic stream of phase i and s_i is the saturation flow of the lane. y_now is the current key flow ratio and y_last is the normal key flow ratio for the period, obtained from data analysis; when the relative increase over y_last exceeds e, the state can be judged abnormal, and e can be set according to the actual intersection conditions.
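The key flow ratio and the switch F above can be sketched as follows; the function name and the example figures (critical flows, saturation flow) are illustrative assumptions:

```python
def state_switch(q, s, y_last, e):
    """y = sum of q_i / s_i over the j phases (key flow ratio);
    F = 1 (abnormal) when y rises more than fraction e above the
    period's normal ratio y_last, else F = 0 (normal)."""
    y_now = sum(qi / si for qi, si in zip(q, s))
    F = 1 if (y_now - y_last) / y_last > e else 0
    return y_now, F

# three-phase example: critical flows 300/800/400 veh/h, saturation 1800 veh/h
y, F = state_switch([300, 800, 400], [1800, 1800, 1800], y_last=0.6, e=0.2)
```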
S2, defining an action space;
In single-point signal timing optimization, a complete action space contains all possible actions of the intersection within a time step, i.e. all possible signal timing schemes. Since too large an action space slows the algorithm's convergence, the actions are simplified to the selection of w schemes. The action space is defined as A = (a1, a2, ..., am, ..., aw), where am is the m-th signal timing scheme in the action space. The cycle differs between schemes; within each scheme, the phase durations can be allocated and adjusted according to the flow ratios of the critical traffic streams of each phase.
In order to simplify the algorithm, the action space in the abnormal state and the action space in the normal state are set to be the same, and the action space needs to be simultaneously covered to the timing scheme space in the normal state and the abnormal state; in practical application, the action spaces can be set according to the normal state and the abnormal state respectively, and the action spaces of the normal state and the abnormal state can be set to be different.
S3, a reward function;
The return function can be computed from index values such as delay time, number of stops, and queue length; these values can be obtained directly from simulation software. Here, the average vehicle delay at the intersection is selected as the evaluation index.
Firstly, the upper limit values d of the delay variation ranges of the different categories at the intersection within the period are obtained by a clustering algorithm. As shown in fig. 3, the ordinate is the cluster category and the abscissa is the average vehicle delay value in seconds; on category 0, the markers denote the cluster center of the normal-delay category and the upper limit of the normal delay value at the 80% quantile; on category 1, the markers denote the cluster center of the abnormal-delay category and the upper limit of the abnormal delay value at the 80% quantile.
The reward and punishment function is:
[reward-function equation image: r_t(s, a) takes the value +1 or -1 according to whether the post-action delay d_tk stays within the clustered upper limit d]
in the formula: d_t0 is the delay before the action is executed, d_tk is the delay after the action is executed.
To prevent sudden delay changes caused by traffic fluctuation from making the reward function oscillate, a consecutive-same-action flag b is set: if the same action is performed twice in a row, b = 2; three times in a row, b = 3; and so on, b increasing by 1 for each additional consecutive repeat; if the run of the same action is interrupted, b = 1.
The feedback r_t(s, a) is adjusted according to b and dif, with the following rules:
r_t(s, a) = 0, dif < k; r_t(s, a) = -1, dif ≥ k (for b = 2 and r_t(s, a) = -1)

dif = d_tk - d_t0
When b = 2 and r_t(s, a) = -1, the selected scheme has been selected for the second time; since the action selection strategy of the invention adopts a greedy algorithm, the scheme is known to be a relatively good one, and the delay rise may be due to traffic volatility. When the delay rise is small, i.e. dif < k, r_t(s, a) can be corrected to 0; when the rise is large, i.e. dif ≥ k, r_t(s, a) = -1 can be kept. The value of k can be set empirically.
r_t(s, a) = -1, dif < k; r_t(s, a) = -b + 1, dif ≥ k (for b > 2 and r_t(s, a) = -1)
When b > 2 and r_t(s, a) = -1, the selected scheme has been selected three or more times in succession and is known to be a good scheme; the delay rise may be due to traffic volatility or to a change in the traffic environment. When the rise is small, i.e. dif < k, r_t(s, a) = -1 can be kept; when it is large, i.e. dif ≥ k, r_t(s, a) can be corrected to -b + 1, strengthening the feedback value of the environmental change.
r_t(s, a) = 2 when r_t(s, a) = 1 and b = 2;
When r_t(s, a) = 2 and b = 2, b is reset to 1; otherwise b would keep growing while the same good action repeats, and a later correction r_t(s, a) = -b + 1, or even a small negative value, would cause strong oscillation and non-convergence.
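The reward-and-correction logic of S3 can be collected into one function. The base ±1 reward here compares the post-action delay with the clustered upper limit d, which is one plausible reading of the patent's reward-function image; the b-flag corrections follow the rules stated above:

```python
def adjusted_reward(d_t0, d_tk, d, b, k=10):
    """Return the adjusted feedback r_t(s, a).
    d_t0: delay before the action; d_tk: delay after the action;
    d: clustered upper limit of the delay range; b: consecutive-same-action
    flag; k: empirical threshold on the delay rise dif = d_tk - d_t0."""
    dif = d_tk - d_t0
    r = 1 if d_tk <= d else -1            # assumed base reward (see lead-in)
    if r == -1 and b == 2:
        r = 0 if dif < k else -1          # 2nd repeat: forgive a small rise
    elif r == -1 and b > 2:
        r = -1 if dif < k else -b + 1     # 3+ repeats: amplify the penalty
    elif r == 1 and b == 2:
        r = 2                             # confirmed good action (then reset b)
    return r
```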
S4, updating the Q value table;
The Q value is updated using the Bellman optimality equation:
Q_{t+1}(s_t, a_t) = (1 - α_t) Q_t(s_t, a_t) + α_t (r_{t+1} + γ max Q_t(s_{t+1}, a_{t+1}));
In the invention, two Q-value tables are required, one recording normal traffic and one recording abnormal traffic; the other parameters can be set uniformly. α is the learning rate and γ is the discount factor: the larger the learning rate, the less of the previous training is retained; the larger the discount factor γ, the more weight future rewards carry. α and γ can be determined according to the characteristics of the specific intersection.
The action selection strategy adopts a greedy algorithm, namely the ε-greedy exploration strategy: a self-increasing value ε and a randomly generated number r ∈ [0, 1] are compared to select the learning action. Selection rule: when r < ε, the action with the maximum Q value in the current state is selected; when r ≥ ε, an action is selected at random. ε ∈ [ε1, ε2], with the self-increment rule: when the iteration count n ≤ N1, ε = ε1 + (ε2 - ε1)/N1 × n; when n > N1, ε = ε2.
According to the Q-learning rule, the Q-value table is a (2l-1) × w matrix, and the Q values of the different actions in each state are updated iteratively according to the Bellman equation. The aim is for the Q value of the optimal action in each state to become the largest, so that the optimal action is selected with ever higher probability and non-optimal actions with ever lower probability; once the Q-value matrix converges, the optimal action can be selected with high probability in every state.
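A minimal sketch of the ε-greedy selection and the Bellman update, using the self-increment rule above; α = 0.1 and γ = 0.9 are placeholder values, since the patent leaves them to the characteristics of the specific intersection:

```python
import random

def epsilon(n, n1=500, eps1=0.7, eps2=0.9):
    """Self-increasing exploration threshold: linear ramp up to n1, then flat."""
    return eps1 + (eps2 - eps1) / n1 * n if n <= n1 else eps2

def select_action(Q, s, n, rng=random):
    """When r < eps exploit (argmax over Q[s]); otherwise explore at random."""
    if rng.random() < epsilon(n):
        return max(range(len(Q[s])), key=lambda a: Q[s][a])
    return rng.randrange(len(Q[s]))

def update_q(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Q(s,a) <- (1-alpha)*Q(s,a) + alpha*(r + gamma * max_a' Q(s',a'))."""
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * (r + gamma * max(Q[s_next]))

# two (2l-1) x w tables, one for normal and one for abnormal traffic
Q_normal = [[0.0] * 5 for _ in range(5)]
Q_abnormal = [[0.0] * 5 for _ in range(5)]
```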
Has the advantages that: compared with the prior art, the method 1) explores and selects schemes within upper and lower safe search areas around the original fixed timing scheme of the intersection period, achieving stable control; 2) at the same time responds promptly to relatively long-term slow or abnormal changes within the period, reflecting flexible control. Stability and flexibility of signal timing optimization are thus both achieved, stably and flexibly improving the operation of the intersection.
Drawings
FIG. 1: schematic diagram of the timing scheme selection method of the invention;
FIG. 2: schematic diagram of scheme selection in the invention;
FIG. 3: clustering result of vehicle delays at the intersection;
FIG. 4: initial values of the Q-value table;
FIG. 5: Q-value table after training convergence;
FIG. 6: delay comparison curve between the algorithm and fixed timing.
Detailed Description
A specific embodiment of the present invention will be described in detail with reference to the accompanying drawings.
A Q-learning-based single-point signal timing scheme selection method balances the stability and flexibility of signal timing optimization. Starting from the original fixed timing scheme for an intersection time period, it explores and selects schemes within upper and lower safe search areas around that scheme, achieving stable control, while also responding promptly to relatively long-term slow or abnormal changes within the period, achieving flexible control. Through continuous training, a signal timing scheme matching the current traffic environment state is selected according to the finally obtained Q-value table.
The method comprises the following steps:
step 1, determining a state space;
The state space is defined as S = (C, F), where C represents the state set and F represents the state switch.
For ease of understanding, take a three-phase intersection on an urban arterial road as an example. The fixed timing scheme running in a given period at the intersection is taken as the reference scheme, and two schemes are expanded in each of the upward and downward directions of the reference scheme. The state set C then has 5 schemes, C = (P1, P2, P3, P4, P5), where P3 is the original reference scheme, P1 and P2 are downward expansion schemes 1 and 2, and P4 and P5 are upward expansion schemes 4 and 5.
The 5 schemes in the state set are set as follows: scheme P3 has phase durations t1(3), t2(3), t3(3) and cycle Cycle3; expansion scheme P1 has phase durations t1(1), t2(1), t3(1) and cycle Cycle1; expansion scheme P2 has phase durations t1(2), t2(2), t3(2) and cycle Cycle2; expansion scheme P4 has phase durations t1(4), t2(4), t3(4) and cycle Cycle4; expansion scheme P5 has phase durations t1(5), t2(5), t3(5) and cycle Cycle5.
In order to distinguish whether the traffic state is abnormal or not, the present embodiment sets a switching amount F for the normal and abnormal states.
F = 1 (abnormal state) when (y_now - y_last)/y_last > e; F = 0 (normal state) otherwise
In the formula, y is the key flow ratio of the intersection. Taking the three-phase intersection as the example, y = (q1 + q2 + q3)/s, where q1, q2, q3 are the flows of the critical traffic streams of phase 1, phase 2 and phase 3 respectively, and s is the saturation flow of the lane (assumed here to be the same for all lanes). y_now is the current key flow ratio and y_last is the normal key flow ratio for the period, obtained from data analysis; when the relative increase over y_last exceeds e the state can be judged abnormal, and e can be set according to the actual intersection conditions.
Step 2, determining an action space;
Since too large an action space slows the algorithm's convergence, the actions are simplified to the selection of 5 schemes. The action space is defined as A = (P1, P2, P3, P4, P5), where P1, P2, P3, P4, P5 are identical to P1, P2, P3, P4, P5 in the state space. The action space in the abnormal state is set to be the same as in the normal state, and must cover the timing scheme space of both the normal and abnormal states.
Step 3, determining a return function;
selecting average delay of vehicles at the intersection as an evaluation index;
Firstly, the upper limit values d of the delay variation ranges of the different categories at the intersection within the period are obtained by a clustering algorithm; as shown in fig. 3, the normal delay upper limit is d = 44 s and the abnormal delay upper limit is d = 66 s.
the reward and punishment function is:
[reward-function equation image: r_t(s, a) takes the value +1 or -1 according to whether the post-action delay d_tk stays within the clustered upper limit d]
in the formula: d_t0 is the delay before the action is executed, d_tk is the delay after the action is executed.
To prevent sudden delay changes caused by traffic fluctuation from making the reward function oscillate, a consecutive-same-action flag b is set: if the same action is performed twice in a row, b = 2; three times in a row, b = 3; and so on, b increasing by 1 for each additional consecutive repeat; if the run of the same action is interrupted, b = 1;
The feedback r_t(s, a) is adjusted according to b and dif, with the following rules (here k is set to 10):
r_t(s, a) = 0, dif < 10; r_t(s, a) = -1, dif ≥ 10 (for b = 2 and r_t(s, a) = -1)

dif = d_tk - d_t0
When b = 2 and r_t(s, a) = -1, the selected scheme has been selected for the second time; since the action selection strategy of the algorithm adopts a greedy algorithm, the scheme is known to be a relatively good one, and the delay rise may be due to traffic volatility. When the delay rise is small, i.e. dif < 10, r_t(s, a) can be corrected to 0; when the rise is large, i.e. dif ≥ 10, r_t(s, a) = -1 can be kept.
r_t(s, a) = -1, dif < 10; r_t(s, a) = -b + 1, dif ≥ 10 (for b > 2 and r_t(s, a) = -1)
When b > 2 and r_t(s, a) = -1, the selected scheme has been selected three or more times in succession and is known to be a good scheme; the delay rise may be due to traffic volatility or to a change in the traffic environment. When the rise is small, i.e. dif < 10, r_t(s, a) = -1 can be kept; when it is large, i.e. dif ≥ 10, r_t(s, a) can be corrected to -b + 1, strengthening the feedback value of the environmental change.
r_t(s, a) = 2 when r_t(s, a) = 1 and b = 2;
When r_t(s, a) = 2 and b = 2, b is reset to 1; otherwise b would keep growing while the same good action repeats, and a later correction r_t(s, a) = -b + 1, or even a small negative value, would cause strong oscillation and non-convergence.
Step 4, determining the updating of the Q value table;
The Q value is updated using the Bellman optimality equation:
Q_{t+1}(s_t, a_t) = (1 - α_t) Q_t(s_t, a_t) + α_t (r_{t+1} + γ max Q_t(s_{t+1}, a_{t+1}));
N1 is set to 500 and ε ∈ [0.7, 0.9], with the self-increment rule: when the iteration count n ≤ 500, ε = 0.7 + 0.2/500 × n; when n > 500, ε = 0.9.
The Q values are expressed as a 5 × 5 matrix. As shown in fig. 4, the rows s1, s2, s3, s4, s5 denote the 5 states (for ease of understanding, the conventional reinforcement-learning letter s is used; the corresponding states here are P1 to P5), and the columns a1, a2, a3, a4, a5 denote the 5 actions (likewise, the conventional letter a is used; the corresponding actions here are P1 to P5).
The single point scheme selection method of the present invention is further illustrated below with reference to examples:
example 1: take an intersection in a city of a province as an example.
The 5 schemes in the state set are set as follows: the phase durations of scheme P3 are 54 s, 34 s and 44 s, with a cycle of 132 s; P1 is expansion scheme 1, with phase durations 30 s, 24 s and 32 s and a cycle of 86 s; P2 is expansion scheme 2, with phase durations 43 s, 29 s and 40 s and a cycle of 112 s; P4 is expansion scheme 4, with phase durations 56 s, 35 s and 46 s and a cycle of 137 s; P5 is expansion scheme 5, with phase durations 59 s, 36 s and 47 s and a cycle of 142 s.
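The five embodiment schemes can be written down and sanity-checked (a hypothetical encoding): each stated cycle equals the sum of its phase durations, e.g. 54 + 34 + 44 = 132.

```python
# (phase durations in seconds, cycle in seconds) for each scheme
schemes = {
    "P1": ([30, 24, 32], 86),
    "P2": ([43, 29, 40], 112),
    "P3": ([54, 34, 44], 132),   # original reference scheme
    "P4": ([56, 35, 46], 137),
    "P5": ([59, 36, 47], 142),
}
for phases, cycle in schemes.values():
    assert sum(phases) == cycle  # cycles are consistent with the phase splits
```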
The flow statistics and lane counts for the period are shown in the following table:

Entrance    East    South    West    North
Flow        300     800      300     400
Lanes       2       4        2       4
The initial Q-value table of the algorithm at the single-point intersection is set as shown in fig. 4; running the algorithm code updates the Q-value table, and the learning result is likewise a 5 × 5 matrix, as shown in fig. 5.
During iteration, the intersection scheme is switched continuously according to the learning rule, and Q-value tables with different degrees of convergence are obtained by setting different learning counts N. The larger the value of N, the better the Q-value table converges, but the longer the run takes. In the experiment N = 540, and the total average delay D_i is accumulated over every 30 iterations, giving 18 statistics in total (540/30 = 18):

D_i = Σ_{m=30(i-1)+1}^{30i} d_m

where m is the iteration index (m ∈ [1, 540]), i is the statistic index (i ∈ [1, 18]), d_m is the average vehicle delay of iteration m, and D_i is the sum of the vehicle delays over the i-th batch of 30 iterations.
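The batching of per-iteration delays into the 18 statistics can be sketched as follows, reading D_i as the sum of the average vehicle delays over the i-th batch of 30 iterations:

```python
def batch_total_delays(delays, batch=30):
    """Split the per-iteration average vehicle delays into consecutive
    batches of `batch` iterations and return the sum of each batch (D_i)."""
    return [sum(delays[i:i + batch]) for i in range(0, len(delays), batch)]

# 540 iterations -> 540 / 30 = 18 statistics
D = batch_total_delays([40.0] * 540)
```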
Fig. 6 compares the delay of the invention with fixed timing, with the statistic index i on the abscissa and the total vehicle average delay D_i on the ordinate. As shown in the figure,
compared with other methods, after convergence the normal or abnormal Q-value table of the invention can respond quickly to long-term slow changes of the traffic environment, owing to the return function and selection strategy formulated here. Because the near-optimal action after convergence is selected repeatedly, an action that no longer matches the traffic environment keeps being selected and keeps receiving an ever-growing punishment, so its selection probability drops rapidly until a new converged state is reached.
The invention balances the stability and flexibility of signal timing optimization: based on the original fixed timing scheme of the intersection period, schemes are explored and selected within the upper and lower safe search areas of the intersection, achieving stable control; at the same time, a timely response is made to relatively long-term slow or abnormal changes within the period, reflecting flexible control. The aim of stably and flexibly improving the operation of the intersection is thus finally achieved.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (5)

1. A Q-learning-based single-point signal timing scheme selection method, characterized by comprising the following steps:
s1, state space definition;
defining a state space as S = (C, F), where C represents the state set and F represents the state switch;
simplifying the state set: a fixed timing scheme running in a given time period is selected as the reference scheme Pl, and l-1 schemes are expanded in each of the upward and downward directions of the reference scheme Pl, the value of l being chosen according to the actual application; the state set C then has 2l-1 schemes in total, C = (P1, P2, ..., Pl, ..., P(2l-1)), where Pl is the original reference scheme, P1 is downward expansion scheme 1, P(l-1) is downward expansion scheme l-1, and P(2l-1) is upward expansion scheme 2l-1;
setting a switching value F to distinguish whether the traffic state is abnormal or not:
F = 1 (abnormal state) when (y_now - y_last)/y_last > e; F = 0 (normal state) otherwise;

y = Σ_{i=1..j} q_i / s_i;
in the formula, y is the key flow ratio of the intersection with j phases, q_i is the flow of the critical traffic stream of phase i, s_i is the saturation flow of the lane, y_now is the current key flow ratio, and y_last is the normal key flow ratio for the period, obtained from historical data analysis; when the relative increase over y_last exceeds e, the state can be judged abnormal, and e can be set according to the actual intersection conditions;
s2, defining an action space;
a complete action space comprises all possible signal timing schemes of the intersection within a time step, and is defined as: A = (a1, a2, ..., am, ..., aw);
wherein am is the m-th signal timing scheme in the action space; the cycle differs between schemes, and the phase durations within each scheme can be allocated and adjusted according to the flow ratios of the critical traffic streams of each phase;
s3, a reward function;
the return function is calculated from the delay time, which is obtained directly from simulation software or calculated in practical application;
firstly, analyzing and obtaining the upper limit value d of different types of delay variation ranges in the time interval of the intersection through a clustering algorithm;
next, define the reward and penalty function as:
[reward-function equation image: r_t(s, a) takes the value +1 or -1 according to whether the post-action delay d_tk stays within the clustered upper limit d]
in the formula: d_t0 is the delay before the action is executed, d_tk is the delay after the action is executed;
setting a consecutive-same-action flag b: if the same action is performed twice in a row, b = 2; three times in a row, b = 3; and so on, b increasing by 1 for each additional consecutive repeat; if the run of the same action is interrupted, b = 1;
adjusting the feedback r_t(s, a) according to b and dif, with the following rules:
r_t(s, a) = 0, dif < k; r_t(s, a) = -1, dif ≥ k (for b = 2 and r_t(s, a) = -1);

dif = d_tk - d_t0
when b = 2 and r_t(s, a) = -1, the selected scheme has been selected for the second time, and since the action selection strategy adopts a greedy algorithm, the scheme is known to be a relatively good one; when the delay rise is small, i.e. dif < k, r_t(s, a) is corrected to 0; when the rise is large, i.e. dif ≥ k, r_t(s, a) = -1 is kept, the value of k being set empirically;
r_t(s, a) = -1, dif < k; r_t(s, a) = -b + 1, dif ≥ k (for b > 2 and r_t(s, a) = -1);
when b > 2 and r_t(s, a) = -1, the selected scheme has been selected three or more times in succession and is known to be a relatively good scheme; when the delay rise is small, i.e. dif < k, r_t(s, a) = -1 is kept; when the rise is large, i.e. dif ≥ k, r_t(s, a) is corrected to -b + 1, strengthening the feedback value of the environmental change;
r_t(s, a) = 2 when r_t(s, a) = 1 and b = 2;
when r_t(s, a) = 2 and b = 2, b is reset to 1; otherwise b would keep growing while the same good action repeats, and a later correction r_t(s, a) = -b + 1, or even a small negative value, would cause strong oscillation and non-convergence.
S4, updating the Q value table;
establishing two Q-value tables, one recording normal traffic and one recording abnormal traffic, with the other parameters set uniformly; the Q value is updated using the Bellman optimality equation:
Q_{t+1}(s_t, a_t) = (1 - α_t) Q_t(s_t, a_t) + α_t (r_{t+1} + γ max Q_t(s_{t+1}, a_{t+1}));
wherein α is the learning rate and γ is the discount factor, both determined according to the characteristics of the specific intersection;
the action selection strategy is an ε-greedy exploration strategy; according to the Q-learning rule, the Q-value table is a (2l-1) × w matrix, and the Q values of the different actions in each state are updated iteratively according to the Bellman equation.
2. The timing scheme selection method according to claim 1, wherein the ε-greedy exploration strategy selects the learning action by comparing a self-incrementing ε value with a randomly generated number r ∈ [0, 1], with the following selection rule:
when r < ε, the action with the largest Q value in the current state is selected;
when r >= ε, an action is selected at random and executed in the current state;
ε ∈ [ε_1, ε_2], and ε increments as follows:
when the iteration number n <= N_1, ε = ε_1 + (ε_2 - ε_1)/N_1 · n; when n > N_1, ε = ε_2.
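The schedule and selection rule of claim 2 can be sketched as below. The values ε_1 = 0.1, ε_2 = 0.9, and N_1 = 1000 are illustrative assumptions, not values from the claims. Note that the claim inverts the usual ε-greedy convention: here ε is the probability of exploitation, so the ramp makes the agent greedier over time.

```python
import random

def epsilon_at(n, eps1=0.1, eps2=0.9, n1=1000):
    # Linear self-increment from eps1 to eps2 over the first n1
    # iterations, then held at eps2 (the schedule in claim 2).
    return eps1 + (eps2 - eps1) / n1 * n if n <= n1 else eps2

def select_action(q_row, n):
    # Claim 2's rule: r < eps exploits (argmax of the Q row for the
    # current state), r >= eps explores with a random action.
    if random.random() < epsilon_at(n):
        return max(range(len(q_row)), key=lambda a: q_row[a])
    return random.randrange(len(q_row))
```

At n = 0 the agent exploits only 10% of the time; by n = N_1 that has risen to 90% and stays there.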
3. The timing scheme selection method according to claim 2, wherein an action is the selection of one of 5 candidate schemes, and the action space is defined as A = (P1, P2, P3, P4, P5), where P1, P2, P3, P4, P5 are identical to P1, P2, P3, P4, P5 in the state space.
4. The timing scheme selection method of claim 3, wherein the action space in the abnormal state is set to be the same as that in the normal state, so that a single action space covers the timing-scheme spaces of both the normal and the abnormal state.
5. The timing scheme selection method of claim 4, wherein the value of k is empirically set to 10.
CN202110856591.2A 2021-07-28 2021-07-28 Qspare-based single-point signal timing scheme selection method Active CN113506450B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110856591.2A CN113506450B (en) 2021-07-28 2021-07-28 Qspare-based single-point signal timing scheme selection method


Publications (2)

Publication Number Publication Date
CN113506450A true CN113506450A (en) 2021-10-15
CN113506450B CN113506450B (en) 2022-05-17

Family

ID=78014271

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110856591.2A Active CN113506450B (en) 2021-07-28 2021-07-28 Qspare-based single-point signal timing scheme selection method

Country Status (1)

Country Link
CN (1) CN113506450B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115078798A (en) * 2022-07-26 2022-09-20 武汉格蓝若智能技术有限公司 Current range switching method and current collecting device

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105654744A (en) * 2016-03-10 2016-06-08 同济大学 Improved traffic signal control method based on Q learning
CN108335497A (en) * 2018-02-08 2018-07-27 南京邮电大学 A kind of traffic signals adaptive control system and method
CN108510764A (en) * 2018-04-24 2018-09-07 南京邮电大学 A kind of adaptive phase difference coordinated control system of Multiple Intersections and method based on Q study
US20180261085A1 (en) * 2017-03-08 2018-09-13 Fujitsu Limited Adjustment of a learning rate of q-learning used to control traffic signals
CN109035812A (en) * 2018-09-05 2018-12-18 平安科技(深圳)有限公司 Control method, device, computer equipment and the storage medium of traffic lights
CN109215355A (en) * 2018-08-09 2019-01-15 北京航空航天大学 Single-point intersection signal timing optimization method based on deep reinforcement learning
CN111081035A (en) * 2019-12-17 2020-04-28 扬州市鑫通智能信息技术有限公司 Traffic signal control method based on Q learning
CN111243271A (en) * 2020-01-11 2020-06-05 多伦科技股份有限公司 Single-point intersection signal control method based on deep cycle Q learning
CN112700664A (en) * 2020-12-19 2021-04-23 北京工业大学 Traffic signal timing optimization method based on deep reinforcement learning
CN112950963A (en) * 2021-01-25 2021-06-11 武汉工程大学 Self-adaptive signal control optimization method for main branch intersection of city
CN112991750A (en) * 2021-05-14 2021-06-18 苏州博宇鑫交通科技有限公司 Local traffic optimization method based on reinforcement learning and generation type countermeasure network


Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
PÉTER PÁLOS: "Comparison of Q-Learning based Traffic Light Control Methods and Objective Functions", 《2020 INTERNATIONAL CONFERENCE ON SOFTWARE, TELECOMMUNICATIONS AND COMPUTER NETWORKS (SOFTCOM)》 *
YING LIU: "Intelligent traffic light control using distributed multi-agent Q learning", 《2017 IEEE 20TH INTERNATIONAL CONFERENCE ON INTELLIGENT TRANSPORTATION SYSTEMS (ITSC)》 *
张轮 等: "基于监督机制的城市交通信号多智能强化学习控制方法", 《交通与运输》 *
王祉祈 等: "基于Q-learning算法的单点信号控制研究", 《物流工程与管理》 *
胡宇 等: "基于Q学习的单路口交通信号协调控制", 《计算机与现代化》 *
郭梦杰 等: "基于强化学习的单路口信号控制算法", 《电子测量技术》 *


Also Published As

Publication number Publication date
CN113506450B (en) 2022-05-17

Similar Documents

Publication Publication Date Title
CN109215355A (en) Single-point intersection signal timing optimization method based on deep reinforcement learning
CN113506450B (en) Qspare-based single-point signal timing scheme selection method
CN112216129B (en) Self-adaptive traffic signal control method based on multi-agent reinforcement learning
CN113094875B (en) Method and device for calibrating microscopic traffic simulation system in urban expressway interweaving area
CN106910337A (en) Traffic flow forecasting method based on the glowworm swarm algorithm and RBF neural network
CN113223305B (en) Multi-intersection traffic light control method and system based on reinforcement learning and storage medium
JP2010134863A (en) Control input determination means of control object
CN114170789B (en) Intelligent network link lane change decision modeling method based on space-time diagram neural network
CN107506865A (en) A kind of load forecasting method and system based on LSSVM optimizations
CN113780624A (en) City road network signal coordination control method based on game equilibrium theory
CN106874555A (en) Power-consumption and area optimization method for Reed-Muller logic circuits
CN109858559B (en) Self-adaptive traffic analysis road network simplification method based on traffic flow macroscopic basic graph
CN113657433B (en) Multi-mode prediction method for vehicle track
Hu et al. Lane-level navigation based eco-approach
CN111258314A (en) Collaborative evolution-based decision-making emergence method for automatic driving vehicle
CN114186709A (en) Energy prediction method for optimizing key parameters of gray model based on emperor butterfly algorithm
Meyer Convergence control in ACO
CN109752952A (en) Method and device for acquiring multi-dimensional random distribution and strengthening controller
WO2023178581A1 (en) Quantum-walk-based multi-scale feature parsing method for flow of online hailed vehicles
CN113188243B (en) Comprehensive prediction method and system for air conditioner energy consumption
CN113359449B (en) Aeroengine double-parameter index degradation maintenance method based on reinforcement learning
CN111581887B (en) Unmanned vehicle intelligent training method based on simulation learning in virtual environment
CN105677936A (en) Self-adaptive recursive multi-step prediction method of demand torque for mechanical-electrical compound drive system
CN115146499A (en) Cage radiator optimization design method based on GWO-SVM model
Jin et al. A multi-objective multi-agent framework for traffic light control

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP02 Change in the address of a patent holder

Address after: 311100 Room 108, Building 5, Pinggao Entrepreneurship City, Liangzhu Street, Yuhang District, Hangzhou City, Zhejiang Province

Patentee after: Zhejiang Haikang Zhilian Technology Co.,Ltd.

Address before: 314500 room 116, 1 / F, building 2, No.87 Hexi, Changfeng street, Wuzhen Town, Tongxiang City, Jiaxing City, Zhejiang Province

Patentee before: Zhejiang Haikang Zhilian Technology Co.,Ltd.