CN113506450A - Q-learning-based single-point signal timing scheme selection method - Google Patents
Q-learning-based single-point signal timing scheme selection method
- Publication number
- CN113506450A (application CN202110856591.2A)
- Authority
- CN
- China
- Prior art keywords
- scheme
- action
- state
- value
- epsilon
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G08—SIGNALLING
- G08G—TRAFFIC CONTROL SYSTEMS
- G08G1/00—Traffic control systems for road vehicles
- G08G1/07—Controlling traffic signals
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G08—SIGNALLING
- G08G—TRAFFIC CONTROL SYSTEMS
- G08G1/00—Traffic control systems for road vehicles
- G08G1/01—Detecting movement of traffic to be counted or controlled
- G08G1/0104—Measuring and analyzing of parameters relative to traffic conditions
- G08G1/0125—Traffic data processing
-
- G—PHYSICS
- G08—SIGNALLING
- G08G—TRAFFIC CONTROL SYSTEMS
- G08G1/00—Traffic control systems for road vehicles
- G08G1/07—Controlling traffic signals
- G08G1/08—Controlling traffic signals according to detected number or speed of vehicles
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
A single-point signal timing scheme selection method based on Q-learning balances the stability and flexibility of signal timing optimization. Taking the original fixed timing scheme for the current period at an intersection as a baseline, it explores and selects schemes within safe search regions above and below that baseline, which keeps control stable; at the same time, it responds promptly to relatively long-term slow or abnormal changes within the period, which keeps control flexible. Through continuous training, a signal timing scheme matching the current traffic environment state is selected according to the finally obtained Q value table.
Description
Technical Field
The invention relates to the field of traffic signal control, and in particular to a Q-learning-based single-point signal timing scheme selection method.
Background
At present, intersection signal control usually adopts a multi-period fixed timing scheme. Such simply designed schemes cannot adapt to long- or short-term changes in the traffic environment, causing unnecessary delay and even congestion in some periods. Real-time optimization of the scheme within each period is therefore necessary, but common real-time optimization methods either do not learn from feedback and involve complex computation, or change too flexibly to be safe; this hinders deployment and routine operation and cannot fully meet the demands of dynamic traffic signal timing.
There are related patent cases in the prior art, such as:
The patent "single-point signal control optimization method based on intersection traffic records" (application No. 201610971018.5) analyzes a green-light utilization index from traffic flow and queuing conditions and shortens the green time when there is surplus. However, the method only suits intersections with light traffic and cannot generate a suitable control scheme under sudden heavy flow.
The patent "intersection signal timing optimization method for reducing motor vehicle exhaust emission" (application No. 201510628335.2) constructs a signal timing optimization model minimizing vehicle emissions from traffic flow theory and operations research, but the scheme requires calibrating an emission factor and solving the model by quadratic programming. Its implementation and computation are complex, which hinders practical application.
Disclosure of Invention
In view of the problems raised in the background, the invention provides a Q-learning-based single-point signal timing scheme selection method. It distinguishes normal and abnormal traffic environment states within a period, selects and executes a timing scheme action in the corresponding state, applies it to the current traffic environment, analyzes the resulting intersection state, and gives reward or punishment feedback accordingly; the reward or punishment reinforces the mapping between environment states and optimal scheme selection. By repeating this mapping process, the learning model acquires the ability to select the best scheme under both normal and abnormal environment states within the period. The invention is further elaborated below.
S1, state space definition;
To be able to describe both the normal state and the abnormal state, the state space is defined as S = (C, F), where C represents the state set and F represents the state switch.
In order to make the invention converge quickly and respond quickly to changes in the traffic environment, the state set is kept simple. A fixed timing scheme running in a certain period is selected as the baseline scheme, and l−1 schemes are expanded in each of the upward and downward directions from it, the value of l being chosen according to the actual application. The state set C thus contains 2l−1 schemes in total, C = (P1, P2, …, Pl, …, P(2l−1)), where Pl is the original baseline scheme, P1 is downward expansion scheme 1, P(l−1) is downward expansion scheme l−1, and P(2l−1) is upward expansion scheme 2l−1.
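As a sketch of this state-set construction (the helper name, the uniform step between cycles, and the use of the cycle length as the expanded quantity are assumptions for illustration, not specified by the patent), the 2l−1 schemes around a baseline can be generated as:

```python
def build_state_set(base_cycle, l, step):
    """Generate the 2l-1 scheme set C = (P1, ..., P(2l-1)) around a
    baseline cycle: Pl is the baseline, P1..P(l-1) extend downward,
    P(l+1)..P(2l-1) extend upward."""
    return {f"P{i}": base_cycle + (i - l) * step for i in range(1, 2 * l)}

# with l = 3, five schemes are produced, centered on the baseline
schemes = build_state_set(base_cycle=132, l=3, step=10)
```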
In order to distinguish whether the traffic state is abnormal, a switching value F between the normal and abnormal states is set:
F = 1 (abnormal state) when (y_now − y_last)/y_last > e; otherwise F = 0 (normal state).
Here y is the key flow ratio of the intersection. Assuming the intersection has j phases, y = Σ_{i=1}^{j} q_i/s_i, where q_i is the flow of the critical traffic stream of phase i and s_i is the saturation flow of its lane. y_now is the current key flow ratio, and y_last is the historical key flow ratio for the same period, obtained from data analysis; if the relative increase over y_last exceeds e, the state is judged abnormal. e can be set according to the actual intersection conditions.
S2, defining an action space;
In single-point signal timing optimization, a complete action space includes all possible actions of the intersection within a time step, i.e. all possible signal timing schemes. Since too large an action space slows the convergence of the algorithm, the actions are simplified to the selection of w schemes. The action space is defined as a = (a1, a2, …, am, …, aw), where am is the m-th signal timing scheme in the action space. The cycle differs between schemes, and the phase durations within each scheme can be allocated and adjusted according to the flow ratios of the critical traffic streams of the phases.
To simplify the algorithm, the action space in the abnormal state is set to be the same as that in the normal state, so the action space must cover the timing scheme spaces of both states simultaneously. In practical applications, separate and different action spaces may instead be set for the normal and abnormal states.
S3, a reward function;
The return function can be computed from index values such as delay time, number of stops, and queue length, which can be obtained directly from simulation software; here the average vehicle delay at the intersection is selected as the evaluation index.
First, the upper limit values d of the delay variation ranges of the different categories within the period are obtained for the intersection through a clustering algorithm. As shown in Fig. 3, the ordinate is the cluster category value and the abscissa is the average vehicle delay in s; on category 0, one marker denotes the cluster center of the normal-delay category and another the upper limit of the normal delay value at the 80% quantile; on category 1, the markers likewise denote the cluster center of the abnormal-delay category and the upper limit of the abnormal delay value at the 80% quantile.
The reward and punishment function is:
in the formula: dt0For delays before the execution of an action, dtkIs a delay after the action is performed.
In order to prevent delay jumps and oscillation of the reward-punishment function caused by traffic fluctuation, a consecutive-same-action flag b is set: if the same action is taken twice in a row, b = 2; three times in a row, b = 3; and so on, b increasing by 1 for each further repetition; if the run of identical actions is interrupted, b = 1.
For different values of b and r_t(s, a), the feedback r_t(s, a) is adjusted according to dif, with the following rules:
dif = d_tk − d_t0
When b = 2 and r_t(s, a) = −1, the selected scheme has been chosen for the second time; since the action selection strategy of the invention adopts a greedy algorithm, the scheme is known to be a relatively good one, and the delay rise may be caused by traffic volatility. When the delay rise is small, i.e. dif < k, r_t(s, a) can be corrected to 0; when the rise is large, i.e. dif ≥ k, r_t(s, a) = −1 can be kept. The value of k can be set empirically.
When b > 2 and r_t(s, a) = −1, the selected scheme has been chosen three or more times and is known to be a good one; the delay rise may be caused by traffic volatility or by a change in the traffic environment. When dif < k, r_t(s, a) = −1 can be kept; when dif ≥ k, r_t(s, a) can be corrected to −b + 1, strengthening the feedback value of the environmental change.
r_t(s, a) = −2 when r_t(s, a) = −1 and b = 2;
when r_t(s, a) = −2 and b = 2, b is reset to 1. This prevents b from continuing to grow as soon as the same action recurs, which would otherwise drive the correction r_t(s, a) = −b + 1 to ever larger negative values, causing strong oscillation and non-convergence.
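The feedback-correction rules above can be sketched as a small function (a hedged illustration; the base reward r_t(s, a) and the flag b are assumed to be computed elsewhere, and the function names are not from the patent):

```python
def adjust_feedback(r, b, dif, k):
    """Correct the raw feedback r = r_t(s, a) using the
    consecutive-same-action flag b and the delay change
    dif = d_tk - d_t0, per the rules of step S3."""
    if r == -1 and b == 2:
        # second consecutive pick of a known-good scheme:
        # forgive a small delay rise, keep the penalty otherwise
        return 0 if dif < k else -1
    if r == -1 and b > 2:
        # three or more consecutive picks: a large delay rise
        # signals an environment change, so strengthen the penalty
        return -1 if dif < k else -b + 1
    return r

def update_flag(b, same_action):
    # b counts consecutive repeats of the same action and resets
    # to 1 whenever the run is interrupted
    return b + 1 if same_action else 1
```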
S4, updating the Q value table;
updating the Q value selects a Bellman optimal equation:
Q_{t+1}(s_t, a_t) = (1 − α_t) · Q_t(s_t, a_t) + α_t · (r_{t+1} + γ · max_{a_{t+1}} Q_t(s_{t+1}, a_{t+1}));
In the invention, two Q value tables are established: one records conventional traffic and the other records abnormal traffic; the other parameters can be set uniformly. α is the learning rate and γ is the discount factor: the larger the learning rate, the less of the previous training is retained; the larger the discount factor γ, the greater the weight given to future returns. α and γ can be determined according to the characteristics of the specific intersection.
The action selection strategy adopts a greedy algorithm, namely the ε-greedy exploration strategy: a self-increasing value ε and a randomly generated number r ∈ [0, 1] are compared to select the learning action. Selection rule: when r < ε, the action with the maximum Q value in the current state is selected; when r ≥ ε, an action is selected at random for execution. ε ∈ [ε1, ε2], with the self-increment rule: when the iteration number n ≤ N1, ε = ε1 + (ε2 − ε1)/N1 · n; when n > N1, ε = ε2.
According to the Q-learning rule, the Q value table is a (2l−1) × w matrix, and the Q values of the different actions in each state are updated iteratively according to the Bellman equation. The aim is to let the Q value of the optimal action in each state reach the maximum, so that the selection probability of the optimal action grows higher and higher while that of non-optimal actions grows lower and lower; after the Q value matrix finally converges, the optimal action can be selected with high probability in every state.
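The update and selection steps of S4 can be sketched as follows (a minimal illustration with a list-of-lists Q table; the two-table bookkeeping for normal vs. abnormal traffic is omitted for brevity):

```python
import random

def q_update(Q, s, a, r, s_next, alpha, gamma):
    # Bellman optimality update:
    # Q(s,a) <- (1-alpha)*Q(s,a) + alpha*(r + gamma*max_a' Q(s',a'))
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * (r + gamma * max(Q[s_next]))

def epsilon_greedy(Q, s, eps, rng=random):
    # per the patent's convention: draw r in [0,1); exploit the
    # best-known action when r < eps, otherwise explore at random
    if rng.random() < eps:
        return max(range(len(Q[s])), key=lambda a: Q[s][a])
    return rng.randrange(len(Q[s]))
```

Note that, unlike the common convention where ε is the exploration probability, the patent's rule makes ε the probability of exploitation, which is why ε self-increases toward its upper bound as training proceeds.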
Advantages: compared with the prior art, 1) based on the original fixed timing scheme for the intersection period, schemes are explored and selected within safe search regions above and below the baseline, realizing stability of control; 2) at the same time, relatively long-term slow or abnormal changes within the period are responded to promptly, reflecting flexibility of control. Stability and flexibility of signal timing optimization are thus both achieved, stably and flexibly improving the operation of the intersection.
Drawings
FIG. 1: schematic diagram of the timing scheme selection method of the invention;
FIG. 2: schematic diagram of scheme selection in the invention;
FIG. 3: clustering results of vehicle delays at the intersection;
FIG. 4: initial values of the Q value table;
FIG. 5: the Q value table after training convergence;
FIG. 6: delay comparison curve between the algorithm and fixed timing.
Detailed Description
A specific embodiment of the present invention will be described in detail with reference to the accompanying drawings.
A single-point signal timing scheme selection method based on Q-learning balances the stability and flexibility of signal timing optimization. Taking the original fixed timing scheme for the current period at an intersection as a baseline, it explores and selects schemes within safe search regions above and below that baseline, which keeps control stable; at the same time, it responds promptly to relatively long-term slow or abnormal changes within the period, which keeps control flexible. Through continuous training, a signal timing scheme matching the current traffic environment state is selected according to the finally obtained Q value table.
The method comprises the following steps:
The state space is defined as S = (C, F), where C represents the state set and F represents the state switch.
For ease of understanding, a three-phase intersection on an urban arterial is taken as an example. The fixed timing scheme running in a certain period at this intersection is taken as the baseline, and two schemes are expanded in each of the upward and downward directions, so the state set C contains 5 schemes, C = (P1, P2, P3, P4, P5), where P3 is the original baseline scheme, P1 and P2 are downward expansion schemes 1 and 2, and P4 and P5 are upward expansion schemes 4 and 5.
The 5 schemes in the state set are specified as follows: the phase durations of scheme P3 are fixed with cycle Cycle3; P1 is expansion scheme 1 with cycle Cycle1; P2 is expansion scheme 2 with cycle Cycle2; P4 is expansion scheme 4 with cycle Cycle4; P5 is expansion scheme 5 with cycle Cycle5. The concrete phase durations are given in Example 1 below.
In order to distinguish whether the traffic state is abnormal or not, the present embodiment sets a switching amount F for the normal and abnormal states.
Here y is the key flow ratio of the intersection; for the three-phase example, y = (q1 + q2 + q3)/s, where q1, q2, q3 are the flows of the critical traffic streams of phases 1, 2 and 3 respectively, and s is the saturation flow of the lanes, assumed here to be the same for all lanes. y_now is the current key flow ratio, and y_last is the historical key flow ratio for the same period, obtained from data analysis; if the relative increase over y_last exceeds e, the state is judged abnormal. e can be set according to the actual intersection conditions.
Since too large an action space slows the convergence of the algorithm, the actions are simplified to the selection of the 5 schemes. The action space is defined as a = (P1, P2, P3, P4, P5), where the actions P1–P5 are identical to the schemes P1–P5 in the state space. The action space in the abnormal state is set to be the same as that in the normal state, so the action space covers the timing scheme spaces of both states simultaneously.
The average vehicle delay at the intersection is selected as the evaluation index;
First, the upper limit values d of the delay variation ranges of the different categories within the period are obtained for the intersection through a clustering algorithm; as shown in Fig. 3, the normal delay upper limit is d = 44s and the abnormal delay upper limit is d = 66s;
the reward and punishment function is:
in the formula: dt0For delays before the execution of an action, dtkIs a delay after the action is performed.
In order to prevent delay jumps and oscillation of the reward-punishment function caused by traffic fluctuation, a consecutive-same-action flag b is set: if the same action is taken twice in a row, b = 2; three times in a row, b = 3; and so on, b increasing by 1 for each further repetition; if the run of identical actions is interrupted, b = 1;
For different values of b and r_t(s, a), the feedback r_t(s, a) is adjusted according to dif, with the following rules (here k is set to 10):
dif = d_tk − d_t0
When b = 2 and r_t(s, a) = −1, the selected scheme has been chosen for the second time; since the action selection strategy of the algorithm adopts a greedy algorithm, the scheme is known to be a relatively good one, and the delay rise may be caused by traffic volatility. When the delay rise is small, i.e. dif < 10, r_t(s, a) can be corrected to 0; when the rise is large, i.e. dif ≥ 10, r_t(s, a) = −1 can be kept.
When b > 2 and r_t(s, a) = −1, the selected scheme has been chosen three or more times and is known to be a good one; the delay rise may be caused by traffic volatility or by a change in the traffic environment. When dif < 10, r_t(s, a) = −1 can be kept; when dif ≥ 10, r_t(s, a) can be corrected to −b + 1, strengthening the feedback value of the environmental change.
r_t(s, a) = −2 when r_t(s, a) = −1 and b = 2;
when r_t(s, a) = −2 and b = 2, b is reset to 1. This prevents b from continuing to grow as soon as the same action recurs, which would otherwise drive the correction r_t(s, a) = −b + 1 to ever larger negative values, causing strong oscillation and non-convergence.
updating the Q value selects a Bellman optimal equation:
Q_{t+1}(s_t, a_t) = (1 − α_t) · Q_t(s_t, a_t) + α_t · (r_{t+1} + γ · max_{a_{t+1}} Q_t(s_{t+1}, a_{t+1}));
N1 is set to 500 and ε ∈ [0.7, 0.9], with the self-increment rule: when the iteration number n ≤ 500, ε = 0.7 + 0.2/500 · n; when n > 500, ε = 0.9.
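With these concrete values, the ε schedule can be sketched as a short helper (the function name is illustrative):

```python
def epsilon_schedule(n, n1=500, eps_lo=0.7, eps_hi=0.9):
    # epsilon grows linearly from eps_lo to eps_hi over the first
    # n1 iterations, then stays fixed at eps_hi
    if n <= n1:
        return eps_lo + (eps_hi - eps_lo) / n1 * n
    return eps_hi
```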
The Q values are expressed as a 5 × 5 matrix, as shown in Fig. 4. The row labels s1, s2, s3, s4, s5 represent the 5 states and the column labels a1, a2, a3, a4, a5 represent the 5 actions (for ease of understanding, the conventional reinforcement-learning letters s and a are used; the corresponding states and actions here are P1–P5).
The single point scheme selection method of the present invention is further illustrated below with reference to examples:
example 1: take an intersection in a city of a province as an example.
The 5 schemes in the state set are set as follows: the phase durations of scheme P3 are 54s, 34s and 44s with cycle 132s; P1 is expansion scheme 1 with phase durations 30s, 24s and 32s and cycle 86s; P2 is expansion scheme 2 with phase durations 43s, 29s and 40s and cycle 112s; P4 is expansion scheme 4 with phase durations 56s, 35s and 46s and cycle 137s; P5 is expansion scheme 5 with phase durations 59s, 36s and 47s and cycle 142s.
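The five schemes of Example 1 can be tabulated directly; note that each cycle equals the sum of its three phase durations, which a short check confirms:

```python
# phase durations (s) of the five schemes from Example 1
schemes = {
    "P1": [30, 24, 32],
    "P2": [43, 29, 40],
    "P3": [54, 34, 44],
    "P4": [56, 35, 46],
    "P5": [59, 36, 47],
}
# cycle of each scheme is the sum of its phase durations
cycles = {name: sum(phases) for name, phases in schemes.items()}
```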
The flow statistics and lane numbers in the period are shown in the following table:

| Approach | East | South | West | North |
|---|---|---|---|---|
| Flow | 300 | 800 | 300 | 400 |
| Number of lanes | 2 | 4 | 2 | 4 |
The initial values of the Q value table of the algorithm at the single-point intersection are set as shown in Fig. 4; running the algorithm updates the Q value table, and the learning result is likewise a 5 × 5 matrix, as shown in Fig. 5.
During iteration, the intersection scheme is switched continuously according to the learning rule, and Q value tables with different degrees of convergence can be obtained by setting different learning counts N. The larger the value of N, the better the convergence of the Q value table, but the more time is consumed. In the experiment N = 540; the total average delay D_i is accumulated over every 30 iterations, giving 540/30 = 18 statistics over the 540 iterations.
Here m is the iteration index (m ∈ [1, 540]), i is the statistic index (i ∈ [1, 18]), d_m is the vehicle delay of the m-th iteration, and D_i = Σ_{m=30(i−1)+1}^{30i} d_m is the total vehicle delay of the i-th block of 30 iterations.
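The D_i statistic can be computed with a short helper (a sketch; the per-iteration delay sequence d_m would come from the simulation):

```python
def block_delay_totals(delays, block=30):
    # D_i: total vehicle delay over the i-th block of `block`
    # consecutive iterations (18 blocks for 540 iterations)
    return [sum(delays[i:i + block]) for i in range(0, len(delays), block)]
```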
Fig. 6 compares the delay of the invention with the fixed timing scheme, with the statistic index i on the abscissa and the total average vehicle delay D_i on the ordinate.
Compared with other methods, after convergence the normal or abnormal Q value table can respond quickly to long-term slow changes of the traffic environment under the return function and selection strategy formulated here. Because the approximately optimal action keeps being selected after convergence, an action that no longer matches the traffic environment keeps being selected and receives ever-increasing punishment, so its selection probability drops rapidly until a new converged state is reached.
The invention balances the stability and flexibility of signal timing optimization: based on the original fixed timing scheme for the intersection period, schemes are explored and selected within safe search regions above and below the baseline, realizing stability of control; at the same time, relatively long-term slow or abnormal changes within the period are responded to promptly, reflecting flexibility of control. The aim of stably and flexibly improving the operation of the intersection is thus finally achieved.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (5)
1. A Q-learning-based single-point signal timing scheme selection method, characterized by comprising the following steps:
s1, state space definition;
defining the state space as S = (C, F), where C represents the state set and F represents the state switch;
simplifying the state set: a fixed timing scheme running in a certain period is selected as the baseline scheme Pl, and l−1 schemes are expanded in each of the upward and downward directions from it, the value of l chosen according to the actual application; the state set C contains 2l−1 schemes in total, C = (P1, P2, …, Pl, …, P(2l−1)), where Pl is the original baseline scheme, P1 is downward expansion scheme 1, P(l−1) is downward expansion scheme l−1, and P(2l−1) is upward expansion scheme 2l−1;
setting a switching value F to distinguish whether the traffic state is abnormal: F = 1 (abnormal state) when (y_now − y_last)/y_last > e, otherwise F = 0 (normal state);
where y is the key flow ratio of the intersection; for an intersection with j phases, y = Σ_{i=1}^{j} q_i/s_i, where q_i is the flow of the critical traffic stream of phase i and s_i is the saturation flow of its lane; y_now is the current key flow ratio, and y_last is the historical key flow ratio for the same period, obtained from historical data analysis; if the relative increase over y_last exceeds e, an abnormal state is judged, and e can be set according to actual intersection conditions;
s2, defining an action space;
a complete action space comprises all possible signal timing schemes of the intersection within a time step and is defined as: a = (a1, a2, …, am, …, aw);
wherein am is the m-th signal timing scheme in the action space; the cycle differs between schemes, and the phase durations within each scheme can be allocated and adjusted according to the flow ratios of the critical traffic streams of the phases;
s3, a reward function;
the return function is computed from the delay time, which is obtained directly from simulation software or computed in practical application;
firstly, analyzing and obtaining the upper limit value d of different types of delay variation ranges in the time interval of the intersection through a clustering algorithm;
next, define the reward and penalty function as:
in the formula: d_t0 is the delay before the action is executed, and d_tk is the delay after the action is executed;
setting a continuous same action flag b, and if two continuous same actions are carried out, setting b as 2; if the same action is performed three times continuously, b is 3; and so on, adding 1 to the value of b every time the same continuous action is added; if the continuous action is interrupted, b is 1;
for different values of b and r_t(s, a), the feedback r_t(s, a) is adjusted according to dif, with the following rules:
dif = d_tk − d_t0;
when b = 2 and r_t(s, a) = −1, the selected scheme has been chosen for the second time, and since the action selection strategy adopts a greedy algorithm, the scheme is known to be a relatively good one; when the delay rise is small, i.e. dif < k, r_t(s, a) is corrected to 0; when the rise is large, i.e. dif ≥ k, r_t(s, a) = −1 is kept, where k can be set empirically;
when b > 2 and r_t(s, a) = −1, the selected scheme has been chosen three or more times in succession and is known to be a relatively good one; when dif < k, r_t(s, a) = −1 is kept; when dif ≥ k, r_t(s, a) is corrected to −b + 1, strengthening the feedback value of the environmental change;
r_t(s, a) = −2 when r_t(s, a) = −1 and b = 2;
when r_t(s, a) = −2 and b = 2, b is reset to 1, preventing b from continuing to grow as soon as the same action recurs, which would otherwise drive the correction r_t(s, a) = −b + 1 to ever larger negative values, causing strong oscillation and non-convergence.
S4, updating the Q value table;
establishing two Q value tables, one recording conventional traffic and the other recording abnormal traffic, with the other parameters set uniformly; the Q value is updated using the Bellman optimality equation:
Q_{t+1}(s_t, a_t) = (1 − α_t) · Q_t(s_t, a_t) + α_t · (r_{t+1} + γ · max_{a_{t+1}} Q_t(s_{t+1}, a_{t+1}));
wherein α is the learning rate and γ is the discount factor, both determined according to the characteristics of the specific intersection;
the action selection strategy is the ε-greedy exploration strategy; according to the Q-learning rule, the Q value table is a (2l−1) × w matrix, and the Q values of the different actions in each state are updated iteratively according to the Bellman equation.
2. The timing scheme selection method according to claim 1, wherein the ε-greedy exploration strategy selects the learning action by comparing a self-increasing value ε with a randomly generated number r ∈ [0, 1], with the following selection rule:
when r < ε, the action with the maximum Q value in the current state is selected;
when r ≥ ε, an action in the current state is selected at random for execution;
ε ∈ [ε1, ε2], with the self-increment rule:
when the iteration number n ≤ N1, ε = ε1 + (ε2 − ε1)/N1 · n; when n > N1, ε = ε2.
3. The timing scheme selection method according to claim 2, wherein the actions are simplified to the selection of 5 schemes and the action space is defined as a = (P1, P2, P3, P4, P5), where the actions P1–P5 are identical to the schemes P1–P5 in the state space.
4. The timing scheme selection method of claim 3, wherein: the action space in the abnormal state and the action space in the normal state are set to be the same, and the action space covers the timing scheme space in the normal state and the abnormal state at the same time.
5. The timing scheme selection method of claim 4, wherein: the value of k is empirically set to 10.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110856591.2A CN113506450B (en) | 2021-07-28 | 2021-07-28 | Q-learning-based single-point signal timing scheme selection method
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110856591.2A CN113506450B (en) | 2021-07-28 | 2021-07-28 | Q-learning-based single-point signal timing scheme selection method
Publications (2)
Publication Number | Publication Date |
---|---|
CN113506450A true CN113506450A (en) | 2021-10-15 |
CN113506450B CN113506450B (en) | 2022-05-17 |
Family
ID=78014271
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110856591.2A Active CN113506450B (en) | 2021-07-28 | 2021-07-28 | Qspare-based single-point signal timing scheme selection method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113506450B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115078798A (en) * | 2022-07-26 | 2022-09-20 | 武汉格蓝若智能技术有限公司 | Current range switching method and current collecting device |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105654744A (en) * | 2016-03-10 | 2016-06-08 | 同济大学 | Improved traffic signal control method based on Q learning |
CN108335497A (en) * | 2018-02-08 | 2018-07-27 | 南京邮电大学 | A kind of traffic signals adaptive control system and method |
CN108510764A (en) * | 2018-04-24 | 2018-09-07 | 南京邮电大学 | A kind of adaptive phase difference coordinated control system of Multiple Intersections and method based on Q-learning |
US20180261085A1 (en) * | 2017-03-08 | 2018-09-13 | Fujitsu Limited | Adjustment of a learning rate of q-learning used to control traffic signals |
CN109035812A (en) * | 2018-09-05 | 2018-12-18 | 平安科技(深圳)有限公司 | Control method, device, computer equipment and the storage medium of traffic lights |
CN109215355A (en) * | 2018-08-09 | 2019-01-15 | 北京航空航天大学 | A kind of single-point intersection signal timing optimization method based on deep reinforcement learning |
CN111081035A (en) * | 2019-12-17 | 2020-04-28 | 扬州市鑫通智能信息技术有限公司 | Traffic signal control method based on Q learning |
CN111243271A (en) * | 2020-01-11 | 2020-06-05 | 多伦科技股份有限公司 | Single-point intersection signal control method based on deep cycle Q learning |
CN112700664A (en) * | 2020-12-19 | 2021-04-23 | 北京工业大学 | Traffic signal timing optimization method based on deep reinforcement learning |
CN112950963A (en) * | 2021-01-25 | 2021-06-11 | 武汉工程大学 | Self-adaptive signal control optimization method for main branch intersection of city |
CN112991750A (en) * | 2021-05-14 | 2021-06-18 | 苏州博宇鑫交通科技有限公司 | Local traffic optimization method based on reinforcement learning and generation type countermeasure network |
Non-Patent Citations (6)
Title |
---|
PÉTER PÁLOS: "Comparison of Q-Learning based Traffic Light Control Methods and Objective Functions", 《2020 INTERNATIONAL CONFERENCE ON SOFTWARE, TELECOMMUNICATIONS AND COMPUTER NETWORKS (SOFTCOM)》 * |
YING LIU: "Intelligent traffic light control using distributed multi-agent Q learning", 《2017 IEEE 20TH INTERNATIONAL CONFERENCE ON INTELLIGENT TRANSPORTATION SYSTEMS (ITSC)》 * |
ZHANG Lun et al.: "Multi-agent reinforcement learning control method for urban traffic signals based on a supervision mechanism", Traffic & Transportation * |
WANG Zhiqi et al.: "Research on single-point signal control based on the Q-learning algorithm", Logistics Engineering and Management * |
HU Yu et al.: "Coordinated control of single-intersection traffic signals based on Q-learning", Computer and Modernization * |
GUO Mengjie et al.: "Single-intersection signal control algorithm based on reinforcement learning", Electronic Measurement Technology * |
Also Published As
Publication number | Publication date |
---|---|
CN113506450B (en) | 2022-05-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109215355A (en) | A kind of single-point intersection signal timing optimization method based on deep reinforcement learning | |
CN113506450B (en) | Qspare-based single-point signal timing scheme selection method | |
CN112216129B (en) | Self-adaptive traffic signal control method based on multi-agent reinforcement learning | |
CN113094875B (en) | Method and device for calibrating microscopic traffic simulation system in urban expressway interweaving area | |
CN106910337A (en) | A kind of traffic flow forecasting method based on the firefly algorithm and an RBF neural network | |
CN113223305B (en) | Multi-intersection traffic light control method and system based on reinforcement learning and storage medium | |
JP2010134863A (en) | Control input determination means of control object | |
CN114170789B (en) | Intelligent network link lane change decision modeling method based on space-time diagram neural network | |
CN107506865A (en) | A kind of load forecasting method and system based on LSSVM optimizations | |
CN113780624A (en) | City road network signal coordination control method based on game equilibrium theory | |
CN106874555A (en) | A kind of Reed Muller logic circuits power consumption and area-optimized method | |
CN109858559B (en) | Self-adaptive traffic analysis road network simplification method based on traffic flow macroscopic basic graph | |
CN113657433B (en) | Multi-mode prediction method for vehicle track | |
Hu et al. | Lane-level navigation based eco-approach | |
CN111258314A (en) | Collaborative evolution-based decision-making emergence method for automatic driving vehicle | |
CN114186709A (en) | Energy prediction method for optimizing key parameters of gray model based on emperor butterfly algorithm | |
Meyer | Convergence control in ACO | |
CN109752952A (en) | Method and device for acquiring multi-dimensional random distribution and strengthening controller | |
WO2023178581A1 (en) | Quantum-walk-based multi-scale feature parsing method for flow of online hailed vehicles | |
CN113188243B (en) | Comprehensive prediction method and system for air conditioner energy consumption | |
CN113359449B (en) | Aeroengine double-parameter index degradation maintenance method based on reinforcement learning | |
CN111581887B (en) | Unmanned vehicle intelligent training method based on simulation learning in virtual environment | |
CN105677936A (en) | Self-adaptive recursive multi-step prediction method of demand torque for mechanical-electrical compound drive system | |
CN115146499A (en) | Cage radiator optimization design method based on GWO-SVM model | |
Jin et al. | A multi-objective multi-agent framework for traffic light control |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CP02 | Change in the address of a patent holder | ||
Address after: 311100 Room 108, Building 5, Pinggao Entrepreneurship City, Liangzhu Street, Yuhang District, Hangzhou City, Zhejiang Province
Patentee after: Zhejiang Haikang Zhilian Technology Co.,Ltd.
Address before: 314500 Room 116, 1/F, Building 2, No. 87 Hexi, Changfeng Street, Wuzhen Town, Tongxiang City, Jiaxing City, Zhejiang Province
Patentee before: Zhejiang Haikang Zhilian Technology Co.,Ltd. |