CN106097733B

CN106097733B - A traffic signal optimization control method based on policy iteration and clustering

Info

Publication number: CN106097733B
Application number: CN201610696748.9A
Authority: CN
Inventors: 王冬青; 张震; 董心壮; 丁军航; 宋婷婷
Original assignee: Qingdao University
Current assignee: Qingdao University
Priority date: 2016-08-22
Filing date: 2016-08-22
Publication date: 2018-12-07
Anticipated expiration: 2036-08-22
Also published as: CN106097733A

Abstract

The present invention proposes a traffic signal optimization control method based on policy iteration and clustering, in the field of intelligent optimization technology. The method comprises: step 1, selecting the control scheme and defining the traffic state, control action, immediate return and Q value; step 2, controlling the traffic lights by actuated control and recording, at each sampling instant, the traffic state, the control action and the number of vehicles leaving the stop line; step 3, preprocessing the traffic states and then performing k-means clustering; step 4, optimizing the policy in the intersection unit using policy iteration, and storing the optimized policy and the centroids obtained in step 3 in the traffic signal controller; step 5, replacing actuated control with the control policy obtained in step 4: at the start of each sampling period, the traffic signal controller receives the traffic state collected by the intersection unit, looks up the control policy using the discrete state of the corresponding centroid, obtains the control action and sends it to the intersection unit for execution.

Description

A traffic signal optimization control method based on policy iteration and clustering

Technical field

The present invention relates to the field of intelligent optimization technology.

Background art

Optimal control of traffic signals is an important part of urban traffic management and control systems. The quality of the signal control policy directly affects the transport efficiency of the whole road network and people's travel experience. Various intelligent optimization control methods have therefore been proposed and applied to the optimization of traffic signal control policies.

Dynamic programming is a method for solving optimal control policies and includes two methods, value iteration and policy iteration. Policy iteration samples the traffic states, phases and immediate returns observed under a policy and then uses the samples to further optimize the control policy, which makes it well suited to traffic signal optimization control. When policy iteration is applied to traffic signal control, continuous variables such as the vehicle queue length must be discretized. The traditional discretization method divides the entire state space uniformly, whereas the states that actually occur gather in only some regions of the state space. Using k-means clustering to partition the regions where states gather therefore gives higher discretization precision for the same number of discrete states, which improves the optimization result.

Summary of the invention

The purpose of the present invention is to discretize the traffic state using k-means clustering, so as to improve the optimization effect of policy iteration and better optimize the control policy of traffic lights at road intersections. The ultimate aim is to increase the number of vehicles passing through the intersection per unit time and to reduce the number of stops and the average delay caused by waiting at red lights.

The present invention first controls the intersection traffic signals using actuated control. At short, fixed time intervals, the intersection unit records the vehicle queue lengths of the current phase and the next phase, the number of vehicles leaving the stop line, and the control action of the traffic signal controller. After the intersection unit has collected enough samples, k-means clustering is applied to the vehicle queue lengths in the samples to obtain discrete traffic states. The policy is then optimized using policy iteration, and the optimized policy is stored in the traffic signal controller. From then on, at the same short intervals, the intersection unit sends the detected vehicle queue lengths of the current phase and the next phase to the traffic signal controller, which selects a suitable phase action from the queue lengths and the stored optimized policy for the intersection unit to execute.

The present invention proposes a traffic signal optimization control method based on policy iteration and clustering, comprising the following steps:

Step 1: select fixed phase sequence control as the signal control scheme to be optimized; define the traffic state as the vehicle queue lengths of the current phase and the next phase; define the control action as keeping the current phase or switching to the next phase; define the immediate return as a variable related to the number of vehicles leaving the stop line in a single sampling period; define a state-action pair as the data vector composed of a discrete traffic state and a control action; define the Q value of each state-action pair as the expected cumulative return obtained after taking the control action in the corresponding discrete traffic state; define the control policy as the control action that should be executed in each discrete traffic state;

Step 2: the intersection unit sets the control policy of the traffic signal controller to actuated control; the minimum green time and the maximum green time are set to positive integer multiples of the sampling period, and the unit green extension time equals the sampling period; the intersection unit samples and records the traffic state, the executed phase action and the number of vehicles leaving the stop line, the sampling method being: at each sampling instant, record the traffic state, the control action and the number of vehicles that left the stop line in the sampling period;

Step 3: after the intersection unit has collected the specified number of samples, discretize the traffic states in the samples as follows: first normalize the sampled traffic states and remove traffic states that lie within a preset threshold distance of a state already retained, then perform k-means clustering and number the resulting centroids, each centroid corresponding to one discrete traffic state; each normalized traffic state in the samples is then represented by the number of its nearest centroid, giving the corresponding discrete traffic state;

Step 4: the intersection unit optimizes the policy using policy iteration, and the optimized policy and the centroids obtained in step 3 are stored in the traffic signal controller;

Step 5: the intersection unit sets the control policy of the traffic signal controller to the control policy obtained in step 4 and sets the decision period equal to the sampling period; at each decision instant, the traffic signal controller receives the traffic state detected by the intersection unit, normalizes it, calculates the distance from the normalized traffic state to each centroid, finds the nearest centroid, looks up the control policy using the discrete traffic state corresponding to that centroid, and sends the resulting control action to the intersection unit for execution.

Compared with the prior art, the present invention has the following advantages:

Before policy iteration is used to optimize the traffic signal control policy, the traffic state must first be discretized, that is, the continuous state space formed by the vehicle queue lengths of two phases must be converted into a discrete state space. The precision of this discretization affects the optimization result of policy iteration. Within a given typical time period, the actual traffic states are not spread over the entire state space but are concentrated in some regions. The discrete traffic states obtained with the k-means clustering algorithm cover only the regions where actual traffic states are concentrated, whereas the conventional discretization method also covers regions in which no actual traffic state occurs. Compared with the traditional method, the k-means clustering algorithm therefore achieves higher discretization precision with the same number of discrete traffic states, which improves the optimization result of policy iteration.

Brief description of the drawings

Fig. 1 is a schematic diagram of traffic signal control at an urban road intersection.

Fig. 2 is a flow chart of the traffic signal optimization control method based on policy iteration and clustering.

Reference numerals: 1 to 24, the first to the twenty-fourth geomagnetic vehicle detectors, in order; 25, lane one; 26, lane two; 27, lane three; 28, lane four; 29, lane five; 30, lane six; 31, lane seven; 32, lane eight; 33, lane nine; 34, lane ten; 35, lane eleven; 36, lane twelve.

Detailed description of the embodiments

To make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings.

Each lane requires two geomagnetic vehicle detectors. One is placed at the stop line and counts the vehicles passing the stop line; the other is placed 120 meters upstream of the stop line and counts the vehicles passing the section 120 meters upstream. From these two detectors, the number of vehicles between the stop line and the section 120 meters upstream in that lane can be calculated at any time and converted into the vehicle queue length. As shown in Fig. 1, the detectors are assigned to lanes in pairs: geomagnetic vehicle detectors 1 and 2 detect the vehicle queue length of lane one 25; detectors 3 and 4, lane two 26; detectors 5 and 6, lane three 27; detectors 7 and 8, lane four 28; detectors 9 and 10, lane five 29; detectors 11 and 12, lane six 30; detectors 13 and 14, lane seven 31; detectors 15 and 16, lane eight 32; detectors 17 and 18, lane nine 33; detectors 19 and 20, lane ten 34; detectors 21 and 22, lane eleven 35; detectors 23 and 24, lane twelve 36.
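The counting logic described above can be sketched as follows. This is a minimal illustration, assuming each detector exposes a cumulative vehicle count and that the segment was empty when counting started. The function names and the 7.0 m average spacing used to convert a vehicle count into a length are hypothetical; the source does not state how the conversion is done.

```python
def vehicles_in_segment(upstream_count: int, stopline_count: int) -> int:
    """Vehicles currently between the upstream section and the stop line.

    Both arguments are assumed to be cumulative counts since a moment
    when the segment was empty (an assumption, not stated in the source).
    """
    return max(upstream_count - stopline_count, 0)


def queue_length_m(n_vehicles: int, spacing_m: float = 7.0,
                   segment_m: float = 120.0) -> float:
    """Convert a vehicle count into a queue length in meters.

    spacing_m is a hypothetical average space per queued vehicle; the
    result is capped at the 120 m detection segment.
    """
    return min(n_vehicles * spacing_m, segment_m)
```

For example, with 15 vehicles counted upstream and 9 counted at the stop line, 6 vehicles remain in the segment.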

The intersection unit receives the information sent by all 24 geomagnetic vehicle detectors, from the first geomagnetic vehicle detector 1 to the twenty-fourth geomagnetic vehicle detector 24, and forwards it to the traffic signal controller. Every 10 seconds, the traffic signal controller determines the control action from the received traffic state and the control policy set by the intersection unit.

The flow chart of the traffic signal optimization control method based on policy iteration and clustering shown in Fig. 2 comprises the following steps:

Step 1: select the signal control scheme and define the traffic state, control action, immediate return and Q value:

The signal control scheme to be optimized is a fixed phase sequence scheme. The scheme is introduced below for the case of four symmetric phases, but the present invention is not limited to four phases or to symmetric phases. Phase 1: vehicles on lane one 25 and lane four 28 may go straight and turn right, and vehicles on lane two 26 and lane five 29 may go straight. Phase 2: vehicles on lane three 27 and lane six 30 may turn left. Phase 3: vehicles on lane seven 31 and lane ten 34 may go straight and turn right, and vehicles on lane eight 32 and lane eleven 35 may go straight. Phase 4: vehicles on lane nine 33 and lane twelve 36 may turn left. At any moment the traffic signal is in exactly one of the four phases, and the phases are executed in order. Although the phase order is fixed, the green time of each phase need not be fixed. The control action is defined as keeping the current phase or switching to the next phase. If the current phase is phase 1, then after 10 seconds the traffic signal controller must decide on a control action: keep phase 1 or switch to phase 2. If phase 2 is selected, another control action is needed after 10 seconds: keep phase 2 or switch to phase 3. If phase 3 is selected, another control action is needed after 10 seconds: keep phase 3 or switch to phase 4. If phase 4 is selected, another control action is needed after 10 seconds: keep phase 4 or switch to phase 1. The cycle then repeats. The minimum green time of every phase is defined as 10 seconds and the maximum green time as 60 seconds.

The vehicle queue length of a phase is defined as the maximum of the vehicle queue lengths of all lanes of that phase. The vehicle queue length of phase 1 equals the maximum of the vehicle queue lengths of lane one 25, lane two 26, lane four 28 and lane five 29; that of phase 2 equals the maximum for lane three 27 and lane six 30; that of phase 3 equals the maximum for lane seven 31, lane eight 32, lane ten 34 and lane eleven 35; that of phase 4 equals the maximum for lane nine 33 and lane twelve 36.

The traffic state is defined as the vehicle queue lengths of the current phase and the next phase. For example, if the current phase is phase 1, the current traffic state is the data vector formed by two variables, the vehicle queue lengths of phase 1 and phase 2.
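The two definitions above can be illustrated with a short sketch. The lane-to-phase mapping uses the reference numerals of Fig. 1; the dictionary layout and function names are illustrative assumptions rather than part of the source.

```python
# Lanes served by each phase, keyed by the reference numerals of Fig. 1.
PHASE_LANES = {
    1: (25, 26, 28, 29),
    2: (27, 30),
    3: (31, 32, 34, 35),
    4: (33, 36),
}


def phase_queue_length(phase, lane_queue):
    """Queue length of a phase: the maximum over the lanes it serves."""
    return max(lane_queue[lane] for lane in PHASE_LANES[phase])


def traffic_state(current_phase, lane_queue):
    """Traffic state: queue lengths of the current and the next phase."""
    next_phase = current_phase % 4 + 1  # fixed order 1 -> 2 -> 3 -> 4 -> 1
    return (phase_queue_length(current_phase, lane_queue),
            phase_queue_length(next_phase, lane_queue))
```

Here `lane_queue` is a hypothetical mapping from lane numeral to the queue length measured for that lane.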

The start of a sampling period is defined as the instant at which the traffic signal controller decides on a control action; the sampling period has the same duration as the decision period, namely 10 seconds. The immediate return is defined as a quantity related to the number of vehicles leaving the stop line in a single sampling period, and represents the direct benefit obtained after taking a control action in a traffic state. A state-action pair is defined as the data vector composed of a discrete traffic state and a control action. The Q value of each state-action pair is defined as the expected cumulative return obtained after taking the control action in the corresponding discrete traffic state, that is, the expected sum of the immediate returns obtained over several sampling periods after the control action is taken; the Q value represents the long-term benefit of taking the control action in the discrete traffic state. The control policy is defined as the control action that should be taken for a given discrete traffic state.

The immediate return r is calculated as follows:

In the above formula, n_p denotes the number of vehicles passing the stop line in one sampling period; the constants 6.5, 4.5 and -1.0 in the formula serve to keep the immediate return r within [-1, 1]. The traffic signal controller calculates n_p from the traffic states sent by the intersection unit at two adjacent instants, and then calculates the immediate return r according to the above formula.

The Q value of a state-action pair is defined as follows:

Here s denotes a discrete traffic state; a denotes the control action executed in traffic state s; Q(s, a) denotes the Q value of the state-action pair s-a; E denotes expectation; r(s, a) denotes the immediate return obtained by executing control action a in state s; γ is the discount factor, a real number between 0 and 1; k denotes the k-th sampling period after traffic state s was encountered and control action a was executed, with k = 1 corresponding to one sampling period later; T indicates that sampling ends at the T-th sampling period after traffic state s was encountered, i.e. the cumulative return uses only the immediate returns of T sampling periods.

Step 2: sample the traffic state, the executed control action and the number of vehicles leaving the stop line.

Sampling is carried out for a period of time within a specified typical time period, such as the morning or evening peak. During the sampling phase, the intersection unit sets the control policy of the traffic signal controller to actuated control. The minimum green time and the maximum green time are set to positive integer multiples of the sampling period, and the unit green extension time equals the sampling period: the minimum green time of each phase is set to 10 seconds, the maximum green time to 60 seconds, and the unit green extension time to 10 seconds. Every second, the phase is decided as follows: when the green time of the current phase is less than 10 seconds, keep the current phase; when it is greater than or equal to 60 seconds, switch to the next phase; when it is between 10 and 60 seconds, extend the green time by 10 seconds if vehicles are arriving on the current phase, and otherwise switch directly to the next phase. Every 10 seconds, the intersection unit detects and stores the following information as a sample: the vehicle queue lengths of the current phase and the next phase, the executed control action, and the number of vehicles that left the stop line in the sampling period. The number of samples to be collected is set to 9000.
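The actuated control rule used during sampling can be sketched as a single decision function. This is a minimal illustration under the parameter values given above; the function and constant names are hypothetical, and holding the green for the full 10 s unit extension after a keep decision is left to the caller.

```python
KEEP, SWITCH = 0, 1


def actuated_action(green_elapsed_s, vehicle_arriving,
                    min_green_s=10, max_green_s=60):
    """Decide one actuated-control action for the current phase."""
    if green_elapsed_s < min_green_s:
        return KEEP      # minimum green time not yet served
    if green_elapsed_s >= max_green_s:
        return SWITCH    # maximum green time reached
    # Between the limits: extend only while vehicles keep arriving.
    return KEEP if vehicle_arriving else SWITCH
```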

Step 3: after the intersection unit has collected 9000 samples, discretize the traffic states in the samples. Each sample is arranged as a data vector (l, a, l', r), where l denotes the traffic state at a sampling instant, a denotes the control action executed when the traffic state is l, l' denotes the traffic state at the next sampling instant after l, and r denotes the immediate return obtained in the sampling period during which the traffic state moves from l to l'. The value of r can be calculated from the number of vehicles leaving the stop line in each sampling period, as recorded in the raw samples, using the formula for the immediate return r in step 1.

The traffic states in the samples are preprocessed: they are first normalized, and traffic states lying within a preset threshold distance of a state already retained are then removed. The Euclidean distance is used as the distance and the threshold is set to 0.1. First, one normalized traffic state is selected at random from the samples and added to an initially empty data set, called the traffic state data set. The remaining traffic states in the samples are then added to the data set according to the following rule: if the distance from a traffic state in the samples to every traffic state in the traffic state data set is greater than 0.1, the traffic state is added to the traffic state data set; otherwise it is not added.
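The preprocessing described above can be sketched with NumPy. The source does not specify the normalization, so min-max scaling with caller-supplied bounds `lo` and `hi` is an assumption here; visiting the remaining states in random order is also an assumption, and the function names are hypothetical.

```python
import numpy as np


def normalize(states, lo, hi):
    """Min-max scale raw states into [0, 1] (assumed normalization)."""
    return (np.asarray(states, dtype=float) - lo) / (hi - lo)


def thin_states(norm_states, threshold=0.1, seed=0):
    """Keep a state only if it is farther than `threshold` from all kept states."""
    norm_states = np.asarray(norm_states, dtype=float)
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(norm_states))
    kept = [norm_states[order[0]]]          # randomly chosen first state
    for idx in order[1:]:
        x = norm_states[idx]
        dists = np.linalg.norm(np.asarray(kept) - x, axis=1)  # Euclidean
        if np.all(dists > threshold):
            kept.append(x)
    return np.asarray(kept)
```

The result is the traffic state data set that is passed on to clustering.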

K-means clustering is applied to the traffic states in the traffic state data set. A cluster is defined as a set of similar traffic states, each cluster corresponding to one discrete traffic state, and the centroid of a cluster is defined as the mean of all traffic states the cluster contains. The number of centroids is set to 30, and clustering then proceeds as follows:

Step a: randomly select 30 different traffic states from the traffic state data set as the initial centroids;

Step b: calculate the distance from each traffic state to each centroid and assign each traffic state to its nearest centroid, forming 30 clusters;

Step c: recalculate the centroid of each cluster;

Step d: calculate the change in each centroid, i.e. the distance between the old centroid and the new centroid; if no centroid changes any more, k-means clustering ends; otherwise return to step b.
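Steps a to d can be sketched directly in NumPy. This is a minimal illustration, not a tuned implementation; the function name, the `max_iter` safeguard and the choice to leave the centroid of an empty cluster unchanged are assumptions that the source does not address.

```python
import numpy as np


def kmeans(points, k=30, seed=0, max_iter=1000):
    """Plain k-means following steps a to d; returns (centroids, labels)."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Step a: k different data points as initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.zeros(len(points), dtype=int)
    for _ in range(max_iter):
        # Step b: assign every point to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step c: recompute centroids (empty clusters keep the old one).
        new_centroids = centroids.copy()
        for j in range(k):
            members = points[labels == j]
            if len(members):
                new_centroids[j] = members.mean(axis=0)
        # Step d: stop once no centroid has moved.
        moved = np.linalg.norm(new_centroids - centroids, axis=1)
        centroids = new_centroids
        if np.allclose(moved, 0.0):
            break
    return centroids, labels
```

Like any k-means run, the result depends on the random initial centroids and may be a local optimum.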

After k-means clustering, l and l' in each sample (l, a, l', r) are each assigned to their nearest centroid, i.e. converted into the discrete traffic states s and s' respectively, and the sample is rearranged as the data vector (s, a, s', r).
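The conversion from (l, a, l', r) to (s, a, s', r) amounts to a nearest-centroid lookup. The sketch below assumes the states are already normalized and uses 0-based centroid indices instead of the numbering 1 to 30; the function names are illustrative.

```python
import numpy as np


def nearest_centroid(state, centroids):
    """Index of the centroid closest to `state` (Euclidean distance)."""
    diff = np.asarray(centroids, dtype=float) - np.asarray(state, dtype=float)
    return int(np.linalg.norm(diff, axis=1).argmin())


def discretize_samples(samples, centroids):
    """Map each (l, a, l_next, r) sample to (s, a, s_next, r)."""
    return [(nearest_centroid(l, centroids), a,
             nearest_centroid(l_next, centroids), r)
            for l, a, l_next, r in samples]
```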

Step 4: initialize an arbitrary traffic signal control policy in the intersection unit, then optimize the policy using policy iteration, and store the optimized policy and the centroids obtained in step 3 in the traffic signal controller;

In the traffic signal control optimization problem for an isolated intersection there are 30 discrete traffic states, each with two control actions: a₁ means keep the current phase and a₂ means switch to the next phase. The policy is optimized in the intersection unit using policy iteration, as follows:

Step a: set the iteration count to 1, initialize the Q values and the control policy, and calculate the state transition matrix and the immediate return matrix. The Q value of each state-action pair is initialized to zero and stored in matrix Q. The immediate return matrices R₁ and R₂ are estimated from the samples (s, a, s', r); R₁ and R₂ hold the expected immediate return obtained after executing control action a₁ and a₂ respectively. With i = 1, 2, ..., 30, j = 1, 2, ..., 30 and k = 1, 2, the matrices Q, R₁ and R₂ are defined as follows:

Here Q(s_i, a_k) denotes the Q value of the state-action pair s_i-a_k, and r(s_i, a_k, s_j) denotes the immediate return obtained when, in discrete traffic state s_i, control action a_k is executed and the system moves to discrete traffic state s_j. A control policy is initialized arbitrarily and stored in matrix Π, which is defined as follows:

Here π(s_i, a_k) denotes the probability of executing action a_k in discrete state s_i; the elements of each row of Π sum to 1. The state transition matrix P is estimated from the samples (s, a, s', r) and is defined as follows:

Here the matrix element p(s_j | s_i, a_k) is a conditional probability: the probability that, in discrete traffic state s_i and after executing control action a_k, the system is in discrete traffic state s_j at the next sampling instant. Using R₁, R₂ and the elements of P, the immediate return matrix R can be obtained; R is defined as follows:

Here r(s_i, a_k) denotes the expected immediate return obtained after executing control action a_k in discrete traffic state s_i, calculated as follows:
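One way to estimate P and R from the samples is by simple frequency counts, sketched below. The matrix layout is an assumption, since the defining formulas are not reproduced in this text: state-action pair (s_i, a_k) is stored at row `i * n_actions + k` with 0-based indices, P has one row per pair and one column per next state, and R is a vector with one entry per pair. With empirical estimates, the expectation of r(s_i, a_k, s_j) over p(s_j | s_i, a_k) reduces to the sample mean of the returns observed for that pair, which is what the code computes. Rows for pairs never observed are left at zero, another assumption.

```python
import numpy as np


def estimate_model(samples, n_states, n_actions=2):
    """Estimate P (pairs x states) and R (pairs,) from (s, a, s_next, r)."""
    n_pairs = n_states * n_actions
    counts = np.zeros((n_pairs, n_states))
    reward_sum = np.zeros(n_pairs)
    for s, a, s_next, r in samples:
        row = s * n_actions + a          # assumed state-major ordering
        counts[row, s_next] += 1
        reward_sum[row] += r
    visits = counts.sum(axis=1)
    seen = visits > 0
    P = np.zeros_like(counts)
    P[seen] = counts[seen] / visits[seen, None]   # transition frequencies
    R = np.zeros(n_pairs)
    R[seen] = reward_sum[seen] / visits[seen]     # mean immediate return
    return P, R
```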

Step b: update the Q values. Matrix Q is updated according to the following formula:

$Q = (I - \gamma P \Pi)^{-1} R$

Here I denotes the identity matrix, γ is the discount factor, set to 0.95, and $(\cdot)^{-1}$ denotes matrix inversion;
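The update of step b can be sketched in NumPy. The shapes are assumptions consistent with the statement that each row of Π sums to 1: with n states and two actions, P is 2n × n, Π is n × 2n, and Q and R are vectors of length 2n ordered state by state. The sketch solves the linear system instead of forming the inverse explicitly, which gives the same Q; the function names are hypothetical.

```python
import numpy as np


def policy_matrix(policy, n_actions=2):
    """Build Pi (n x n*n_actions) from a deterministic policy array."""
    n = len(policy)
    Pi = np.zeros((n, n * n_actions))
    for i, a in enumerate(policy):
        Pi[i, i * n_actions + a] = 1.0   # probability 1 on the chosen action
    return Pi


def evaluate_policy(P, Pi, R, gamma=0.95):
    """Solve (I - gamma * P @ Pi) Q = R for the Q vector."""
    A = np.eye(P.shape[0]) - gamma * P @ Pi
    return np.linalg.solve(A, R)
```

For a single state whose two actions both return to that state, with returns 1 and 0, a policy that always takes the first action and γ = 0.5, this yields Q = [2, 1].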

Step c: update the control policy according to the Q values. The elements of matrix Π are updated according to the following formula:

Step d: if the iteration count is 1, save matrix Π to a matrix Π' of the same dimensions, increase the iteration count by 1 and return to step b; otherwise, compute the 2-norm of the difference between matrix Π and matrix Π':

$D = \lVert \Pi - \Pi' \rVert_2$

If D equals 0, policy iteration ends; if D is not equal to 0, save matrix Π to matrix Π', increase the iteration count by 1 and return to step b.
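Steps a to d can be combined into one loop, sketched below. Several points are assumptions because the corresponding formulas are not reproduced in this text: the state-major ordering of state-action pairs, the shapes of P and Π, and in particular the greedy update in step c, which sets probability 1 on the action with the largest Q value in each state. The `max_iter` safeguard is also an addition.

```python
import numpy as np


def policy_iteration(P, R, n_states, n_actions=2, gamma=0.95, max_iter=100):
    """Return (Q, policy) for P of shape (pairs, states) and R of shape (pairs,)."""
    n_pairs = n_states * n_actions
    rows = np.arange(n_states)

    def build_pi(policy):
        Pi = np.zeros((n_states, n_pairs))
        Pi[rows, rows * n_actions + policy] = 1.0
        return Pi

    # Step a: arbitrary initial policy (always action 0 here).
    policy = np.zeros(n_states, dtype=int)
    Pi = build_pi(policy)
    Pi_prev = None
    Q = np.zeros(n_pairs)
    for _ in range(max_iter):
        # Step b: Q = (I - gamma * P * Pi)^-1 * R, via a linear solve.
        Q = np.linalg.solve(np.eye(n_pairs) - gamma * P @ Pi, R)
        # Step c: greedy policy update (assumed form of the update).
        policy = Q.reshape(n_states, n_actions).argmax(axis=1)
        Pi = build_pi(policy)
        # Step d: stop when the policy matrix no longer changes.
        if Pi_prev is not None and np.linalg.norm(Pi - Pi_prev, 2) == 0:
            break
        Pi_prev = Pi
    return Q, policy
```

On termination, Q is the evaluation of the final policy, since the last update left the policy unchanged.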

After policy iteration ends, the resulting control policy is held in matrix Q; matrix Q and the centroids obtained in step 3 are stored in the traffic signal controller;

Step 5: the intersection unit sets the control policy of the traffic signal controller to the control policy obtained in step 4. Every 10 seconds, the traffic signal controller receives the traffic state detected by the intersection unit, normalizes it, calculates the distance from the normalized traffic state to each centroid, and finds the number of the nearest centroid, i.e. the index i of the discrete traffic state s_i. It then selects the control action a^* according to the following formula:

The traffic signal controller sends the control action a^* to the intersection unit for execution: if the value of a^* is a₁, the current phase is kept; if the value of a^* is a₂, the signal switches to the next phase.
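The online decision of step 5 can be sketched as one function. The selection rule is not reproduced in this text; the sketch assumes that a^* is the action with the largest Q value in the row of the nearest centroid, which is consistent with the policy being stored in matrix Q. Min-max normalization with bounds `lo` and `hi`, the flat state-major layout of Q and all names are likewise assumptions.

```python
import numpy as np

KEEP, SWITCH = 0, 1  # a1 = keep current phase, a2 = switch to next phase


def select_action(raw_state, lo, hi, centroids, Q, n_actions=2):
    """Normalize the state, find the nearest centroid, pick the best action."""
    x = (np.asarray(raw_state, dtype=float) - lo) / (hi - lo)
    dists = np.linalg.norm(np.asarray(centroids, dtype=float) - x, axis=1)
    i = int(dists.argmin())                       # discrete state index
    q_row = np.asarray(Q, dtype=float).reshape(-1, n_actions)[i]
    return int(q_row.argmax())                    # assumed greedy selection
```

The controller would call this once per 10 s decision period with the queue lengths of the current and the next phase.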

Claims

1. A traffic signal optimization control method based on policy iteration and clustering, characterized in that it comprises the following steps:

Step 1: select fixed phase sequence control as the signal control scheme to be optimized; define the traffic state as the vehicle queue lengths of the current phase and the next phase; define the control action as keeping the current phase or switching to the next phase; define the immediate return as a variable related to the number of vehicles leaving the stop line in a single sampling period; define a state-action pair as the data vector composed of a discrete traffic state and a control action; define the Q value of each state-action pair as the expected cumulative return obtained after taking the control action in the corresponding discrete traffic state; define the control policy as the control action that should be executed in each discrete traffic state;

Step 2: the intersection unit sets the control policy of the traffic signal controller to actuated control; the minimum green time and the maximum green time are set to positive integer multiples of the sampling period, and the unit green extension time equals the sampling period; the intersection unit samples and records the traffic state, the executed phase action and the number of vehicles leaving the stop line, the sampling method being: at each sampling instant, record the traffic state, the control action and the number of vehicles that left the stop line in the sampling period;

Step 3: after the intersection unit has collected the specified number of samples, discretize the traffic states in the samples as follows: first normalize the sampled traffic states and remove traffic states that lie within a preset threshold distance of a state already retained, then perform k-means clustering and number the resulting centroids, each centroid corresponding to one discrete traffic state; each normalized traffic state in the samples is then represented by the number of its nearest centroid, giving the corresponding discrete traffic state;

Step 4: the intersection unit optimizes the policy using policy iteration, and the optimized policy and the centroids obtained in step 3 are stored in the traffic signal controller;

Step 5: the intersection unit sets the control policy of the traffic signal controller to the control policy obtained in step 4 and sets the decision period equal to the sampling period; at each decision instant, the traffic signal controller receives the traffic state detected by the intersection unit, normalizes it, calculates the distance from the normalized traffic state to each centroid, finds the nearest centroid, looks up the control policy using the discrete traffic state corresponding to that centroid, and sends the resulting control action to the intersection unit for execution,

wherein the policy iteration method used comprises the following steps:

Step a: set the iteration count to 1, initialize the Q values and the control policy, and calculate the state transition matrix and the immediate return matrix; the Q value of each state-action pair is initialized to zero and stored in matrix Q; the immediate return matrices R₁ and R₂ are estimated from the samples (s, a, s', r), where s denotes the discrete traffic state at a sampling instant, a denotes the control action executed when the discrete traffic state is s, there being two control actions in total, control action a₁ keeping the current phase and control action a₂ switching to the next phase, s' denotes the discrete traffic state at the next sampling instant after s, and r denotes the immediate return obtained in the sampling period during which the discrete traffic state moves from s to s', calculated as follows:

Here n_p denotes the number of vehicles passing the stop line in one sampling period; R₁ and R₂ hold the expected immediate return obtained after executing control action a₁ and a₂ respectively; Q, R₁ and R₂ are defined as follows:

Here n denotes the number of centroids used for clustering in step 3, Q(s_i, a_k) denotes the Q value of the state-action pair s_i-a_k, and r(s_i, a_k, s_j) denotes the immediate return obtained when, in discrete traffic state s_i, control action a_k is executed and the system moves to discrete traffic state s_j; i and j each take integer values in [1, n] and k takes the integer values 1 and 2; a control policy is initialized arbitrarily and stored in matrix Π, which is defined as follows:

Here π(s_i, a_k) denotes the probability of executing action a_k in discrete state s_i, and the elements of each row of Π sum to 1; the state transition matrix P is estimated from the samples (s, a, s', r) and is defined as follows:

Here the matrix element p(s_j | s_i, a_k) is a conditional probability: the probability that, in discrete traffic state s_i and after executing control action a_k, the system is in discrete traffic state s_j at the next sampling instant; using R₁, R₂ and the elements of P, the immediate return matrix R can be obtained, R being defined as follows:

Here r(s_i, a_k) denotes the expected immediate return obtained after executing control action a_k in discrete traffic state s_i, calculated as follows:

Step b: update the Q values; matrix Q is updated according to the following formula:

$Q = (I - \gamma P \Pi)^{-1} R$

Step d: if the iteration count is 1, save matrix Π to a matrix Π' of the same dimensions, increase the iteration count by 1 and return to step b; otherwise, compute the 2-norm of the difference between matrix Π and matrix Π':

$D = \lVert \Pi - \Pi' \rVert_2$

If D equals 0, policy iteration ends; if D is not equal to 0, save matrix Π to matrix Π', increase the iteration count by 1 and return to step b.