CN105279978A - Intersection traffic signal control method and device - Google Patents

Intersection traffic signal control method and device

Info

Publication number
CN105279978A
Authority
CN
China
Prior art keywords
network
training
critic
weights
action
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510665966.1A
Other languages
Chinese (zh)
Other versions
CN105279978B (en)
Inventor
王飞跃
刘裕良
段艳杰
吕宜生
朱凤华
苟超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qingdao Huicheng Intelligent Technology Co Ltd
Qingdao Intelligent Industry Institute For Research And Technology
Original Assignee
Qingdao Huicheng Intelligent Technology Co Ltd
Qingdao Intelligent Industry Institute For Research And Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qingdao Huicheng Intelligent Technology Co Ltd and Qingdao Intelligent Industry Institute For Research And Technology
Priority to CN201510665966.1A priority Critical patent/CN105279978B/en
Publication of CN105279978A publication Critical patent/CN105279978A/en
Application granted granted Critical
Publication of CN105279978B publication Critical patent/CN105279978B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Traffic Control Systems (AREA)

Abstract

The invention relates to an intersection traffic signal control method and device. The method learns from environmental feedback according to the traffic state, so as to achieve adaptive control of traffic signals. The method comprises: defining system parameters; setting up an Action network and a Critic network; initializing the controller; obtaining the corresponding system control parameters according to the system state; obtaining a performance index from the state and the control action; alternately training the Critic network and the Action network; recording the network weights after the training goal is reached; and using the trained Critic network and Action network for online control. Based on the ADHDP method, the method and the device provide an effective approach to adaptive control of intersection traffic signals.

Description

Intersection traffic signal control method and equipment
Technical field
The present invention relates to the field of urban traffic signal control, and in particular to an intersection traffic signal control method and equipment.
Background technology
With the rapid growth of China's economy and the acceleration of urbanization, a large population is pouring into cities. The construction and improvement of transportation facilities cannot keep up with people's growing travel demand, and traffic congestion is becoming increasingly prominent.
The causes of traffic congestion are manifold. Beyond inadequate transportation facilities, unreasonable traffic planning, and weak public awareness of traffic rules, a very important factor is that existing urban traffic signal control systems do not realize their full potential. Because of the particular nature of urban traffic problems, it is difficult to establish an accurate mathematical model, and simple fixed-time or actuated control methods cannot adapt to increasingly complex traffic conditions.
Adaptive dynamic programming (ADP) combines dynamic programming, reinforcement learning, and function approximation. Using online or offline data, it estimates the system's performance index function with a function-approximation structure and then derives a near-optimal control law from the principle of optimality. Action-dependent heuristic dynamic programming (ADHDP) is a typical ADP method; because it is model-free and adaptive, it can meet the control requirements of traffic systems whose parameters change frequently, which demand real-time performance, and for which accurate models are hard to establish.
Summary of the invention
One aspect of the present invention provides an ADHDP controller offline training method for intersection traffic signal control, the ADHDP controller comprising an Action network and a Critic network. The method comprises: in step S1, defining the system state, the reward function, the splits, and the system control parameters; in step S2, establishing the Action network and the Critic network, wherein the Action network is a BP neural network with one hidden layer whose input layer has P neurons, output layer has P-1 neurons, and hidden layer has M_a neurons (M_a is an empirical value), and the Critic network is a BP neural network with one hidden layer whose input layer has 2P-1 neurons, output layer has 1 neuron, and hidden layer has M_c neurons (M_c is an empirical value); in step S3, initializing the ADHDP controller, including initializing the Action network weights and the Critic network weights; in step S4, before each control cycle ends, obtaining the system state, feeding it to the Action network, outputting the corresponding system control parameters u(k), and passing u(k) to the simulation software to guide the operation of the next cycle; in step S5, feeding the system state S(k) and the system control parameters u(k) to the Critic network to output the performance index J(k); in step S6, alternately training the Critic network according to the performance index and the reward function and training the Action network according to the performance index, so as to update the weights of the Critic network and of the Action network; and in step S7, judging whether the expected target is reached: when the expected target is reached, in step S8 the offline training ends and the final weights of the Action network and of the Critic network are recorded; otherwise, the method returns to step S6 and training continues.
Another aspect of the present invention provides a method for online control of intersection traffic signals using an ADHDP controller trained by the above method, comprising: initializing the Action network and the Critic network with the final weights of the Action network and of the Critic network, respectively; inputting the real-time traffic data of the online system to the ADHDP controller; and, according to the definitions in step S1, obtaining the system state from the real-time traffic data of the online system, feeding the system state to the Action network, and using the output of the Action network as the system control parameters to control the intersection traffic signals.
Another aspect of the present invention provides ADHDP controller offline training equipment for intersection traffic signal control, the ADHDP controller comprising an Action network and a Critic network, the equipment comprising: a first device, which defines the system state, the reward function, the splits, and the system control parameters; a second device, which establishes the Action network and the Critic network, wherein the Action network is a BP neural network with one hidden layer whose input layer has P neurons, output layer has P-1 neurons, and hidden layer has M_a neurons (an empirical value), and the Critic network is a BP neural network with one hidden layer whose input layer has 2P-1 neurons, output layer has 1 neuron, and hidden layer has M_c neurons (an empirical value); a third device, which initializes the ADHDP controller, including initializing the Action network weights and the Critic network weights; a fourth device, which, before each control cycle ends, obtains the system state, feeds it to the Action network, outputs the corresponding system control parameters u(k), and passes u(k) to the simulation software to guide the operation of the next cycle; a fifth device, which feeds the system state S(k) and the system control parameters u(k) to the Critic network and outputs the performance index J(k); a sixth device, which alternately trains the Critic network according to the performance index and the reward function and trains the Action network according to the performance index, so as to update the weights of the Critic network and of the Action network; and a seventh device, which judges whether the expected target is reached: when the expected target is reached, the offline training ends and the final weights of the Action network and of the Critic network are recorded; otherwise, the sixth device continues training.
Another aspect of the present invention provides equipment for online control of intersection traffic signals using an ADHDP controller trained by the above equipment, comprising: an eighth device, which initializes the Action network and the Critic network with the final weights of the Action network and of the Critic network, respectively; a ninth device, which inputs the real-time traffic data of the online system to the ADHDP controller; and a tenth device, which, according to the definitions in the first device, obtains the system state from the real-time traffic data of the online system, feeds the system state to the Action network, and uses the output of the Action network as the system control parameters to control the intersection traffic signals.
The present invention effectively overcomes the deficiencies of the prior art. The intersection traffic signal control method of the present invention has online learning capability: in complex practical environments, such as when the traffic volume changes or the proportion of non-motorized traffic is large, it learns from environmental feedback, computes the timing parameters of the intersection, and achieves effective control of intersections with variable traffic flow. The method requires no traffic model; according to the traffic state, it imitates the way the human brain learns from environmental feedback, thereby achieving adaptive control of traffic signals.
Brief description of the drawings
Fig. 1 schematically illustrates a flow chart of the offline training method of the present invention.
Fig. 2 schematically illustrates the ADHDP structure and its training.
Fig. 3 schematically illustrates the structures of the Action network and the Critic network.
Embodiment
The technical scheme of the present invention is described in further detail below in conjunction with the drawings and embodiments. The following embodiments are implemented on the premise of the technical solution of the present invention and give detailed implementations and processes, but the protection scope of the present invention is not limited to the following embodiments.
Embodiments of the invention are described with reference to Fig. 1 and Fig. 2. Fig. 1 schematically illustrates a flow chart of the ADHDP controller offline training method of the present invention. Fig. 2 schematically illustrates the ADHDP structure and its training. In the following, an intersection with two phases is taken as an example.
As shown in Fig. 1, the method starts at step S0.
In step S1, the system state, the reward function, the splits, and the system control parameters are defined.
The system state is defined as follows. Suppose each control cycle has P phases, the duration of phase i is T_i, and L_i lanes obtain the right of way during phase i. The maximum queue length of each lane is h_j, the phase queue length is H_i = max{h_j}, and the phase average queue length is $\bar{H}_i$. The flow of each lane is q_j and the phase flow is Q_i = max{q_j}. The phase saturation degree is defined as s_i, where 1 <= i <= P, 1 <= j <= L_i, and ε is a normalization constant.
The system state is defined as S(k) = {s_i(k)}, 1 <= i <= P, where k is the simulation step number and the step length is the time span C_k of the k-th control cycle. The cycle length can be determined from historical traffic data by the Webster method and usually takes a value between 30 and 120 seconds.
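For reference, a minimal statement of the classical Webster optimum cycle length (standard traffic engineering background, not reproduced in the patent text) is

$C_0 = \frac{1.5L + 5}{1 - Y}$

where $L$ is the total lost time per cycle and $Y$ is the sum of the critical flow ratios of the phases; a cycle length obtained this way would then typically be kept within the 30 to 120 second range mentioned above.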
The reward function is defined as r(k), where N = P-1 and P >= 2.
The splits are defined as a_i, where 1 <= i <= P-1. The split of the last phase is $a_P = 1 - \sum_{i=1}^{P-1} a_i$.
The system control parameters are u(k) = {a_i(k)}, 1 <= i <= P.
In the two-phase example, the system state is S(k) = {s_i(k)}, where i = 1, 2. The split of the first phase is a_1, so the split of the second phase is a_2 = 1 - a_1.
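To make the state construction concrete, the following is a minimal Python sketch for the two-phase example. The per-phase quantities Q_i, H_i and the average queue length follow the definitions above; because the patent's exact saturation formula (with the normalization constant ε) is not reproduced in this text, `phase_saturation` is a hypothetical stand-in, and the helper names and sample numbers are illustrative only.

```python
# Minimal sketch of building S(k) for the two-phase example; `phase_saturation`
# is a hypothetical stand-in, NOT the patented saturation formula.
from typing import List


def phase_saturation(phase_flow: float, phase_avg_queue: float, eps: float = 100.0) -> float:
    # Assumed normalization of phase flow and average queue length by the constant eps.
    return (phase_flow * phase_avg_queue) / eps


def system_state(lane_flows: List[List[float]],
                 lane_queues: List[List[float]],
                 eps: float = 100.0) -> List[float]:
    """Build S(k) = {s_i(k)} for the P phases from per-lane flows q_j and queue lengths h_j."""
    state = []
    for q_lanes, h_lanes in zip(lane_flows, lane_queues):
        Q_i = max(q_lanes)                    # phase flow          Q_i = max{q_j}
        H_i = max(h_lanes)                    # phase queue length  H_i = max{h_j} (not used below)
        H_bar = sum(h_lanes) / len(h_lanes)   # phase average queue length (assumed lane average)
        state.append(phase_saturation(Q_i, H_bar, eps))
    return state


# Two phases, two lanes with right of way in each phase (illustrative numbers).
S_k = system_state(lane_flows=[[12.0, 8.0], [15.0, 10.0]],
                   lane_queues=[[6.0, 4.0], [9.0, 5.0]])
```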
In step S2, the Action network and the Critic network are established. As shown in Fig. 3, the Action network is a BP neural network with one hidden layer, whose input layer has P neurons, output layer has P-1 neurons, and hidden layer has M_a neurons; M_a is an empirical value, usually between 5 and 20. The Critic network is a BP neural network with one hidden layer, whose input layer has 2P-1 neurons, output layer has 1 neuron, and hidden layer has M_c neurons; M_c is an empirical value, usually between 5 and 20.
In the two-phase example, the Action network is a BP neural network with one hidden layer, whose input layer has 2 neurons, output layer has 1 neuron (P - 1 = 1), and hidden layer has 8 neurons. The Critic network is a BP neural network with one hidden layer, whose input layer has 3 neurons, output layer has 1 neuron, and hidden layer has 8 neurons.
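The two-phase network structures can be sketched as below. This is a minimal NumPy illustration, not the patented implementation: both networks are one-hidden-layer BP networks with sigmoid activations (β = 1), the Action network maps the 2-dimensional state to the single split a_1, and the Critic network maps the 3-dimensional state-plus-control vector to the scalar J(k). The class and variable names are assumptions.

```python
import numpy as np


def sigmoid(x, beta=1.0):
    # Sigmoid activation; beta = 1 as in the embodiment.
    return 1.0 / (1.0 + np.exp(-beta * x))


class BPNetwork:
    """One-hidden-layer BP network, used here for both the Action and the Critic network."""

    def __init__(self, n_in, n_hidden, n_out, rng):
        # Initial weights are random numbers between 0 and 1, as in the two-phase example.
        self.w1 = rng.random((n_hidden, n_in))
        self.w2 = rng.random((n_out, n_hidden))

    def forward(self, x):
        self.x = np.asarray(x, dtype=float)
        self.h = sigmoid(self.w1 @ self.x)   # hidden layer
        self.y = sigmoid(self.w2 @ self.h)   # output layer
        return self.y


rng = np.random.default_rng(0)
# Two-phase example: Action net 2 -> 8 -> 1 (outputs a_1), Critic net 3 -> 8 -> 1 (outputs J(k)).
action_net = BPNetwork(n_in=2, n_hidden=8, n_out=1, rng=rng)
critic_net = BPNetwork(n_in=3, n_hidden=8, n_out=1, rng=rng)
```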
In step S3, the controller is initialized, including initializing the Action network weights and the Critic network weights. The learning rate of the Action network can be set to l_a, generally a constant between 0 and 1, and the number of training iterations per step is set to N_a, an empirical value usually between 5 and 50. The learning rate of the Critic network can be set to l_c, generally a constant between 0 and 1, and the number of training iterations per step is set to N_c, an empirical value usually between 5 and 50. Both the Action network and the Critic network can use the Sigmoid function as the activation function, with β usually set to 1.
In the two-phase example, the initial Action network weights are random numbers between 0 and 1, the learning rate is 0.3, and the number of training iterations per step is 5. The initial Critic network weights are random numbers between 0 and 1, the learning rate is 0.1, and the number of training iterations per step is 5.
In step S4, before each control cycle ends, the system state is obtained and fed to the Action network, which outputs the corresponding system control parameters u(k). For example, the flow q_j and queue length h_j data of each lane at the intersection, collected by the simulation software, are received; the system state S(k) is computed from them and used as the input of the Action network to obtain the corresponding output u(k); u(k) is then passed to the simulation software to guide the operation of the next cycle. In this embodiment, the Paramics simulation software is connected with the controller, and the controller and the simulation software exchange information through a shared file, as sketched below.
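A possible per-cycle exchange with the simulator is sketched here. The patent only states that the controller and the simulation software exchange information through a shared file; the CSV layout, file paths, and helper names (`read_traffic_data`, `write_control`) are assumptions for illustration, and `system_state` and `action_net` are reused from the sketches above.

```python
import csv


def read_traffic_data(path="shared/traffic_out.csv"):
    """Read per-lane flows q_j and queue lengths h_j written by the simulator
    (assumed layout: one row per phase, first half q_j values, second half h_j values)."""
    lane_flows, lane_queues = [], []
    with open(path, newline="") as f:
        for row in csv.reader(f):
            vals = [float(v) for v in row]
            half = len(vals) // 2
            lane_flows.append(vals[:half])
            lane_queues.append(vals[half:])
    return lane_flows, lane_queues


def write_control(u_k, path="shared/control_in.csv"):
    """Write the splits u(k) for the simulator to apply in the next cycle."""
    with open(path, "w", newline="") as f:
        csv.writer(f).writerow(u_k)


# One control step before the current cycle ends (two-phase example):
flows, queues = read_traffic_data()
S_k = system_state(flows, queues)        # state per the step-S1 definitions
a1 = float(action_net.forward(S_k)[0])   # Action network output: split of phase 1
write_control([a1, 1.0 - a1])            # a_2 = 1 - a_1
```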
In step S5, the system state S(k) and the system control parameters u(k) are fed to the Critic network, which outputs the performance index J(k).
In step S6, the Critic network and the Action network are trained alternately, as follows.
The training error of the Critic network is defined as
$E_c(k) = \frac{1}{2}\left[\alpha J(k) - J(k-1) + r(k)\right]^2$
where α usually takes a value between 0 and 1; α = 0.2 in the two-phase example.
The weights of the Critic network are updated as
$w_c(k+1) = w_c(k) + \Delta w_c(k)$
$\Delta w_c(k) = -\frac{\partial E_c(k)}{\partial w_c(k)} = -\frac{\partial E_c(k)}{\partial J(k)}\,\frac{\partial J(k)}{\partial w_c(k)}$
The training error of the Action network is defined as
$E_a(k) = \frac{1}{2}\left[J(k) - G_c(k)\right]^2$
where G_c(k) is the control target; G_c(k) = 0 in the two-phase example.
The weights of the Action network are updated as
$w_a(k+1) = w_a(k) + \Delta w_a(k)$
$\Delta w_a(k) = -\frac{\partial E_a(k)}{\partial w_a(k)} = -\frac{\partial E_a(k)}{\partial J(k)}\,\frac{\partial J(k)}{\partial u(k)}\,\frac{\partial u(k)}{\partial w_a(k)}$
The alternating training procedure is as follows: the system state, computed from traffic data such as the flow q_j and queue length h_j of each lane at the intersection, is fed to the Action network to obtain the system control parameters u(k); the system state and u(k) are then fed to the Critic (evaluation) network to obtain the performance index. The training error of the Critic network is calculated from the performance index and the reward function, and the Critic network weights are updated from this error. The training error of the Action network is calculated from the performance index, and the Action network weights are updated from this error. This cycle repeats until the expected target is reached. A sketch of one such training step is given below.
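The following is a minimal sketch of one alternating training step, reusing the `BPNetwork` objects defined above. It assumes the learning rates l_c = 0.1 and l_a = 0.3 and the per-step training counts N_c = N_a = 5 of the two-phase example, applies the learning rates to the negative gradients of E_c and E_a as defined above, and obtains ∂J/∂u(k) by backpropagating through the Critic network; it is an illustration of the update rules, not the patented code.

```python
import numpy as np


def backprop(net, dL_dy):
    """Gradients of a scalar loss through a one-hidden-layer sigmoid BPNetwork.
    Uses the activations cached by net.forward(); returns (dL/dw1, dL/dw2, dL/dx)."""
    dy = dL_dy * net.y * (1.0 - net.y)              # through the output sigmoid
    dW2 = np.outer(dy, net.h)
    dh = (net.w2.T @ dy) * net.h * (1.0 - net.h)    # through the hidden sigmoid
    dW1 = np.outer(dh, net.x)
    dx = net.w1.T @ dh
    return dW1, dW2, dx


def adhdp_train_step(S_k, J_prev, r_k, alpha=0.2, l_c=0.1, l_a=0.3, N_c=5, N_a=5, G_c=0.0):
    """One alternating step: train the Critic N_c times, then the Action network N_a times."""
    S_k = np.asarray(S_k, dtype=float)
    for _ in range(N_c):                             # Critic: E_c = 0.5*(alpha*J(k) - J(k-1) + r(k))^2
        u_k = action_net.forward(S_k)
        J_k = critic_net.forward(np.concatenate([S_k, u_k]))[0]
        e_c = alpha * J_k - J_prev + r_k
        dW1, dW2, _ = backprop(critic_net, np.array([alpha * e_c]))   # dE_c/dJ = alpha*e_c
        critic_net.w1 -= l_c * dW1
        critic_net.w2 -= l_c * dW2
    for _ in range(N_a):                             # Action: E_a = 0.5*(J(k) - G_c)^2
        u_k = action_net.forward(S_k)
        J_k = critic_net.forward(np.concatenate([S_k, u_k]))[0]
        e_a = J_k - G_c
        _, _, dJ_dz = backprop(critic_net, np.array([1.0]))           # dJ w.r.t. the Critic input
        dJ_du = dJ_dz[len(S_k):]                                      # part w.r.t. the control u(k)
        dW1, dW2, _ = backprop(action_net, e_a * dJ_du)               # dE_a/du = e_a * dJ/du
        action_net.w1 -= l_a * dW1
        action_net.w2 -= l_a * dW2
    return J_k, e_a, e_c
```

An offline loop over steps S4 to S7 would call this once per control cycle and stop once |e_a| < 0.05 and |e_c| < 0.05, recording the final weights at that point.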
In step S7, it is judged whether the training target is reached. When the expected target is reached, in step S8 the offline training ends and the final weights of the Action network and of the Critic network are recorded. Otherwise, the method returns to step S6 and training continues.
In this embodiment, the expected target is |e_a| < 0.05 and |e_c| < 0.05, where e_a = J(k) and e_c = αJ(k) - J(k-1) + r(k). The weights of the Action network and of the Critic network are recorded after the target is reached.
The present invention also provides a method for online control of intersection traffic signals using an ADHDP controller trained by the above method, comprising:
initializing the Action network and the Critic network with the final weights of the Action network and of the Critic network, respectively; inputting the real-time data of the online system (including the flow q_j and queue length h_j of each lane at the intersection) to the ADHDP controller; obtaining the system state according to the definitions in step S1; feeding the system state to the Action network; and using the output of the Action network as the system control parameters to control the intersection traffic signals. Optionally, the method may also perform online training according to steps S5 and S6, so as to update the weights of the Action network and of the Critic network in real time. A sketch of this online control loop is given below.
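A sketch of the online control loop, reusing the helpers above (with `action_net` assumed to hold the final offline weights), could look like the following; the detector interface `get_realtime_data`, the signal interface `apply_splits`, and the reward callback `r_of` are hypothetical placeholders, since the patent does not specify them.

```python
def online_control(get_realtime_data, apply_splits, online_learning=True,
                   r_of=lambda S: 0.0, alpha=0.2):
    """Online control with the trained ADHDP controller (two-phase example)."""
    J_prev = 0.0
    while True:                                      # one iteration per control cycle
        flows, queues = get_realtime_data()          # real-time q_j and h_j from field detectors
        S_k = system_state(flows, queues)            # state per the step-S1 definitions
        a1 = float(action_net.forward(S_k)[0])       # split of the first phase
        apply_splits([a1, 1.0 - a1])                 # a_2 = 1 - a_1
        if online_learning:                          # optional online update (steps S5 and S6)
            J_k, _, _ = adhdp_train_step(S_k, J_prev, r_of(S_k), alpha=alpha)
            J_prev = J_k
```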
The method steps of the present invention need not be performed in the illustrated order. Without departing from the spirit of the present invention, in alternative embodiments the above steps may be performed in a different order and/or some steps may be performed in parallel. Such variations all fall within the protection scope of the present invention.
The above methods of the present invention may be implemented by equipment with computing capability (for example a processor) executing computer instructions stored in a memory. One example of this implementation is ADHDP controller offline training equipment for intersection traffic signal control, the ADHDP controller comprising an Action network and a Critic network, the equipment comprising: a first device, which defines the system state, the reward function, the splits, and the system control parameters; a second device, which establishes the Action network and the Critic network, wherein the Action network is a BP neural network with one hidden layer whose input layer has P neurons, output layer has P-1 neurons, and hidden layer has M_a neurons (an empirical value), and the Critic network is a BP neural network with one hidden layer whose input layer has 2P-1 neurons, output layer has 1 neuron, and hidden layer has M_c neurons (an empirical value); a third device, which initializes the ADHDP controller, including initializing the Action network weights and the Critic network weights; a fourth device, which, before each control cycle ends, obtains the system state, feeds it to the Action network, outputs the corresponding system control parameters u(k), and passes u(k) to the simulation software to guide the operation of the next cycle; a fifth device, which feeds the system state S(k) and the system control parameters u(k) to the Critic network and outputs the performance index J(k); a sixth device, which alternately trains the Critic network according to the performance index and the reward function and trains the Action network according to the performance index, so as to update the weights of the Critic network and of the Action network; and a seventh device, which judges whether the expected target is reached: when the expected target is reached, the offline training ends and the final weights of the Action network and of the Critic network are recorded; otherwise, the sixth device continues training.
Another example of this implementation is equipment for online control of intersection traffic signals using an ADHDP controller trained by the above equipment, comprising: an eighth device, which initializes the Action network and the Critic network with the final weights of the Action network and of the Critic network, respectively; a ninth device, which inputs the real-time traffic data of the online system to the ADHDP controller; and a tenth device, which, according to the definitions in the first device, obtains the system state from the real-time traffic data of the online system, feeds the system state to the Action network, and uses the output of the Action network as the system control parameters to control the intersection traffic signals.
In this implementation, each of the above devices is a functional module produced by the computing equipment executing the instructions.
Although the present invention has been shown and described with reference to certain exemplary embodiments thereof, those skilled in the art will understand that various changes in form and detail may be made without departing from the spirit and scope of the present invention as defined by the claims and their equivalents. Therefore, the scope of the present invention should not be limited to the above embodiments, but should be determined not only by the claims but also by their equivalents.

Claims (30)

1. An ADHDP controller offline training method for intersection traffic signal control, the ADHDP controller comprising an Action network and a Critic network, the method comprising:
in step S1, defining the system state, the reward function, the splits, and the system control parameters;
in step S2, establishing the Action network and the Critic network, wherein:
the Action network is a BP neural network with one hidden layer, whose input layer has P neurons, output layer has P-1 neurons, and hidden layer has M_a neurons, M_a being an empirical value; and
the Critic network is a BP neural network with one hidden layer, whose input layer has 2P-1 neurons, output layer has 1 neuron, and hidden layer has M_c neurons, M_c being an empirical value;
in step S3, initializing the ADHDP controller, including initializing the Action network weights and the Critic network weights;
in step S4, before each control cycle ends, obtaining the system state, feeding it to the Action network, outputting the corresponding system control parameters u(k), and passing u(k) to simulation software to guide the operation of the next cycle;
in step S5, feeding the system state S(k) and the system control parameters u(k) to the Critic network and outputting the performance index J(k);
in step S6, alternately training the Critic network according to the performance index and the reward function and training the Action network according to the performance index, so as to update the weights of the Critic network and of the Action network; and
in step S7, judging whether the expected target is reached: when the expected target is reached, in step S8, ending the offline training and recording the final weights of the Action network and of the Critic network; otherwise, returning to step S6 to continue training.
2. The method according to claim 1, wherein defining the system state, the reward function, the splits, and the system control parameters comprises:
defining the system state, including: supposing each control cycle has P phases, the duration of phase i is T_i, and L_i lanes obtain the right of way during phase i; the maximum queue length of each lane is h_j, the phase queue length is H_i = max{h_j}, and the phase average queue length is $\bar{H}_i$; the flow of each lane is q_j, the phase flow is Q_i = max{q_j}, and the phase saturation degree is defined as s_i, where 1 <= i <= P, 1 <= j <= L_i, and ε is a normalization constant; and defining the system state as S(k) = {s_i(k)}, 1 <= i <= P, where k is the simulation step number, the step length is the time span C_k of the k-th control cycle, and C_k is determined from historical traffic data by the Webster method;
defining the reward function r(k), where N = P-1 and P >= 2;
defining the splits a_i, where 1 <= i <= P-1; the split is the ratio of the green-light duration of the i-th phase to the duration of the control cycle, and the split of the last phase is $a_P = 1 - \sum_{i=1}^{P-1} a_i$; and
defining the system control parameters as u(k) = {a_i(k)}, 1 <= i <= P.
3. The method according to claim 2, wherein each control cycle is one complete traffic signal change period of a given intersection.
4. The method according to claim 2, wherein each phase corresponds to one traffic signal state of the given intersection.
5. The method according to claim 1, wherein initializing the ADHDP controller further comprises:
setting the learning rate of the Action network to l_a, l_a taking a value between 0 and 1, and setting the number of training iterations per step to N_a, N_a taking a value between 5 and 50;
setting the learning rate of the Critic network to l_c, l_c taking a value between 0 and 1, and setting the number of training iterations per step to N_c, N_c taking a value between 5 and 50; and
for both the Action network and the Critic network, using the Sigmoid function as the activation function, with β equal to 1.
6. The method according to claim 2, wherein obtaining the system state comprises: receiving the flow q_j and queue length h_j data of each lane at the intersection from the simulation software, and obtaining the system state S(k).
7. The method according to claim 2, wherein training the Critic network and the Action network comprises:
calculating the training error of the Critic network according to the performance index and the reward function;
updating the weights of the Critic network according to this training error;
calculating the training error of the Action network according to the performance index; and
updating the weights of the Action network according to this training error.
8. The method according to claim 7, wherein:
the training error of the Critic network is defined as
$E_c(k) = \frac{1}{2}\left[\alpha J(k) - J(k-1) + r(k)\right]^2$, where α takes a value between 0 and 1;
the weights of the Critic network are updated as
$w_c(k+1) = w_c(k) + \Delta w_c(k)$
$\Delta w_c(k) = -\frac{\partial E_c(k)}{\partial w_c(k)} = -\frac{\partial E_c(k)}{\partial J(k)}\,\frac{\partial J(k)}{\partial w_c(k)}$;
the training error of the Action network is defined as
$E_a(k) = \frac{1}{2}\left[J(k) - G_c(k)\right]^2$, where G_c(k) is the control target and G_c(k) = 0; and
the weights of the Action network are updated as
$w_a(k+1) = w_a(k) + \Delta w_a(k)$
$\Delta w_a(k) = -\frac{\partial E_a(k)}{\partial w_a(k)} = -\frac{\partial E_a(k)}{\partial J(k)}\,\frac{\partial J(k)}{\partial u(k)}\,\frac{\partial u(k)}{\partial w_a(k)}$.
9. The method according to claim 1, wherein M_a takes a value between 5 and 20, and M_c takes a value between 5 and 20.
10. The method according to claim 1, wherein:
the expected target is the total intersection delay time or the average vehicle speed of each lane;
if the expected target is the total intersection delay time, then in step S7, when the total delay time is less than or close to the expected total delay time, the method proceeds to step S8; otherwise it returns to step S6 to continue training; and
if the expected target is the average vehicle speed of each lane, then when the average vehicle speed of each lane is greater than or close to the expected average vehicle speed, the method proceeds to step S8; otherwise it returns to step S6 to continue training.
11. The method according to claim 2, wherein C_k takes a value between 30 and 120 seconds.
12. A method for online control of intersection traffic signals using an ADHDP controller trained by the method according to any one of claims 1-11, comprising:
initializing the Action network and the Critic network with the final weights of the Action network and of the Critic network, respectively;
inputting the real-time traffic data of the online system to the ADHDP controller; and
according to the definitions in step S1, obtaining the system state from the real-time traffic data of the online system, feeding the system state to the Action network, and using the output of the Action network as the system control parameters to control the intersection traffic signals.
13. The method according to claim 12, wherein the real-time traffic data comprises the flow q_j and queue length h_j of each lane at the intersection.
14. The method according to claim 12, further comprising performing online training according to steps S5 and S6, so as to update the weights of the Action network and of the Critic network in real time.
15. The method according to claim 12, wherein the real-time traffic data of the online system comprises the flow q_j and queue length h_j of each lane at the intersection.
16. ADHDP controller offline training equipment for intersection traffic signal control, the ADHDP controller comprising an Action network and a Critic network, the equipment comprising:
a first device, which defines the system state, the reward function, the splits, and the system control parameters;
a second device, which establishes the Action network and the Critic network, wherein:
the Action network is a BP neural network with one hidden layer, whose input layer has P neurons, output layer has P-1 neurons, and hidden layer has M_a neurons, M_a being an empirical value; and
the Critic network is a BP neural network with one hidden layer, whose input layer has 2P-1 neurons, output layer has 1 neuron, and hidden layer has M_c neurons, M_c being an empirical value;
a third device, which initializes the ADHDP controller, including initializing the Action network weights and the Critic network weights;
a fourth device, which, before each control cycle ends, obtains the system state, feeds it to the Action network, outputs the corresponding system control parameters u(k), and passes u(k) to simulation software to guide the operation of the next cycle;
a fifth device, which feeds the system state S(k) and the system control parameters u(k) to the Critic network and outputs the performance index J(k);
a sixth device, which alternately trains the Critic network according to the performance index and the reward function and trains the Action network according to the performance index, so as to update the weights of the Critic network and of the Action network; and
a seventh device, which judges whether the expected target is reached: when the expected target is reached, the offline training ends and the final weights of the Action network and of the Critic network are recorded; otherwise, the sixth device continues training.
17. The equipment according to claim 16, wherein defining the system state, the reward function, the splits, and the system control parameters comprises:
defining the system state, including: supposing each control cycle has P phases, the duration of phase i is T_i, and L_i lanes obtain the right of way during phase i; the maximum queue length of each lane is h_j, the phase queue length is H_i = max{h_j}, and the phase average queue length is $\bar{H}_i$; the flow of each lane is q_j, the phase flow is Q_i = max{q_j}, and the phase saturation degree is defined as s_i, where 1 <= i <= P, 1 <= j <= L_i, and ε is a normalization constant; and defining the system state as S(k) = {s_i(k)}, 1 <= i <= P, where k is the simulation step number, the step length is the time span C_k of the k-th control cycle, and C_k is determined from historical traffic data by the Webster method;
defining the reward function r(k), where N = P-1 and P >= 2;
defining the splits a_i, where 1 <= i <= P-1; the split is the ratio of the green-light duration of the i-th phase to the duration of the control cycle, and the split of the last phase is $a_P = 1 - \sum_{i=1}^{P-1} a_i$; and
defining the system control parameters as u(k) = {a_i(k)}, 1 <= i <= P.
18. The equipment according to claim 17, wherein each control cycle is one complete traffic signal change period of a given intersection.
19. The equipment according to claim 17, wherein each phase corresponds to one traffic signal state of the given intersection.
20. The equipment according to claim 16, wherein initializing the ADHDP controller further comprises:
setting the learning rate of the Action network to l_a, l_a taking a value between 0 and 1, and setting the number of training iterations per step to N_a, N_a taking a value between 5 and 50;
setting the learning rate of the Critic network to l_c, l_c taking a value between 0 and 1, and setting the number of training iterations per step to N_c, N_c taking a value between 5 and 50; and
for both the Action network and the Critic network, using the Sigmoid function as the activation function, with β equal to 1.
21. The equipment according to claim 17, wherein obtaining the system state comprises: receiving the flow q_j and queue length h_j data of each lane at the intersection from the simulation software, and obtaining the system state S(k).
22. The equipment according to claim 17, wherein training the Critic network and the Action network comprises:
calculating the training error of the Critic network according to the performance index and the reward function;
updating the weights of the Critic network according to this training error;
calculating the training error of the Action network according to the performance index; and
updating the weights of the Action network according to this training error.
23. The equipment according to claim 22, wherein:
the training error of the Critic network is defined as
$E_c(k) = \frac{1}{2}\left[\alpha J(k) - J(k-1) + r(k)\right]^2$, where α takes a value between 0 and 1;
the weights of the Critic network are updated as
$w_c(k+1) = w_c(k) + \Delta w_c(k)$
$\Delta w_c(k) = -\frac{\partial E_c(k)}{\partial w_c(k)} = -\frac{\partial E_c(k)}{\partial J(k)}\,\frac{\partial J(k)}{\partial w_c(k)}$;
the training error of the Action network is defined as
$E_a(k) = \frac{1}{2}\left[J(k) - G_c(k)\right]^2$, where G_c(k) is the control target and G_c(k) = 0; and
the weights of the Action network are updated as
$w_a(k+1) = w_a(k) + \Delta w_a(k)$
$\Delta w_a(k) = -\frac{\partial E_a(k)}{\partial w_a(k)} = -\frac{\partial E_a(k)}{\partial J(k)}\,\frac{\partial J(k)}{\partial u(k)}\,\frac{\partial u(k)}{\partial w_a(k)}$.
24. The equipment according to claim 16, wherein M_a takes a value between 5 and 20, and M_c takes a value between 5 and 20.
25. The equipment according to claim 16, wherein:
the expected target is the total intersection delay time or the average vehicle speed of each lane;
if the expected target is the total intersection delay time, then in the seventh device, when the total delay time is less than or close to the expected total delay time, the offline training ends and the final weights of the Action network and of the Critic network are recorded; otherwise, the sixth device continues training; and
if the expected target is the average vehicle speed of each lane, then when the average vehicle speed of each lane is greater than or close to the expected average vehicle speed, the offline training ends and the final weights of the Action network and of the Critic network are recorded; otherwise, the sixth device continues training.
26. The equipment according to claim 17, wherein C_k takes a value between 30 and 120 seconds.
27. Equipment for online control of intersection traffic signals using an ADHDP controller trained by the equipment according to any one of claims 16-26, comprising:
an eighth device, which initializes the Action network and the Critic network with the final weights of the Action network and of the Critic network, respectively;
a ninth device, which inputs the real-time traffic data of the online system to the ADHDP controller; and
a tenth device, which, according to the definitions in the first device, obtains the system state from the real-time traffic data of the online system, feeds the system state to the Action network, and uses the output of the Action network as the system control parameters to control the intersection traffic signals.
28. The equipment according to claim 27, wherein the real-time traffic data comprises the flow q_j and queue length h_j of each lane at the intersection.
29. The equipment according to claim 27, further comprising using the fifth device and the sixth device to perform online training, so as to update the weights of the Action network and of the Critic network in real time.
30. The equipment according to claim 27, wherein the real-time traffic data of the online system comprises the flow q_j and queue length h_j of each lane at the intersection.
CN201510665966.1A 2015-10-15 2015-10-15 Intersection traffic signal control method and equipment Active CN105279978B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510665966.1A CN105279978B (en) 2015-10-15 2015-10-15 Intersection traffic signal control method and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510665966.1A CN105279978B (en) 2015-10-15 2015-10-15 Intersection traffic signal control method and equipment

Publications (2)

Publication Number Publication Date
CN105279978A (en) 2016-01-27
CN105279978B CN105279978B (en) 2018-05-25

Family

ID=55148906

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510665966.1A Active CN105279978B (en) 2015-10-15 2015-10-15 Intersection traffic signal control method and equipment

Country Status (1)

Country Link
CN (1) CN105279978B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108459506A (en) * 2018-03-20 2018-08-28 清华大学 A kind of parameter tuning method of the virtual inertia controller of wind turbine
CN114973698A (en) * 2022-05-10 2022-08-30 阿波罗智联(北京)科技有限公司 Control information generation method and machine learning model training method and device

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20010046291A (en) * 1999-11-11 2001-06-15 정환도 Traffic signal control system and method using CDMA wireless communication network
KR20050051956A (en) * 2003-11-28 2005-06-02 주식회사 비츠로시스 Control system and method for local divisional traffic signal
JP2007122584A (en) * 2005-10-31 2007-05-17 Sumitomo Electric Ind Ltd Traffic signal control system and control method of traffic signal control system
CN102568220A (en) * 2010-12-17 2012-07-11 上海市长宁区少年科技指导站 Self-adaptive traffic control system
CN104882006A (en) * 2014-07-03 2015-09-02 中国科学院沈阳自动化研究所 Message-based complex network traffic signal optimization control method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
张鹏程: "Kernel-based reinforcement learning methods for continuous spaces and their applications", China Master's Theses Full-text Database, Information Science and Technology Series *
齐驰: "Approximate dynamic programming methods and their applications in transportation", China Doctoral Dissertations Full-text Database, Engineering Science and Technology II *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108459506A (en) * 2018-03-20 2018-08-28 清华大学 A kind of parameter tuning method of the virtual inertia controller of wind turbine
CN108459506B (en) * 2018-03-20 2020-12-08 清华大学 Parameter setting method of virtual inertia controller of fan
CN114973698A (en) * 2022-05-10 2022-08-30 阿波罗智联(北京)科技有限公司 Control information generation method and machine learning model training method and device
CN114973698B (en) * 2022-05-10 2024-04-16 阿波罗智联(北京)科技有限公司 Control information generation method and machine learning model training method and device

Also Published As

Publication number Publication date
CN105279978B (en) 2018-05-25

Similar Documents

Publication Publication Date Title
Belletti et al. Expert level control of ramp metering based on multi-task deep reinforcement learning
CN109492814B (en) Urban traffic flow prediction method, system and electronic equipment
CN110570672B (en) Regional traffic signal lamp control method based on graph neural network
US11182676B2 (en) Cooperative neural network deep reinforcement learning with partial input assistance
CN109284812B (en) Video game simulation method based on improved DQN
JP6092477B2 (en) An automated method for correcting neural dynamics
CN106781489A (en) A kind of road network trend prediction method based on recurrent neural network
CN105700526A (en) On-line sequence limit learning machine method possessing autonomous learning capability
KR20160125967A (en) Method and apparatus for efficient implementation of common neuron models
US20230367934A1 (en) Method and apparatus for constructing vehicle dynamics model and method and apparatus for predicting vehicle state information
CN112172813A (en) Car following system and method for simulating driving style based on deep inverse reinforcement learning
Tagliaferri et al. A real-time strategy-decision program for sailing yacht races
CN114415507B (en) Deep neural network-based smart hand-held process dynamics model building and training method
CN105279978A (en) Intersection traffic signal control method and device
Wang et al. Dynamic-horizon model-based value estimation with latent imagination
Hilleli et al. Toward deep reinforcement learning without a simulator: An autonomous steering example
CN113821903A (en) Temperature control method and device, modular data center and storage medium
CN117008620A (en) Unmanned self-adaptive path planning method, system, equipment and medium
US20230162539A1 (en) Driving decision-making method and apparatus and chip
CN105513380B (en) The off-line training method and system and its On-Line Control Method and system of EADP controllers
KR102624710B1 (en) Structural response estimation method using gated recurrent unit
Koltovska et al. Intelligent Agent Based Traffic Signal Control on Isolated Intersections.
Li et al. Prediction for short-term traffic flow based on optimized wavelet neural network model
CN115512558A (en) Traffic light signal control method based on multi-agent reinforcement learning
JP2017513110A (en) Contextual real-time feedback for neuromorphic model development

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant