CN103683337B - Interconnected-grid CPS instruction dynamic allocation optimization method - Google Patents

Interconnected-grid CPS instruction dynamic allocation optimization method

Info

Publication number
CN103683337B
CN103683337B (application CN201310656811.2A)
Authority
CN
China
Prior art keywords
unit
cps
value
state
dynamic allocation
Prior art date
Legal status
Active
Application number
CN201310656811.2A
Other languages
Chinese (zh)
Other versions
CN103683337A (en)
Inventor
Yu Tao (余涛)
Zhang Xiaoshun (张孝顺)
Current Assignee
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN201310656811.2A priority Critical patent/CN103683337B/en
Publication of CN103683337A publication Critical patent/CN103683337A/en
Application granted granted Critical
Publication of CN103683337B publication Critical patent/CN103683337B/en


Landscapes

  • Supply And Distribution Of Alternating Current (AREA)

Abstract

The invention discloses a dynamic allocation optimization method for interconnected-grid CPS instructions, comprising the following steps: step 1, determine the control objective; step 2, determine the discrete state set S; step 3, select the balancing unit and determine the joint-action discrete set A; step 4, calculate the instantaneous values of the area control error ACE(k) and of CPS(k) for this area; step 5, obtain the immediate reward R_i(k) of each agent; step 6, solve for the correlated-equilibrium joint strategy from the linear equilibrium constraints and the equilibrium selection function; step 7, have all units j execute the corresponding actions; step 8, when the next control cycle arrives, return to step 4. The method effectively reduces the number of frequent adjustments of the various classes of units and improves the CPS control performance of the AGC system; it is especially suited to dynamic allocation optimization of CPS instructions in thermal-power-dominated interconnected grids with complex unit combinations.

Description

Interconnected-grid CPS instruction dynamic allocation optimization method
Technical field
The present invention relates to the technical field of automatic generation control (AGC, i.e., frequency regulation) of electric power systems, and in particular to an interconnected-grid CPS instruction dynamic allocation optimization method. The dynamic allocation optimization method is suited to CPS instruction dynamic allocation optimization in thermal-power-dominated interconnected grids with complex unit combinations.
Background technology
Since the Control Performance Standard (CPS) was introduced for Automatic Generation Control (AGC) of interconnected grids, the CPS compliance rate has become the key factor shaping AGC control strategies. Allocating the total CPS regulation command to each AGC unit according to an optimization algorithm is one of the critical steps of an AGC control system.
Traditional AGC adopts an equal-share scheme when allocating regulation power; it ignores the differences among units and cannot meet the needs of CPS regulation. Apart from reinforcement learning, most existing designs of CPS control strategies are classical PI control structures, all of which can improve the CPS indices; among them, NARX neural-network prediction and fuzzy-control principles have been introduced into the study of CPS control strategies, reducing frequent unit movements to a certain degree while raising the CPS compliance rate. Conventional PI control, NARX neural-network prediction, and fuzzy control are robust to the model uncertainty of the controlled object, but they still have shortcomings in optimization design. Existing theoretical research shows that the strong self-learning and self-optimizing abilities of reinforcement learning give better coordination and robustness in solving dispatch-side optimal generation control. Yu Tao, Wang Yuming, and Liu Qianjin proposed a Q-learning-based dynamic optimal allocation method for CPS instructions in "Q-learning algorithm for dynamic optimal allocation of interconnected-grid CPS regulation commands" (Proceedings of the CSEE); it adapts well to changes in the operating environment, its allocation behavior is not fixed, and it improves the control adaptability and robustness of the whole AGC system. Single-step Q-learning, when applied to dynamic optimization of AGC power instructions in thermal-power-dominated grids whose units have large time delays, exhibits deficiencies such as slow convergence that hinder acquisition of the optimal policy; Yu Tao, Wang Yuming, Zhen Weiguo, et al. introduced eligibility traces in "Dynamic optimal allocation algorithm for automatic generation control instructions based on multi-step backtracking Q-learning" (Control Theory & Applications) to solve the delayed-reward problem brought by the long time-delay links of thermal units, accelerating the convergence of the algorithm, meeting the real-time requirements of on-site application, and saving system regulation cost while keeping the high AGC compliance rate. To solve the curse-of-dimensionality problem in Q-learning-based allocation across multiple units, Yu Tao, Wang Yuming, Ye Wenjia, and Liu Qianjin, in "Multi-objective dynamic optimal allocation algorithm for CPS instructions based on improved hierarchical reinforcement learning", first classified all units of the grid by their regulation time delay, allocated the CPS instructions layer by layer to form a task hierarchy, and introduced a coordination factor between the layers of the hierarchical Q-learning algorithm; the improved hierarchical Q-learning effectively speeds up the convergence of the original algorithm. Although classical reinforcement learning can converge to an equilibrium point under the grid CPS assessment standard, the distribution-factor action policies adopted during allocation limit the combination space of unit outputs, so the equilibrium found is not necessarily the optimal one; the various classes of units adjust rather frequently, the number of steps to convergence is relatively large, and the CPS1 and ACE real-time curves after convergence are not smooth. In addition, Q-learning, Q(λ)-learning, and hierarchical Q-learning are all in essence single-agent reinforcement learning algorithms: they involve no cooperative learning among agents, and the combined actions of the agents are not necessarily the optimal joint action. The method of the invention, CEQ(λ) (Correlated-Equilibrium-Q(λ)), can, through the game among multiple agents under correlated-equilibrium reinforcement learning, form equilibrium points superior to those of single-agent Q-learning, conventional PI control, NARX neural-network prediction, and fuzzy control; it is better suited to dynamic optimal allocation of CPS instructions in coal-power-dominated interconnected grids with complex unit combinations and effectively improves the adaptability and robustness of the system.
Summary of the invention
The object of the present invention is to overcome the shortcomings and deficiencies of the prior art by providing an interconnected-grid CPS instruction dynamic allocation optimization method based on CEQ(λ) multi-agent cooperative learning. The CEQ(λ) learning algorithm is an improvement on the CEQ algorithm and an important watershed in the development of reinforcement learning from single agent to multi-agent: the dynamic action policy of each agent is no longer decided merely by its own historical action policy and rewards, but by the dynamic equilibrium point formed by the action probabilities of the other agents. In addition, in the application of CEQ(λ) to CPS instruction dynamic allocation, the command-allocation action of each type of AGC unit is no longer the proportionality coefficient adopted in the literature mentioned above, but the actual increase or decrease of unit output; the joint-action combination space of all types of AGC units is thus much larger than in that literature, which raises the probability of finding a better equilibrium point.
The object of the present invention is achieved through the following technical solution: an interconnected-grid CPS instruction dynamic allocation optimization method, comprising the following steps:
Step 1: determine the control objective;
Step 2: determine the discrete state set S;
Step 3: select one class of units as the balancing unit, with the other units participating in CEQ(λ) cooperative learning, and at the same time determine the joint-action discrete set A;
Step 4: at the start of each control cycle, collect the real-time operating data of the controlled regional grid, comprising the frequency deviation Δf, the power deviation ΔP, and the actual regulation output ΔP_{Gi} of each unit; calculate the instantaneous values of the area control error ACE(k) and of the control performance standard CPS(k) for this area;
Step 5: from the current state s, obtain the immediate reward R_i(k) of each unit i;
Step 6: from the linear equilibrium constraints
$$\sum_{a_{-i}\in A_{-i}}\pi_s(a)\,Q_i(s,a)\;\ge\;\sum_{a_{-i}\in A_{-i}}\pi_s(a)\,Q_i\big(s,(a_{-i},a_i')\big)$$
and the equilibrium selection function f, solve for the optimal correlated-equilibrium joint strategy π_s*;
where A_{-i} = ∏_{j≠i} A_j, A_i is the action set of agent i, s is the current state, a_i is the action of agent i, -i denotes the set of agents other than agent i, π is the equilibrium strategy, and Q_i(s,a) is the state-action value function of agent i;
Step 7: for every learning unit j, update the state-action value function Q_j(s,ā) and the eligibility-trace matrix e_j(s,ā) over all state-action pairs (s,ā), solve again from the updated Q values for the stochastic optimal equilibrium joint strategy under the current state s, select each unit's cooperative action by π_s*, and update the state s and the action a;
Step 8: when the next control cycle arrives, return to step 4.
The control objective in step 1 is selected as minimum area control error ACE, minimum generation cost, or highest control performance standard CPS.
The discrete state set S in step 2 can specifically be determined by partitioning the ranges of the area control error ACE(k) of the controlled regional grid, the value of the control performance standard CPS(k), and the power deviation |ΔP_{error-i}| of each of its units.
The balancing unit in step 3 is generally a coal-fired unit, while units such as hydropower and liquefied natural gas (LNG) units, which have narrower regulation-capacity bounds but smaller time delay, higher regulation rate, and lower regulation cost, are selected to participate in the equilibrium learning.
The expression of the joint-action discrete set A in step 3 is:
A = A_1 × A_2 × … × A_i × … × A_{n-1},
where A_i is the output discrete action set of agent i and n is the number of agents.
The real-time operating data in step 4 are collected by the computer and the supervisory control (SCADA) system.
R_i(k) in step 5 is generally designed as a linear combination of the kth-step difference values of ACE and CPS1 of the controlled regional grid and the power deviation ΔP_{error-i} of each unit.
Step 6 introduces the core idea of correlated equilibrium, namely the linear constraints of the correlated-equilibrium strategy and the uCEQ equilibrium selection function suited to CPS instruction dynamic allocation optimization, so that the coordinated joint action among the agents can reach the optimum.
The iterative update formula of the Q value $Q_j(s,\vec a)$ in step 7 is:
$$Q_j(s,\vec a) = Q_j(s,\vec a) + \alpha\times\delta_j\times e_j(s,\vec a),$$
where $Q_j(s,\vec a)$ is the state-action value function of agent j at the state-action pair $(s,\vec a)$, $\delta_j$ is the learning error, and $e_j(s,\vec a)$ is the eligibility-trace matrix;
$$\delta_j = (1-\gamma)\times R_j(s,\vec a) + \gamma\times V_j(s') - Q_j(s,\vec a),\qquad V_i^{t+1}(s) = \sum_a \pi_s^t(a)\,Q_i^t(s,a),$$
where γ is the discount factor, with 0 ≤ γ ≤ 1; α is the learning rate, with 0 ≤ α ≤ 1; $R_j(s,\vec a)$ is the reward obtained by agent j after executing the action $\vec a$ in the current state s; $V_j(s')$ is the value function of agent j at the next state s'; $Q_i^t(s,a)$ is the state-action value function of agent i at the state-action pair (s,a) at time t; $\pi_s^t(a)$ is the equilibrium strategy; and $V_i^{t+1}(s)$ is the value function of agent i at state s at time t+1.
The iterative update formula of the eligibility-trace matrix $e_j(s,\vec a)$ in step 7 is:
$$e_j(s,\vec a) = \gamma\times\lambda\times e_j(s,\vec a),$$
where $e_j(s,\vec a)$ is the eligibility-trace matrix, γ is the discount factor, with 0 ≤ γ ≤ 1, and λ is the trace-decay factor, with 0 ≤ λ ≤ 1.
The concrete scheme of the present invention comprises the following steps:
1. Select the control objective.
The objective of the allocation-control task admits several choices: minimum area control error (ACE), minimum generation cost, or highest CPS index. It is specifically described as:
$$\min E = \sum_{t=1}^{T} e(t)$$
$$\text{s.t.}\quad \Delta P_{order\text{-}\Sigma}(t) = \sum_{i=1}^{n}\Delta P_{order\text{-}i}(t),$$
$$0 \le \Delta P_{order\text{-}i}(t) - \Delta P_{order\text{-}i}(t-1) \le P_{rate}^{+},$$
$$P_{rate}^{-} \le \Delta P_{order\text{-}i}(t) - \Delta P_{order\text{-}i}(t-1) \le 0,$$
$$\Delta P_{Gi}^{\min} \le \Delta P_{Gi}(t) \le \Delta P_{Gi}^{\max},$$
where t is the discrete time instant; e is the deviation between the control-objective value and the actual control output; E is the accumulated deviation of e over the time period T; ΔP_{order-Σ} is the CPS command value of the AGC system, in MW; ΔP_{order-i} is the regulation command assigned to the ith unit, in MW; P_{rate}^{+} is the ramp-up rate limit of the ith unit, in MW/min; P_{rate}^{-} is the ramp-down rate limit of the ith unit, in MW/min; ΔP_{Gi} is the actual regulation output of the ith unit, in MW; and ΔP_{Gi}^{min}, ΔP_{Gi}^{max} are respectively the lower and upper regulation-capacity limits of the ith unit, in MW.
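As a small illustration of how the ramp-rate and capacity constraints above act on a candidate command, the following sketch clips a proposed regulation command for the ith unit; the helper name and signature are assumptions for illustration, not from the patent.

```python
def clip_command(dp_new, dp_prev, rate_up, rate_dn, p_min, p_max):
    """Clip a candidate command dp_new (MW) against the previous command
    dp_prev, the ramp limits rate_dn <= 0 <= rate_up (MW per cycle), and
    the regulation-capacity limits [p_min, p_max] (MW)."""
    step = min(max(dp_new - dp_prev, rate_dn), rate_up)   # ramp-rate limits
    return min(max(dp_prev + step, p_min), p_max)         # capacity limits
```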
2. Select the balancing unit and the joint-action space.
Since the total command is a known constraint, cooperative learning need only be carried out for n-1 of the n classes of AGC units; the CPS instruction regulation amount of the nth class of units is then
$$\Delta P_{order\text{-}n}(t) = \Delta P_{order\text{-}\Sigma}(t) - \sum_{i=1}^{n-1}\Delta P_{order\text{-}i}(t).$$
The method of the invention defines the nth unit as the balancing unit.
Determine the joint-action discrete set A, where A = A_1 × A_2 × … × A_i × … × A_{n-1} and A_i is the output discrete action set of agent i.
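A minimal sketch of building the joint-action discrete set and deriving the balancing unit's command from the known total follows, using the action sets of the embodiment described later; the function names are illustrative, not from the patent.

```python
from itertools import product

def joint_action_set(action_sets):
    """A = A_1 x A_2 x ... x A_{n-1}, one tuple per joint action (MW)."""
    return list(product(*action_sets))

def balancing_command(dp_order_total, joint_action):
    """dP_order_n = dP_order_total - sum of the n-1 learning units' commands."""
    return dp_order_total - sum(joint_action)

# With the two learning units of the embodiment (LNG and hydro), 11 actions
# each, the joint-action set has 11 * 11 = 121 elements.
A1 = A2 = [-100, -50, -20, -10, -5, 0, 5, 10, 20, 50, 100]
A = joint_action_set([A1, A2])
assert len(A) == 121
```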
3. Solve the correlated equilibrium.
In a Markov decision process, when each agent maximizes its own cumulative reward without relying on the action probability distributions of the other agents, the dynamic equilibrium state formed is a Nash equilibrium. A correlated equilibrium is the opposite: it is the dynamic equilibrium point formed when each agent, in maximizing its own reward, depends on the action probability distributions of the other agents. The mathematical description of a correlated equilibrium is:
$$\sum_{a_{-i}\in A_{-i}}\pi(a_{-i},a_i)\,R_i(a_{-i},a_i)\;\ge\;\sum_{a_{-i}\in A_{-i}}\pi(a_{-i},a_i)\,R_i(a_{-i},a_i'),$$
where A_{-i} = ∏_{j≠i} A_j, π is the equilibrium strategy, and R_i is the immediate reward function of agent i. If a strategy π satisfies the above inequality for every agent i and all actions a_i, a_i' ∈ A_i with π(a_i) > 0, that strategy is a correlated-equilibrium point. A correlated equilibrium can be found straightforwardly by linear programming. For a Markov game (MG) with n agents, each having m actions, there are m^n joint actions in total and nm(m-1) linear constraints of the above form.
4. CEQ(λ) multi-agent cooperative learning algorithm.
Given the Q values Q_i^t(s,a) at time t for all agents i ∈ N, all states s ∈ S, and all actions a ∈ A(s), an equilibrium strategy π^t, and an equilibrium selection function f, the value functions Q_i^{t+1}(s,a) and V_i^{t+1}(s) of agent i at time t+1 are defined under the correlated-equilibrium condition by the Markov-game rule:
$$V_i^{t+1}(s) = \sum_{a\in A(s)} \pi_s^t(a)\,Q_i^t(s,a),$$
$$Q_i^{t+1}(s,a) = (1-\gamma)\,R_i(s,a) + \gamma\sum_{s'\in S} P[s'\mid s,a]\,V_i^{t+1}(s'),$$
$$\pi_s^{t+1} \in f\big(Q^{t+1}(s)\big).$$
The linear constraints of the correlated-equilibrium strategy state that for every agent i and all actions a_i, a_i' ∈ A_i with π_s(a_i) > 0,
$$\sum_{a_{-i}\in A_{-i}}\pi_s(a)\,Q_i(s,a)\;\ge\;\sum_{a_{-i}\in A_{-i}}\pi_s(a)\,Q_i\big(s,(a_{-i},a_i')\big).$$
The number of correlated-equilibrium strategies satisfying this inequality grows as the number of agents increases; by solving it for the optimal joint strategy π_s*, the optimal joint action can be executed among the AGC units.
In addition, the equilibrium selection function f used by the method of the invention is the uCEQ mentioned by Greenwald A, Hall K, and Zinkevich M in "Correlated Q-learning", namely:
$$f = \arg\max_{\pi_s\in CE}\;\sum_{i\in N}\sum_{\vec a\in A(s)} \pi_s(\vec a)\,Q_i(s,\vec a).$$
The physical meaning of uCEQ is maximizing the sum of the rewards of all agents: it "treats" every class of AGC unit fairly, improves the regional grid's CPS assessment compliance rate, and reduces the CPS power-regulation deviation, making it suitable for the CPS instruction dynamic allocation process with its very high real-time requirements.
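Putting the linear constraints and the uCEQ objective together, the equilibrium at a given state reduces to a single linear program over the joint-action probabilities. The following is a minimal sketch for two learning agents, assuming SciPy's linprog is available; the function name and matrix layout are illustrative, not from the patent.

```python
import numpy as np
from scipy.optimize import linprog

def uceq(Q1, Q2):
    """Solve the uCEQ correlated equilibrium at one state for two agents.

    Q1, Q2 : m x m Q-matrices at the current state; entry [j, k] is the
    value of joint action (a1_j, a2_k) for agent 1 and agent 2 respectively.
    Returns pi_s*(a1, a2) as an m x m probability matrix."""
    m = Q1.shape[0]
    nv = m * m                                 # one variable per joint action
    c = -(Q1 + Q2).reshape(nv)                 # uCEQ: maximize the total Q value

    # Deviation constraints in <= 0 form: for each agent and each ordered
    # action pair (j, j'), switching from j to j' must not pay off.
    rows = []
    for j in range(m):
        for jp in range(m):
            if jp == j:
                continue
            r1 = np.zeros((m, m))
            r1[j, :] = Q1[jp, :] - Q1[j, :]    # agent 1 deviating j -> j'
            rows.append(r1.reshape(nv))
            r2 = np.zeros((m, m))
            r2[:, j] = Q2[:, jp] - Q2[:, j]    # agent 2 deviating j -> j'
            rows.append(r2.reshape(nv))

    res = linprog(c, A_ub=np.array(rows), b_ub=np.zeros(len(rows)),
                  A_eq=np.ones((1, nv)), b_eq=[1.0], bounds=(0, 1))
    return res.x.reshape(m, m)
```

For m = 11 actions per agent, this linear program has exactly the 121 variables and 2 × 11 × 10 = 220 deviation constraints counted in the embodiment below.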
Relative to the prior art, the present invention has the following advantages and effects:
1. Under the CEQ(λ) algorithm the outputs of the individual units vary more continuously and smoothly, which makes the total actual generation of the units and the CPS1 curve smoother.
2. Under the CEQ(λ) algorithm the joint-action space among the units is larger, so a better equilibrium point can be found and the CPS assessment compliance rate can be effectively improved.
3. Under the CEQ(λ) algorithm the coal-fired units bear a larger proportion of the load disturbance while being less affected by the regulation-capacity limits of the hydropower units, which makes the method better suited to CPS instruction dynamic allocation in coal-power-dominated interconnected grids with scarce hydropower resources.
Brief description of the drawings
Fig. 1 shows the dynamic optimal allocation process of the AGC system load.
Fig. 2 shows the CEQ(λ) control decision process.
Fig. 3 shows the load-frequency control model of the two-area interconnected system.
Embodiments
The present invention is described in further detail below in conjunction with an embodiment and the accompanying drawings, but the embodiments of the present invention are not limited thereto.
Embodiment
This embodiment takes the load-frequency control model of the typical IEEE two-area interconnected system as the research object. The original model has only one unit simulating the generation link per area. Area A is first selected for pre-learning simulation in this example, so three unit models are used in area A in place of the original single unit: a coal-fired unit, a liquefied natural gas (LNG) unit, and a hydropower unit; area B still uses its original single-unit model. For the concrete model parameters and simulation design principles, refer to "Q-learning algorithm for dynamic optimal allocation of interconnected-grid CPS regulation commands" (Proceedings of the CSEE) by Yu Tao, Wang Yuming, and Liu Qianjin, as shown in Fig. 3. With improving the CPS compliance rate as the control objective, load disturbances are added to area A of the simulation model, comprising periodic load disturbances and random load disturbances. The CPS power allocator of regional grid A seeks the optimal joint action strategy among the units by synthesizing the ACE of this regional grid, the CPS1 instantaneous value, and the power deviation ΔP_{error-i} of each unit; the AGC control cycle is chosen as 8 s, and the modeling and simulation study is carried out in Simulink.
As shown in Fig. 1, in each AGC control cycle the grid dispatch center obtains the CPS instantaneous values, the plant generation schedules, and the related historical values from the SCADA database of the energy management system (EMS), sends them to the CPS controller, and calculates the total unit regulation command, the CPS instruction P_{order-Σ}. The dispatch center then combines the actual operating state of each unit and the grid conditions (mainly the instantaneous values of CPS1, ACE, and ΔP_{error-i}) and allocates the total command to the various classes of AGC units through the multi-agent CEQ(λ), thereby obtaining the target regulation output P_{order-i} of each unit. The CPS instructions are sent to the plant generation control units through the information transmission system. Meanwhile, each power plant delivers the actual regulation output P_{Gi} of its units and the related operating information into the EMS of the grid dispatch center through the information transmission system.
The CEQ(λ) adopted by the method of the invention in the CPS instruction allocator can make up for the lack of joint-strategy optimization among the units of a regional grid in traditional intelligent generation control: by obtaining the instantaneous ACE value of the regional grid, the CPS rolling mean, and the actual generation of each unit, it seeks the optimal joint action strategy online so as to maximize the long-term CPS return. As shown in Fig. 2, the CEQ(λ) control decision process divides into three stages:
1) iteratively update the Q-value matrix and the eligibility-trace matrix e(s) under the current state;
2) solve the correlated equilibrium by linear programming under the given equilibrium objective function uCEQ;
3) execute the optimal joint action strategy, observe the system response, and return the reward and the current state.
The design of this control method is affected very little by the regional grid and the models of the various units, and its multi-agent online self-learning character makes it highly suitable for uncertain stochastic AGC systems.
The CPS instruction dynamic allocation optimal control method under CEQ(λ) multi-agent cooperative learning is as follows:
1) take the highest CPS index as the control objective; for the constraints of the various classes of units, see "Q-learning algorithm for dynamic optimal allocation of interconnected-grid CPS regulation commands" by Yu Tao, Wang Yuming, and Liu Qianjin;
2) analyze the real-time power deviations of the units to determine the discrete state set S: this example divides the |ΔP_{error-i}| value into 10 states, [0,5), [5,10), [10,20), [20,50), [50,100), [100,200), [200,500), [500,1000), [1000,1500), [1500,+∞), so each learning unit defines 10 states (a discretization sketch follows this step);
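A minimal sketch of this state discretization, mapping a unit's absolute power deviation (in MW) to one of the 10 state indices; the names are illustrative, not from the patent.

```python
import bisect

STATE_EDGES = [5, 10, 20, 50, 100, 200, 500, 1000, 1500]  # right-open bin edges

def state_index(dp_error_i):
    """Return 0..9 for |dp_error_i| in [0,5), [5,10), ..., [1500, +inf)."""
    return bisect.bisect_right(STATE_EDGES, abs(dp_error_i))
```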
3) select the coal-fired unit as the balancing unit, with the LNG unit and the hydropower unit participating in the correlated-equilibrium reinforcement learning; the output-action discrete sets are A_1 = A_2 = {-100, -50, -20, -10, -5, 0, 5, 10, 20, 50, 100} MW, the number of joint-action values is |A| = |A_1| × |A_2| = 11 × 11 = 121, and there are 2 × 11 × (11-1) = 220 correlated-equilibrium constraint equations in total;
4) measure and collect, at the start of each control cycle, the real-time operating data of the controlled regional grid and of all AGC regulation units: Δf, ΔP, ΔP_{Gi}, where Δf is the system frequency deviation, ΔP the tie-line power deviation, and ΔP_{Gi} the actual regulation output of the ith unit. According to the international assessment formulas, compute the instantaneous ACE(k) and CPS(k) of this area:
$$ACE = T_a - T_s - 10B(F_a - F_s),$$
where T_a and T_s are respectively the actual and scheduled tie-line power flows, B is the frequency bias coefficient, and F_a and F_s are respectively the actual and scheduled system frequency values;
$$CPS1 = (2 - CF_1)\times 100\%,\qquad CF_1 = \frac{1}{n}\sum\left[\frac{ACE_{AVE\text{-}min}}{-10B}\cdot\frac{\Delta F_{AVE\text{-}min}}{\varepsilon_1^2}\right],$$
where B is the frequency bias coefficient of this regional grid, ε_1 is the interconnected grid's control-objective value for the annual root mean square of the 1-minute average frequency deviation, and n is the number of minutes in the assessment period;
$$CPS2 = (1 - R)\times 100\%,$$
where R is the proportion of 10-minute periods whose average ACE exceeds the limit $L_{10} = 1.65\,\varepsilon_{10}\sqrt{(-10B)(-10B_{net})}$, ε_{10} is the interconnected grid's control-objective value for the annual root mean square of the 10-minute average frequency deviation, and B_{net} is the frequency bias coefficient of the whole interconnected grid (a computation sketch follows this step);
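A minimal sketch of the two assessment quantities, under the standard ACE and CPS1 definitions given above, assuming B in MW/0.1 Hz and ε_1 in Hz; the function names are illustrative, not from the patent.

```python
import numpy as np

def ace(Ta, Ts, Fa, Fs, B):
    """ACE = (Ta - Ts) - 10 * B * (Fa - Fs), in MW."""
    return (Ta - Ts) - 10.0 * B * (Fa - Fs)

def cps1(ace_avg_min, df_avg_min, B, eps1):
    """CPS1 = (2 - CF1) * 100 %, from arrays of 1-minute averages."""
    cf1 = np.mean((ace_avg_min / (-10.0 * B)) * df_avg_min) / eps1 ** 2
    return (2.0 - cf1) * 100.0
```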
5) the total CPS power allocator determines the current state s from the instantaneous values ACE(k) and CPS(k) of this regional grid and ΔP_{error-i}(k) of each unit, then obtains the immediate reward R_i(k) of each unit from the state s. The reward function is designed as follows (a sketch of this reward follows this step):
$$R_i(k)=\begin{cases}
\eta_i, & \eta_i\ge 0,\; C_{CPS1}(k)\ge 200\\
10\times\big[E_{ACE}(k)-E_{ACE}(k-1)\big]-\Delta P_{error\text{-}i}^2(k), & E_{ACE}(k)\le 0,\; C_{CPS1}(k)\in[100,200)\\
10\times\big[E_{ACE}(k-1)-E_{ACE}(k)\big]-\Delta P_{error\text{-}i}^2(k), & E_{ACE}(k)>0,\; C_{CPS1}(k)\in[100,200)\\
20\times\big[C_{CPS1}(k)-C_{CPS1}(k-1)\big]-2\times\Delta P_{error\text{-}i}^2(k), & E_{ACE}(k)\le 0,\; C_{CPS1}(k)<100\\
20\times\big[C_{CPS1}(k-1)-C_{CPS1}(k)\big]-2\times\Delta P_{error\text{-}i}^2(k), & E_{ACE}(k)>0,\; C_{CPS1}(k)<100
\end{cases}$$
where η_i is the historical maximum reward of unit i, initially 0; E_ACE(k) and C_CPS1(k) are respectively the ACE and CPS1 instantaneous values of the regional grid at the kth iteration step; and ΔP_{error-i} is the difference between the target regulation output ΔP_{order-i} of unit i and its actual regulation output ΔP_{Gi}, i.e. ΔP_{error-i}(k) = ΔP_{order-i}(k-1) - ΔP_{Gi}(k);
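A minimal sketch of this piecewise reward, reading the two conditions of each branch as holding jointly; the argument names are illustrative, not from the patent.

```python
def reward_i(ace_k, ace_prev, cps1_k, cps1_prev, dp_err_i, eta_i):
    """Immediate reward R_i(k) for learning unit i at iteration k."""
    if cps1_k >= 200:
        return eta_i                        # eta_i >= 0: historical best reward
    if 100 <= cps1_k < 200:                 # reward driven by the ACE trend
        trend = ace_k - ace_prev if ace_k <= 0 else ace_prev - ace_k
        return 10.0 * trend - dp_err_i ** 2
    # cps1_k < 100: reward driven by the CPS1 trend, deviation penalty doubled
    trend = cps1_k - cps1_prev if ace_k <= 0 else cps1_prev - cps1_k
    return 20.0 * trend - 2.0 * dp_err_i ** 2
```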
6) from the linear equilibrium constraints
$$\sum_{a_{-i}\in A_{-i}}\pi_s(a)\,Q_i(s,a)\;\ge\;\sum_{a_{-i}\in A_{-i}}\pi_s(a)\,Q_i\big(s,(a_{-i},a_i')\big)$$
(π_s being the joint strategy under state s) and the equilibrium selection function f, solve for the optimal correlated-equilibrium joint strategy π_s*;
7) for every learning unit j, execute:
① update the state value function $V_i^{t+1}(s) = \sum_a \pi_s^t(a)\,Q_i^t(s,a)$;
② estimate the value-function error $\delta_j = (1-0.2)\times R_j(s,\vec a) + 0.2\times V_j(s') - Q_j(s,\vec a)$, the discount factor γ being taken as 0.2 in this example;
③ update the eligibility-trace element $e_j(s,\vec a) = e_j(s,\vec a) + 1$;
④ for all state-action pairs $(s,\vec a)$, execute:
◆ update the Q-value function $Q_j(s,\vec a) = Q_j(s,\vec a) + 0.2\times\delta_j\times e_j(s,\vec a)$, the learning rate α being taken as 0.2;
◆ update the eligibility-trace matrix $e_j(s,\vec a) = 0.2\times 0.4\times e_j(s,\vec a)$, the trace-decay factor λ being taken as 0.4;
⑤ if the current state s and the next state s' are the same state, solve again here for the stochastic-equilibrium optimal joint strategy π_s* from the updated Q values;
⑥ select each unit's cooperative action by the optimal equilibrium joint strategy π_s*;
⑦ s = s';
8) when the next control cycle arrives, return to step 4).
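Tying the earlier sketches together, one AGC control cycle of steps 4) to 8) could look like the following skeleton. The measure and dispatch callbacks stand in for the SCADA/EMS interfaces, all names are illustrative rather than from the patent, and state_index, reward_i, uceq, and ceq_lambda_update refer to the sketches given above; for simplicity the current-state equilibrium is used as the value estimate, in the spirit of sub-step ⑤ when s' equals s.

```python
import numpy as np

m, n_states = 11, 10                                  # actions per agent, states per unit
Q = [np.zeros((n_states, m * m)) for _ in range(2)]   # Q tables: LNG and hydro units
e = [np.zeros((n_states, m * m)) for _ in range(2)]   # matching eligibility traces

def run_cycle(measure, dispatch, prev, a_idx):
    """One AGC control cycle; measure() returns (ace, cps1, per-unit dp_errs)
    and dispatch() sends the chosen joint-action index to the plant side."""
    ace_k, cps1_k, dp_errs = measure()                             # step 4: acquire data
    s = [state_index(d) for d in dp_errs]                          # one state per unit
    pi = uceq(Q[0][s[0]].reshape(m, m), Q[1][s[1]].reshape(m, m))  # step 6: equilibrium
    for j in range(2):                                             # step 7: CEQ(lambda)
        r_j = reward_i(ace_k, prev["ace"], cps1_k, prev["cps1"],
                       dp_errs[j], prev["eta"][j])                 # step 5: reward
        v_next = float(pi.reshape(-1) @ Q[j][s[j]])                # V_j under pi
        ceq_lambda_update(Q[j], e[j], s[j], a_idx, r_j, v_next,
                          alpha=0.2, gamma=0.2, lam=0.4)
    p = np.clip(pi.reshape(-1), 0.0, None)
    p /= p.sum()                                                   # sanitize LP round-off
    a_idx = int(np.random.choice(m * m, p=p))                      # sample joint action
    dispatch(a_idx)                                                # send commands out
    prev.update(ace=ace_k, cps1=cps1_k)                            # step 8: next cycle
    return a_idx
```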
The core of the present invention lies in the selection of the balancing unit, the improvement of the action space, the design of the reward function, the solving of the optimal coordination strategy, and the updating of each unit's Q-value matrix. Among these, the introduction of the balancing unit, the enlargement of the optimized action space, and the linear constraints and equilibrium selection function of the correlated-equilibrium strategy are the key innovations. The implementation of this method and its related techniques keeps the multi-unit power allocation within a regional grid in a state of optimal coordination at all times, with each agent's action depending on the states and actions of all the agents; it improves the ability of cooperative power regulation among the units, effectively reduces the number of frequent adjustments of the various classes of units, is especially suited to dynamic optimal allocation of CPS instructions in coal-power-dominated interconnected grids with complex unit combinations, and effectively improves the adaptability, robustness, and CPS assessment compliance rate of the system.
The application of the CEQ(λ) method to CPS command allocation proposed by the present invention mainly comprises: the selection of the control objective, the design of the reward function, the determination of the balancing unit and the action space, the introduction of eligibility traces, the selection of the equilibrium function, and the solving of the correlated equilibrium. As a method newly proposed by the present invention, CEQ(λ) has not previously been applied to a complex nonlinear system with real-time requirements as stringent as those of an electric power system.
The control method of the present invention can be fully described as follows:
(1) Selection of the control objective: the objective of the allocation-control task admits several choices: minimum area control error (ACE), minimum generation cost, or highest CPS index. It is specifically described as:
$$\min E = \sum_{t=1}^{T} e(t)$$
$$\text{s.t.}\quad \Delta P_{order\text{-}\Sigma}(t) = \sum_{i=1}^{n}\Delta P_{order\text{-}i}(t),$$
$$0 \le \Delta P_{order\text{-}i}(t) - \Delta P_{order\text{-}i}(t-1) \le P_{rate}^{+},$$
$$P_{rate}^{-} \le \Delta P_{order\text{-}i}(t) - \Delta P_{order\text{-}i}(t-1) \le 0,$$
$$\Delta P_{Gi}^{\min} \le \Delta P_{Gi}(t) \le \Delta P_{Gi}^{\max},$$
where t is the discrete time instant; e is the deviation between the control-objective value and the actual control output; E is the accumulated deviation of e over the time period T; ΔP_{order-Σ} is the CPS command value of the AGC system, in MW; ΔP_{order-i} is the regulation command assigned to the ith unit, in MW; P_{rate}^{+} is the ramp-up rate limit of the ith unit, in MW/min; P_{rate}^{-} is the ramp-down rate limit of the ith unit, in MW/min; ΔP_{Gi} is the actual regulation output of the ith unit, in MW; and ΔP_{Gi}^{min}, ΔP_{Gi}^{max} are respectively the lower and upper regulation-capacity limits of the ith unit, in MW.
(2) analyze the ACE(k) and CPS(k) values of this regional grid and the |ΔP_{error-i}| values of the units to determine the discrete state set S;
(3) determine the balancing unit and the action space: generally, units such as hydropower and liquefied natural gas units, which have narrower regulation-capacity bounds but smaller time delay, higher regulation rate, and lower regulation cost, are selected to participate in the equilibrium learning, and a coal-fired unit is generally selected as the balancing unit. In addition, the present invention proposes the joint-action discrete set A = A_1 × A_2 × … × A_i × … × A_{n-1}, where A_i is the output discrete action set of agent i;
(4) measure and collect, at the start of each control cycle, the real-time operating data of the controlled regional grid and of all AGC regulation units, namely Δf, ΔP, and ΔP_{Gi}, and compute the instantaneous values of ACE(k) and CPS(k) of this area, where Δf is the system frequency deviation, ΔP the tie-line power deviation, and ΔP_{Gi} the actual regulation output of the ith unit;
(5) determine the current state s from the instantaneous values ACE(k) and CPS(k) of this regional grid and ΔP_{error-i}(k) of each unit, then obtain the immediate reward R_i(k) of each unit from the state s; R_i(k) is generally designed as a linear combination of the kth-step difference values of ACE and CPS1 of this regional grid and the ΔP_{error-i} values;
(6) from the linear equilibrium constraints
$$\sum_{a_{-i}\in A_{-i}}\pi_s(a)\,Q_i(s,a)\;\ge\;\sum_{a_{-i}\in A_{-i}}\pi_s(a)\,Q_i\big(s,(a_{-i},a_i')\big)$$
(π_s being the joint strategy under state s) and the equilibrium selection function f, solve for the optimal correlated-equilibrium joint strategy π_s*;
(7) for every learning unit j, execute:
① update the state value function $V_i^{t+1}(s) = \sum_a \pi_s^t(a)\,Q_i^t(s,a)$;
② estimate the value-function error $\delta_j = (1-\gamma)\times R_j(s,\vec a) + \gamma\times V_j(s') - Q_j(s,\vec a)$, where γ is the discount factor, 0 ≤ γ ≤ 1;
③ update the eligibility-trace element $e_j(s,\vec a) = e_j(s,\vec a) + 1$;
④ for all state-action pairs $(s,\vec a)$, execute:
◆ update the Q-value function $Q_j(s,\vec a) = Q_j(s,\vec a) + \alpha\times\delta_j\times e_j(s,\vec a)$, where α is the learning rate, 0 ≤ α ≤ 1;
◆ update the eligibility-trace matrix $e_j(s,\vec a) = \gamma\times\lambda\times e_j(s,\vec a)$, where λ is the trace-decay factor, 0 ≤ λ ≤ 1;
⑤ if the current state s and the next state s' are the same state, solve again here for the stochastic-equilibrium optimal joint strategy π_s* from the updated Q values;
⑥ select each unit's cooperative action by the optimal equilibrium joint strategy π_s*;
⑦ s = s';
(8) when the next control cycle arrives, return to step (4).
The above embodiment is a preferred implementation of the present invention, but the implementations of the present invention are not restricted to the above embodiment; any change, modification, substitution, combination, or simplification made without departing from the spirit and principle of the present invention shall be an equivalent replacement and shall be included within the protection scope of the present invention.

Claims (9)

1. An interconnected-grid CPS instruction dynamic allocation optimization method, characterized by comprising the following steps:
Step 1: determine the control objective;
Step 2: determine the discrete state set S;
Step 3: select one class of units as the balancing unit, with the other units participating in CEQ(λ) cooperative learning, and at the same time determine the joint-action discrete set A;
Step 4: at the start of each control cycle, collect the real-time operating data of the controlled regional grid, comprising the frequency deviation Δf, the power deviation ΔP, and the actual regulation output ΔP_{Gi} of each unit; calculate the instantaneous values of the area control error ACE(k) and of the control performance standard CPS(k) for this area;
Step 5: from the current state s, obtain the immediate reward R_i(k) of each unit i;
Step 6: from the linear equilibrium constraints
$$\sum_{a_{-i}\in A_{-i}}\pi_s(a)\,Q_i(s,a)\;\ge\;\sum_{a_{-i}\in A_{-i}}\pi_s(a)\,Q_i\big(s,(a_{-i},a_i')\big)$$
and the equilibrium selection function f, solve for the optimal correlated-equilibrium joint strategy π_s*;
where A_{-i} = ∏_{j≠i} A_j, A_i is the output discrete action set of unit i, s is the current state, a_i is the action of unit i, -i denotes the set of agents other than unit i, π is the equilibrium strategy, and Q_i(s,a) is the state-action value function of unit i;
Step 7: for every learning unit j, update the state-action value function Q_j(s,ā) and the eligibility-trace matrix e_j(s,ā) over all state-action pairs (s,ā), solve again from the updated Q values for the stochastic optimal equilibrium joint strategy under the current state s, select each unit's cooperative action by π_s*, and update the state s and the action a;
Step 8: when the next control cycle arrives, return to step 4;
the iterative update formula of the Q value $Q_j(s,\vec a)$ in step 7 being:
$$Q_j(s,\vec a) = Q_j(s,\vec a) + \alpha\times\delta_j\times e_j(s,\vec a),$$
where $Q_j(s,\vec a)$ is the state-action value function of agent j at the state-action pair $(s,\vec a)$, $\delta_j$ is the learning error, and $e_j(s,\vec a)$ is the eligibility-trace matrix;
$$\delta_j = (1-\gamma)\times R_j(s,\vec a) + \gamma\times V_j(s') - Q_j(s,\vec a),\qquad V_i^{t+1}(s) = \sum_a \pi_s^t(a)\,Q_i^t(s,a),$$
where γ is the discount factor, with 0 ≤ γ ≤ 1; α is the learning rate, with 0 ≤ α ≤ 1; $R_j(s,\vec a)$ is the reward obtained by agent j after executing the action $\vec a$ in the current state s; $V_j(s')$ is the value function of agent j at the next state s'; $Q_i^t(s,a)$ is the state-action value function of unit i at the state-action pair (s,a) at time t; $\pi_s^t(a)$ is the equilibrium strategy; and $V_i^{t+1}(s)$ is the value function of unit i at state s at time t+1.
2. The interconnected-grid CPS instruction dynamic allocation optimization method of claim 1, characterized in that the control objective in step 1 is selected as minimum area control error ACE, minimum generation cost, or highest control performance standard CPS.
3. The interconnected-grid CPS instruction dynamic allocation optimization method of claim 1, characterized in that the discrete state set S in step 2 can specifically be determined by partitioning the ranges of the area control error ACE(k) of the controlled regional grid, the value of the control performance standard CPS(k), and the absolute value |ΔP_{error-i}| of the power deviation of each of its units.
4. The interconnected-grid CPS instruction dynamic allocation optimization method of claim 1, characterized in that the balancing unit in step 3 is a coal-fired unit, and that hydropower and liquefied natural gas units, whose regulation-capacity bounds are narrower but whose time delay is smaller, regulation rate higher, and regulation cost lower, are selected to participate in the equilibrium learning.
5. The interconnected-grid CPS instruction dynamic allocation optimization method of claim 1, characterized in that the expression of the joint-action discrete set A in step 3 is:
A = A_1 × A_2 × … × A_i × … × A_{n-1},
where A_i is the output discrete action set of unit i and n is the number of agents.
6. The interconnected-grid CPS instruction dynamic allocation optimization method of claim 1, characterized in that the real-time operating data in step 4 are collected by the computer and the supervisory control system.
7. The interconnected-grid CPS instruction dynamic allocation optimization method of claim 1, characterized in that R_i(k) in step 5 is designed as a linear combination of the kth-step difference values of ACE and CPS1 of the controlled regional grid and the power deviation ΔP_{error-i} of each unit.
8. The interconnected-grid CPS instruction dynamic allocation optimization method of claim 1, characterized in that the linear constraints of the correlated-equilibrium strategy and the uCEQ equilibrium selection function suited to CPS instruction dynamic allocation optimization are introduced in step 6, so that the coordinated joint action among the agents reaches the optimum, the physical meaning of uCEQ being the maximization of the sum of the rewards of all agents.
9. The interconnected-grid CPS instruction dynamic allocation optimization method of claim 1, characterized in that the iterative update formula of the eligibility-trace matrix $e_j(s,\vec a)$ in step 7 is:
$$e_j(s,\vec a) = \gamma\times\lambda\times e_j(s,\vec a),$$
where $e_j(s,\vec a)$ is the eligibility-trace matrix, γ is the discount factor, with 0 ≤ γ ≤ 1, and λ is the trace-decay factor, with 0 ≤ λ ≤ 1.
CN201310656811.2A 2013-12-05 2013-12-05 Interconnected-grid CPS instruction dynamic allocation optimization method Active CN103683337B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310656811.2A CN103683337B (en) Interconnected-grid CPS instruction dynamic allocation optimization method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310656811.2A CN103683337B (en) Interconnected-grid CPS instruction dynamic allocation optimization method

Publications (2)

Publication Number Publication Date
CN103683337A CN103683337A (en) 2014-03-26
CN103683337B true CN103683337B (en) 2016-01-06

Family

ID=50320008

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310656811.2A Active CN103683337B (en) Interconnected-grid CPS instruction dynamic allocation optimization method

Country Status (1)

Country Link
CN (1) CN103683337B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104037761B (en) * 2014-06-25 2017-01-11 南方电网科学研究院有限责任公司 AGC power multi-objective random optimization distribution method
CN104932267B (en) * 2015-06-04 2017-10-03 曲阜师范大学 A kind of neural network lea rning control method of use eligibility trace
CN107154635B (en) * 2017-05-22 2019-11-05 国电南瑞科技股份有限公司 A kind of AGC frequency regulation capacity calculation method suitable for frequency modulation service market
CN107367929B (en) * 2017-07-19 2021-05-04 北京上格云技术有限公司 Method for updating Q value matrix, storage medium and terminal equipment
CN107605548B (en) * 2017-08-18 2019-08-16 华电电力科学研究院 A kind of control method improving 300MW coal unit ACE response performance
CN107589672A (en) * 2017-09-27 2018-01-16 三峡大学 The intelligent power generation control method of isolated island intelligent power distribution virtual wolf pack control strategy off the net
CN108307510A (en) * 2018-02-28 2018-07-20 北京科技大学 A kind of power distribution method in isomery subzone network
CN108512220A (en) * 2018-03-22 2018-09-07 华南理工大学 A kind of network of ship dynamic reconfiguration method based on artificial intelligence
CN108845492A (en) * 2018-05-23 2018-11-20 上海电力学院 A kind of AGC system Intelligent predictive control method based on CPS evaluation criterion
CN110348681B (en) * 2019-06-04 2022-02-18 国网浙江省电力有限公司衢州供电公司 Power CPS dynamic load distribution method
CN110471297B (en) * 2019-07-30 2020-08-11 清华大学 Multi-agent cooperative control method, system and equipment
CN112186811B (en) * 2020-09-16 2022-03-25 北京交通大学 AGC unit dynamic optimization method based on deep reinforcement learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8543714B2 (en) * 2010-03-05 2013-09-24 Delta Electronics, Inc. Local power management unit and power management system employing the same

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Stochastic optimal control of interconnected power grid AGC based on reinforcement learning algorithms; Yuan Ye; China Master's Theses Full-text Database; 2011-12-31; pp. 1-65 *

Also Published As

Publication number Publication date
CN103683337A (en) 2014-03-26

Similar Documents

Publication Publication Date Title
CN103683337B (en) Interconnected-grid CPS instruction dynamic allocation optimization method
Hossain et al. Energy scheduling of community microgrid with battery cost using particle swarm optimisation
CN112615379B (en) Power grid multi-section power control method based on distributed multi-agent reinforcement learning
CN106682810B (en) Long-term operation method of cross-basin cascade hydropower station group under dynamic production of giant hydropower station
Russell et al. Reservoir operating rules with fuzzy programming
CN102270309B (en) Short-term electric load prediction method based on ensemble learning
CN111242443B (en) Deep reinforcement learning-based economic dispatching method for virtual power plant in energy internet
CN103490413A Intelligent generation control method based on an agent equilibrium algorithm
CN105337310B Economic operation system and method for cascaded photovoltaic-storage multi-microgrids
CN103699941A (en) Method for making annual dispatching operation plan for power system
CN104537428B Economic operation assessment method accounting for the uncertainty of wind power integration
CN116247648A (en) Deep reinforcement learning method for micro-grid energy scheduling under consideration of source load uncertainty
Forouzandehmehr et al. Stochastic dynamic game between hydropower plant and thermal power plant in smart grid networks
CN105654224A (en) Provincial power-grid monthly electricity purchasing risk management method considering wind power uncertainty
CN106505624A Regulation and control system and method for determining the optimal dispatchable capability of distributed generation in a distribution network
CN115207977A (en) Active power distribution network deep reinforcement learning real-time scheduling method and system
CN107026462B (en) Energy storage device control strategy formulating method for the tracking of wind-powered electricity generation unscheduled power
Jin et al. A deep neural network coordination model for electric heating and cooling loads based on IoT data
Zhang et al. Real-time optimal operation of integrated electricity and heat system considering reserve provision of large-scale heat pumps
CN103633641B Medium- and long-term transaction operation plan acquisition method considering wind power accommodation
Jiang et al. Research on short-term optimal scheduling of hydro-wind-solar multi-energy power system based on deep reinforcement learning
Li et al. Generation scheduling in a system with wind power
CN116995682B (en) Adjustable load participation active power flow continuous adjustment method and system
Ding et al. Long-term operation rules of a hydro–wind–photovoltaic hybrid system considering forecast information
CN111799793B (en) Source-grid-load cooperative power transmission network planning method and system

Legal Events

Date Code Title Description
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant