CN106372366A - Intelligent power generation control method based on hill-climbing algorithm - Google Patents


Info

Publication number
CN106372366A
Authority
CN
China
Prior art keywords
value
function
hill
ace
wolf
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610866538.XA
Other languages
Chinese (zh)
Inventor
席磊
陈建峰
杨苹
许志荣
柳浪
李玉丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Three Gorges University CTGU
Original Assignee
China Three Gorges University CTGU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Three Gorges University CTGU filed Critical China Three Gorges University CTGU
Priority to CN201610866538.XA priority Critical patent/CN106372366A/en
Publication of CN106372366A publication Critical patent/CN106372366A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/30Circuit design
    • G06F30/36Circuit design at the analogue level
    • G06F30/367Design verification, e.g. using simulation, simulation program with integrated circuit emphasis [SPICE], direct methods or relaxation methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/004Artificial life, i.e. computing arrangements simulating life
    • G06N3/006Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/086Learning methods using evolutionary algorithms, e.g. genetic algorithms or genetic programming
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0631Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • G06Q10/06312Adjustment or analysis of established resource schedule, e.g. resource or task levelling, or dynamic rescheduling
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/06Electricity, gas or water supply
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02EREDUCTION OF GREENHOUSE GAS [GHG] EMISSIONS, RELATED TO ENERGY GENERATION, TRANSMISSION OR DISTRIBUTION
    • Y02E60/00Enabling technologies; Technologies with a potential or indirect contribution to GHG emissions mitigation

Abstract

The invention discloses an intelligent power generation control method based on a wolf hill-climbing (WoLF-PHC(λ)) algorithm. The method comprises the following steps: determining a state discrete set S; determining a joint-action discrete set A; at the beginning of each control period, collecting the real-time operating data of each power grid, such as the frequency deviation Δf and the power deviation ΔP, and calculating for each area the instantaneous value of the area control error ACEi(k) and of the control performance standard CPSi(k); determining the current state s and acquiring a short-term reward signal Ri(k) of the power grid of area i from the current state and a reward function; calculating and estimating the value-function errors ρk and δk; solving for the optimal target value function and policy; performing the corresponding updates for the power grids j of all areas; and returning to step 3. With the disclosed method, an optimal average strategy can be obtained during the control process, the closed-loop system performance is excellent, and the problem of coordinated automatic generation control in the complex interconnected power system environments introduced by new-energy grid connection can be solved; compared with conventional intelligent algorithms, the disclosed method has stronger learning ability and a faster convergence rate.

Description

An intelligent power generation control method based on the wolf hill-climbing algorithm
Technical field
The present invention relates to intelligent power generation control technology for power systems, and particularly to an intelligent power generation control method based on the wolf hill-climbing algorithm.
Background technology
The modern power grid has developed into an interconnected system of many control areas operating under electricity-market mechanisms. Automatic generation control (AGC) of the interconnected grid is one of the most basic functions of the energy management system and the basic means of ensuring active-power balance and frequency stability in the power system; its control effect directly affects grid power quality. In an interconnected power system, the tie-line power deviation and the frequency both vary with changes in user-side load. Controlling generator output to follow random load changes and thereby improve the grid's frequency quality is a current hot issue in the control research field. Automatic generation control is a closed-loop control system composed of dispatch and monitoring computers, communication channels, remote terminals, execution (dispatch) devices and generating-unit automation devices, and is one of the main components of power system dispatch automation.
At present, against the overall background of vigorous smart-grid development, developing intelligent power generation control with autonomous learning ability and plant-grid coordination ability has gradually become a major trend. In recent years, multi-agent reinforcement learning algorithms have become a major focus of the machine learning field, and the algorithmic framework based on classical Q-learning in particular has been continuously enriched and developed. In this research field, many application examples have demonstrated that, in multi-agent reinforcement learning, each agent can track the decisions of the other agents and dynamically coordinate its own actions. Accordingly, several distributed reinforcement learning methods based on game theory and realized with Q-learning have been proposed in succession, the better known being minimax-Q, Nash-Q and friend-or-foe Q (FF-Q). However, minimax-Q is restricted to zero-sum games, Nash-Q requires large storage, and each FF-Q agent must know whether every other agent is friend or foe, so that FF-Q possesses only individual rationality; these defects limit the application of these algorithms.
Subsequently, a distributed multi-agent learning algorithm based on correlated equilibrium, the DCEQ(λ) algorithm, was proposed to solve coordinated AGC control of interconnected grids, and achieved relatively satisfactory control performance. However, as the number of agents increases, the time DCEQ(λ) spends searching for the multi-agent equilibrium solution grows geometrically, which limits its wider application in larger-scale grid systems. In 2002, Bowling and Veloso developed the "win or learn fast" policy hill-climbing algorithm, in which each agent learns a mixed strategy and preserves only its own Q-value table. On the one hand, this avoids the exploration-exploitation dilemma that generally has to be solved in Q-learning; on the other hand, it can solve the asynchronous decision problem of multi-agent systems. On this basis, a distributed WoLF-PHC(λ) algorithm, i.e. the wolf hill-climbing algorithm, is proposed here. It merges the WoLF-PHC algorithm, eligibility traces and the SARSA algorithm, and is applied to solving the equilibrium in multi-agent intelligent power generation control. Two case studies, on a standard two-area load frequency control power system model and on a model of the China Southern Power Grid, verify the effectiveness of this algorithm. Because the WoLF learning rate adapts to changes in the environment, the wolf hill-climbing algorithm converges faster than other intelligent power generation control methods.
Under the wolf hill-climbing algorithm, each regional agent does not need to exchange information with the other agents, but perceives at every moment the state changes caused by their actions. The control system is a multi-agent system in which the wolf hill-climbing algorithm is embedded in every region; compared with the CEQ algorithm it resembles a single-agent algorithm such as Q-learning, with only one agent per algorithm instance, while the actions of the other agents influence the current state and the next state. This is the so-called joint action of agents, and each agent can change its learning rate at any time as the state changes, which is where wolf hill-climbing is superior to Q-learning. In fact, multi-agent learning algorithms such as the minimax-Q, Nash-Q, friend-or-foe Q and DCEQ cited above inherently belong to games between multiple agents and can be summarized as Nash-equilibrium games. Unlike static game scenarios, however, for a control process belonging to dynamic games, the speed of searching for the Nash equilibrium solution within each control interval does not necessarily satisfy the real-time control requirement. The proposed wolf hill-climbing method replaces the equilibrium-point solution of the multi-agent dynamic game by an average strategy; from the viewpoint of game theory it can therefore be regarded as an efficient, autonomous and independent game, reducing both the real-time information exchange with other agents and the difficulty of solving the joint control strategy. In general, the wolf hill-climbing algorithm can effectively solve stochastic games and the application problems in non-Markovian environments. Moreover, by introducing a suitable win-or-lose criterion for the stochastic dynamic game, together with the variable learning rate and the average strategy, the dynamic performance of wolf hill-climbing can be improved. Based on the standard two-area load frequency control power system model and the Southern Grid model, simulation studies of coordinated intelligent generation control were carried out for the multi-agent algorithm. The simulation results show that, compared with other intelligent algorithms, wolf hill-climbing obtains faster convergence and learning efficiency, and possesses high adaptability and robustness in multi-area, strongly stochastic, interconnected complex grid environments.
Content of the invention
The present invention provides an intelligent power generation control method based on the wolf hill-climbing algorithm, which can obtain an optimal average strategy during the control process; the closed-loop system performance is excellent, and the method can solve coordinated automatic generation control in the complex interconnected power system environments brought about by new-energy plant-grid connection. Compared with existing intelligent algorithms, it has higher learning ability and a fast convergence rate.
The technical solution adopted in the present invention is:
An intelligent power generation control method based on the wolf hill-climbing algorithm comprises the following steps:
Step 1: determine the state discrete set s;
Step 2: determine the joint-action discrete set a;
Step 3: when each control cycle starts, collect the real-time operating data of each grid, the data including the frequency deviation Δf and the power deviation Δp, and calculate for each area the instantaneous value of the area control error ace_i(k) and of the control performance standard cps_i(k);
Step 4: determine the current state s, then obtain from the current state s and the reward function a short-term reward signal r_i(k) of regional grid i;
Step 5: obtain the value function errors ρ_k and δ_k by calculation and estimation;
Step 6: obtain the optimal target value function and policy;
Step 7: for all regional grids j, update the q-function table for all state-action pairs (s, a) and the eligibility trace matrix e_j(s, a); update the mixed strategy u_k(s_k, a_k) under the current state s with the updated q values; then use the mixed strategy u_k(s_k, a_k) to update the value function q_{k+1}(s_k, a_k), the eligibility trace element e(s, a), the variable learning rate φ and the average mixed strategy table;
Step 8: return to step 3.
The state discrete set s of step 1 is determined by dividing the value range of the control performance standards cps1/cps2.
In step 2, the interval actions are determined according to an action-fuzzification rule.
The real-time operating data of step 3 are gathered by the computer and monitoring system.
In step 3, the instantaneous value of the area control error ace_i(k) of region i is calculated as follows:
ace = t_a − t_s − 10b(f_a − f_s),
where t_a is the actual tie-line power flow, t_s the scheduled tie-line power flow, b the frequency bias coefficient, f_a the actual system frequency and f_s the scheduled system frequency;
the instantaneous value of the control performance standard 1, cps1_i(k), of region i is calculated as follows:
cps1 = (2 − cf1) × 100%,
where cf1 = (1/n)·Σ ace_{ave-1min}·Δf_{ave}/(−10·b_i·ε_1²); b_i is the frequency bias coefficient of control area i; ε_1 is the root-mean-square control target of the interconnected grid for annual 1-minute average frequency deviation; n is the number of minutes of the assessment period; ace_{ave-1min} is the 1-minute average of the area control error ace; Δf_{ave} is the 1-minute average of the frequency deviation Δf;
the instantaneous value of the control performance standard 2, cps2_i(k), of region i is calculated as follows:
cps2 = (1 − r) × 100%,
where r is the proportion of 10-minute periods in which |ace_{ave-10min}| exceeds the bound l_10 = 1.65·ε_10·√((−10b_i)(−10b_net)); ε_10 is the root-mean-square control target of the interconnected grid for annual 10-minute average frequency deviation; b_net is the frequency bias coefficient of the whole interconnected grid; ace_{ave-10min} is the 10-minute average of the area control error ace.
The short-term reward signal r_i(k) of step 4 is obtained by the following formula:
r_i(s_{k−1}, s_k, a_{k−1}) =
  σ_i − μ_{1i}·Δp_i(k)², if cps1_i(k) ≥ 200;
  −η_{1i}·[|ace_i(k)| − |ace_i(k−1)|] − μ_{1i}·Δp_i(k)², if 100 ≤ cps1_i(k) < 200;
  −η_{2i}·[|cps1_i(k) − 200| − |cps1_i(k−1) − 200|] − μ_{2i}·Δp_i(k)², if cps1_i(k) < 100;
where r_i(s_{k−1}, s_k, a_{k−1}) is the agent's reward for the state transition from s_{k−1} to s_k under the selected action a_{k−1}, ace_i(k) and cps1_i(k) are the instantaneous values of ace and cps1 at the k-th iteration of regional grid i, and σ_i is the historical maximum reward of region i.
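The three-band reward design can be sketched in Python as follows; the parameter names mirror the symbols above, while the function signature itself is an assumption of this sketch:

```python
def reward(cps1_k, cps1_km1, ace_k, ace_km1, dp_k,
           sigma_i, eta1, eta2, mu1, mu2):
    """Short-term reward r_i(s_{k-1}, s_k, a_{k-1}) for area i,
    following the three cps1 bands given above."""
    if cps1_k >= 200.0:
        return sigma_i - mu1 * dp_k ** 2
    if 100.0 <= cps1_k < 200.0:
        return -eta1 * (abs(ace_k) - abs(ace_km1)) - mu1 * dp_k ** 2
    return (-eta2 * (abs(cps1_k - 200.0) - abs(cps1_km1 - 200.0))
            - mu2 * dp_k ** 2)
```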
The value function errors ρ_k and δ_k of step 5 are obtained from the formulas:
ρ_k = r(s_k, s_{k+1}, a_k) + γ·q_k(s_{k+1}, a_g) − q_k(s_k, a_k)
and
δ_k = r(s_k, s_{k+1}, a_k) + γ·q_k(s_{k+1}, a_g) − q_k(s_k, a_k),
where r(s_k, s_{k+1}, a_k) is the agent's reward for the transition from s_k to s_{k+1} under the selected action a_k, γ is the discount factor with 0 < γ < 1, and a_g is the greedy action policy.
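A minimal sketch of these two errors, assuming a dict-based q table keyed by (state, action); note that, exactly as in the text above, both errors use the greedy action a_g and therefore coincide:

```python
def value_errors(r_k, q, s_k, a_k, s_k1, actions, gamma=0.9):
    """Compute rho_k and delta_k as defined above; q is assumed to be
    a dict keyed by (state, action)."""
    a_g = max(actions, key=lambda a: q[(s_k1, a)])  # greedy action at s_{k+1}
    target = r_k + gamma * q[(s_k1, a_g)]
    rho_k = target - q[(s_k, a_k)]
    delta_k = target - q[(s_k, a_k)]
    return rho_k, delta_k
```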
In step 6, the optimal target value function v^{π*}(s) and the optimal policy π*(s) are
v^{π*}(s) = max_{a∈A} q(s, a),
π*(s) = arg max_{a∈A} q(s, a),
where A denotes the action set.
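The two expressions above translate directly into code; this small Python sketch (with assumed names and dict layout) returns both the optimal value and the greedy policy at a state:

```python
def optimal_value_and_policy(q, s, actions):
    """v*(s) = max_a q(s, a) and pi*(s) = argmax_a q(s, a)."""
    v_star = max(q[(s, a)] for a in actions)
    pi_star = max(actions, key=lambda a: q[(s, a)])
    return v_star, pi_star
```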
In step 7, the eligibility trace matrix is updated by the formula:
e_{k+1}(s, a) ← γλ·e_k(s, a),
and the q function table by the formula:
q_{k+1}(s, a) = q_k(s, a) + α·δ_k·e_k(s, a),
where e_k(s, a) is the eligibility trace of the k-th iteration under state s and action a, γ is the discount factor with 0 < γ < 1, λ is the trace-decay factor with 0 < λ < 1, and α is the q learning rate with 0 < α < 1.
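These two table-wide updates can be combined in one backward sweep, as in this hedged Python sketch; the default parameter values and the per-pair ordering (TD correction with the current trace, then decay) are assumptions of the sketch:

```python
def sweep_q_and_traces(q, e, delta_k, states, actions,
                       alpha=0.1, gamma=0.9, lam=0.9):
    """Backward sweep over all (s, a): apply q <- q + alpha*delta_k*e
    using the current trace, then decay e <- gamma*lam*e."""
    for s in states:
        for a in actions:
            q[(s, a)] += alpha * delta_k * e[(s, a)]
            e[(s, a)] *= gamma * lam
```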
The mixed strategy u_k(s_k, a_k) in step 7 is updated according to the standard WoLF-PHC hill-climbing rule:
u_{k+1}(s_k, a) ← u_k(s_k, a) + Δ_{s_k a},
with Δ_{s_k a} = −δ_{s_k a} if a ≠ arg max_{a'} q_k(s_k, a'), and Δ_{s_k a} = Σ_{a'≠a} δ_{s_k a'} otherwise, where δ_{s_k a} = min(u_k(s_k, a), φ_i/(|A|−1)) and φ_i is the variable learning rate.
In step 7, the value function q_{k+1}(s_k, a_k) is updated according to the formula:
q_{k+1}(s_k, a_k) = q_{k+1}(s_k, a_k) + α·ρ_k;
the eligibility trace element is updated according to the formula:
e_{k+1}(s, a) = γλ·e_k(s, a) + 1 if (s, a) = (s_k, a_k), and e_{k+1}(s, a) = γλ·e_k(s, a) otherwise,
i.e. e(s_k, a_k) ← e(s_k, a_k) + 1; the variable learning rate is selected by the WoLF criterion:
φ = φ_win if Σ_a u(s_k, a)·q(s_k, a) > Σ_a ū(s_k, a)·q(s_k, a), and φ = φ_lose otherwise;
and the average mixed strategy table is updated according to the formula:
ū(s_k, a_i) ← ū(s_k, a_i) + (u(s_k, a_i) − ū(s_k, a_i))/visit(s_k), ∀ a_i ∈ A,
where φ_win and φ_lose are two learning parameters representing the agent's winning and losing, and visit(s_k) is the number of times state s_k has been visited from the initial state to the current state.
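A sketch of the per-step WoLF bookkeeping of step 7 in Python; the φ values, the function name and the dict containers are assumptions of this sketch, while the win/lose test and the running-average update follow the formulas above:

```python
def wolf_bookkeeping(q, e, u, u_bar, visit, s_k, a_k, rho_k, actions,
                     alpha=0.1, phi_win=0.05, phi_lose=0.2):
    """Value correction by alpha*rho_k, trace increment for the visited
    pair, win/lose learning-rate choice, and average-strategy update."""
    q[(s_k, a_k)] += alpha * rho_k
    e[(s_k, a_k)] += 1.0
    # "Winning" when the current mixed strategy outscores the average one.
    cur = sum(u[(s_k, a)] * q[(s_k, a)] for a in actions)
    avg = sum(u_bar[(s_k, a)] * q[(s_k, a)] for a in actions)
    phi = phi_win if cur > avg else phi_lose
    # Average mixed-strategy table: incremental mean over visit counts.
    visit[s_k] = visit.get(s_k, 0) + 1
    for a in actions:
        u_bar[(s_k, a)] += (u[(s_k, a)] - u_bar[(s_k, a)]) / visit[s_k]
    return phi
```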
Compared with the prior art, the intelligent power generation control method based on the wolf hill-climbing algorithm of the present invention has the following advantages and effects:
1. In the design of the inventive method, the agent can change its learning rate at any time as the state changes, which improves the dynamic performance of the system and gives it a faster convergence rate.
2. The inventive method replaces the equilibrium-point solution of the multi-agent dynamic game with an average strategy, reducing both the real-time information exchange with the other agents and the difficulty of solving the joint control strategy.
3. The inventive method is based on the average strategy and mixed strategies, so it has high adaptability in non-Markovian environments and long-time-delay systems, and can solve coordinated automatic generation control in the complex interconnected power system environments brought about by new-energy plant-grid connection.
Brief description of the drawings
Fig. 1 is the AGC multi-agent control framework.
Fig. 2 is the load frequency control model diagram of the China Southern Power Grid.
Specific embodiment
An intelligent power generation control method based on the wolf hill-climbing algorithm: the framework of this method is formed by three classes of agents, namely measurement agents, a centralized-control agent and decentralized-control agents, and this control framework uses the wolf hill-climbing algorithm to realize centralized and decentralized AGC control respectively. The wolf hill-climbing algorithm is a new multi-agent algorithm with multi-step backtracking and a variable learning rate, proposed to solve coordinated automatic generation control in complex interconnected power system environments. On the basis of WoLF-PHC, this algorithm merges SARSA(λ) and eligibility traces, and can effectively solve stochastic games and the application problems in non-Markovian environments. Compared with multi-agent learning algorithms such as Q-learning, Q(λ)-learning and DCEQ(λ), the wolf hill-climbing algorithm has a faster convergence rate and learning efficiency, and possesses high adaptability and robustness in multi-area, strongly stochastic, interconnected complex grid environments.
The data input of each measurement agent is the tie-line power deviation and the frequency deviation of its region, and its outputs are the region's control error and rolling cps value. The ace and cps values of each region are then transferred to the centralized AGC controller. If the data of every region are complete and the centralized AGC controller works normally, the output is the action value of each region, and the method used is cwolf-phc(λ) (centralized WoLF-PHC(λ)); otherwise, the centralized controller transmits all gathered data to the decentralized AGC controllers of the regions. If the data are complete, each decentralized AGC controller distributes the action it computes independently of the others; if the data are incomplete, each decentralized controller calls in the most recent complete region-wide normal data, recalculates the action value and distributes the action again, the method used being dwolf-phc(λ) (decentralized WoLF-PHC(λ)); a minimal sketch of this fallback logic follows below. The whole interconnected grid has one and only one centralized AGC controller, while every regional grid has one measurement agent and one decentralized AGC controller.
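This hypothetical Python sketch of the centralized/decentralized fallback uses the mode labels above; the data layout and the completeness test are assumptions, not the patent's specified protocol:

```python
def choose_agc_mode(regional_data, central_ok):
    """Run cwolf-phc(lambda) when the centralized controller is healthy
    and every region reported complete (ace, cps) data; otherwise fall
    back to the decentralized dwolf-phc(lambda) controllers."""
    complete = all(d.get("ace") is not None and d.get("cps") is not None
                   for d in regional_data.values())
    return "cwolf-phc(lambda)" if central_ok and complete else "dwolf-phc(lambda)"

# Example: one region reported, central controller healthy.
mode = choose_agc_mode({"guangdong": {"ace": 1.2, "cps": 180.0}}, True)
```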
The control decision process of the method cwolf-phc(λ) of the present invention is divided into three phases (a schematic sketch follows this list):
1) use the wolf hill-climbing algorithm to update the q values of all agents' state-action pairs;
2) derive the optimal average strategy;
3) execute the optimal average strategy, observe the system response, and return the reward value and the current state.
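A schematic Python sketch of these three phases; the agent interface and the `execute` callback are placeholders assumed for this sketch, not the patent's API:

```python
def control_cycle(agents, state, execute):
    """One cwolf-phc(lambda) decision cycle in three phases."""
    for agent in agents:
        agent.update_q(state)                  # 1) update q for every (s, a)
    joint_action = {agent.name: agent.average_strategy_action(state)
                    for agent in agents}       # 2) optimal average strategy
    reward_value, next_state = execute(joint_action)  # 3) act, observe response
    return reward_value, next_state
```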
An intelligent power generation control method based on the wolf hill-climbing algorithm comprises the following steps:
Step 1: determine the state discrete set s;
Step 2: determine the joint-action discrete set a;
Step 3: when each control cycle starts, collect the real-time operating data of each grid, the data including the frequency deviation Δf and the power deviation Δp, and calculate for each area the instantaneous value of the area control error ace_i(k) and of the control performance standard cps_i(k);
Step 4: determine the current state s, then obtain from the current state s and the reward function a short-term reward signal r_i(k) of regional grid i;
Step 5: obtain the value function errors ρ_k and δ_k by calculation and estimation;
Step 6: obtain the optimal target value function and policy;
Step 7: for all regional grids j, update the q-function table for all state-action pairs (s, a) and the eligibility trace matrix e_j(s, a); update the mixed strategy u_k(s_k, a_k) under the current state s with the updated q values; then use the mixed strategy u_k(s_k, a_k) to update the value function q_{k+1}(s_k, a_k), the eligibility trace element e(s, a), the variable learning rate φ and the average mixed strategy table;
Step 8: return to step 3.
The state discrete set s of step 1 is determined by dividing the value range of the control performance standards cps1/cps2.
In step 2, the interval actions are determined according to an action-fuzzification rule.
The real-time operating data of step 3 are gathered by the computer and monitoring system.
In step 3, the instantaneous value of the area control error ace_i(k) of region i is calculated as follows:
ace = t_a − t_s − 10b(f_a − f_s),
where t_a is the actual tie-line power flow, t_s the scheduled tie-line power flow, b the frequency bias coefficient, f_a the actual system frequency and f_s the scheduled system frequency;
the instantaneous value of the control performance standard 1, cps1_i(k), of region i is calculated as follows:
cps1 = (2 − cf1) × 100%,
where cf1 = (1/n)·Σ ace_{ave-1min}·Δf_{ave}/(−10·b_i·ε_1²); b_i is the frequency bias coefficient of control area i; ε_1 is the root-mean-square control target of the interconnected grid for annual 1-minute average frequency deviation; n is the number of minutes of the assessment period; ace_{ave-1min} is the 1-minute average of the area control error ace; Δf_{ave} is the 1-minute average of the frequency deviation Δf;
the instantaneous value of the control performance standard 2, cps2_i(k), of region i is calculated as follows:
cps2 = (1 − r) × 100%,
where r is the proportion of 10-minute periods in which |ace_{ave-10min}| exceeds the bound l_10 = 1.65·ε_10·√((−10b_i)(−10b_net)); ε_10 is the root-mean-square control target of the interconnected grid for annual 10-minute average frequency deviation; b_net is the frequency bias coefficient of the whole interconnected grid; ace_{ave-10min} is the 10-minute average of the area control error ace.
The short-term reward signal r_i(k) of step 4 is obtained by the following formula:
r_i(s_{k−1}, s_k, a_{k−1}) =
  σ_i − μ_{1i}·Δp_i(k)², if cps1_i(k) ≥ 200;
  −η_{1i}·[|ace_i(k)| − |ace_i(k−1)|] − μ_{1i}·Δp_i(k)², if 100 ≤ cps1_i(k) < 200;
  −η_{2i}·[|cps1_i(k) − 200| − |cps1_i(k−1) − 200|] − μ_{2i}·Δp_i(k)², if cps1_i(k) < 100;
where r_i(s_{k−1}, s_k, a_{k−1}) is the agent's reward for the state transition from s_{k−1} to s_k under the selected action a_{k−1}, ace_i(k) and cps1_i(k) are the instantaneous values of ace and cps1 at the k-th iteration of regional grid i, and σ_i is the historical maximum reward of region i.
The value function errors ρ_k and δ_k of step 5 are obtained from the formulas:
ρ_k = r(s_k, s_{k+1}, a_k) + γ·q_k(s_{k+1}, a_g) − q_k(s_k, a_k)
and
δ_k = r(s_k, s_{k+1}, a_k) + γ·q_k(s_{k+1}, a_g) − q_k(s_k, a_k),
where r(s_k, s_{k+1}, a_k) is the agent's reward for the transition from s_k to s_{k+1} under the selected action a_k, γ is the discount factor with 0 < γ < 1, and a_g is the greedy action policy.
In step 6, the optimal target value function v^{π*}(s) and the optimal policy π*(s) are
v^{π*}(s) = max_{a∈A} q(s, a),
π*(s) = arg max_{a∈A} q(s, a),
where A denotes the action set.
In step 7, the eligibility trace matrix is updated by the formula:
e_{k+1}(s, a) ← γλ·e_k(s, a),
and the q function table by the formula:
q_{k+1}(s, a) = q_k(s, a) + α·δ_k·e_k(s, a),
where e_k(s, a) is the eligibility trace of the k-th iteration under state s and action a, γ is the discount factor with 0 < γ < 1, λ is the trace-decay factor with 0 < λ < 1, and α is the q learning rate with 0 < α < 1.
The mixed strategy u_k(s_k, a_k) in step 7 is updated according to the standard WoLF-PHC hill-climbing rule:
u_{k+1}(s_k, a) ← u_k(s_k, a) + Δ_{s_k a},
with Δ_{s_k a} = −δ_{s_k a} if a ≠ arg max_{a'} q_k(s_k, a'), and Δ_{s_k a} = Σ_{a'≠a} δ_{s_k a'} otherwise, where δ_{s_k a} = min(u_k(s_k, a), φ_i/(|A|−1)) and φ_i is the variable learning rate.
In step 7, the value function q_{k+1}(s_k, a_k) is updated according to the formula:
q_{k+1}(s_k, a_k) = q_{k+1}(s_k, a_k) + α·ρ_k;
the eligibility trace element is updated according to the formula:
e_{k+1}(s, a) = γλ·e_k(s, a) + 1 if (s, a) = (s_k, a_k), and e_{k+1}(s, a) = γλ·e_k(s, a) otherwise,
i.e. e(s_k, a_k) ← e(s_k, a_k) + 1; the variable learning rate is selected by the WoLF criterion:
φ = φ_win if Σ_a u(s_k, a)·q(s_k, a) > Σ_a ū(s_k, a)·q(s_k, a), and φ = φ_lose otherwise;
and the average mixed strategy table is updated according to the formula:
ū(s_k, a_i) ← ū(s_k, a_i) + (u(s_k, a_i) − ū(s_k, a_i))/visit(s_k), ∀ a_i ∈ A,
where φ_win and φ_lose are two learning parameters representing the agent's winning and losing, and visit(s_k) is the number of times state s_k has been visited from the initial state to the current state.
The operation principle of the present invention:
The present invention is an intelligent power generation control method based on the wolf hill-climbing algorithm, and its main working process is as follows: at the start of each control cycle, collect the real-time operating data of the regional grids to be controlled; based on the reward-function design and the current state, obtain the reward signal; obtain the optimal target value function and policy; update the q values, eligibility traces, variable learning rate and mixed strategy of all control-area grids; and obtain the latest action. The present invention can obtain the optimal average strategy during the control process, the closed-loop system performance is excellent, it can solve coordinated automatic generation control in the complex interconnected power system environments brought about by new-energy plant-grid connection, and it has higher learning ability and faster convergence than existing algorithms. The whole control method does not need a mathematical model of the external environment: the system control performance index can be converted into an evaluation index, so that the system is rewarded when its performance meets the requirement and penalized otherwise. Through its own learning, the controller obtains the optimal control action, making the method highly suitable for strongly stochastic multi-area interconnected-grid intelligent power generation systems. The relevant theory of the present invention includes:
1. The WoLF principle:
Scholars have studied in depth the application of the heuristic WoLF ("win or learn fast") principle to problems with opponents: learn faster when losing, and more slowly when winning. A player is winning if it prefers its current strategy to the average strategy played against the other agents' current strategies, or if its current expected reward is larger than the equilibrium value of the game. However, the WoLF principle places strict requirements on the knowledge the player needs, which limits its universality.
2. PHC:
The policy hill-climbing (PHC) algorithm was proposed as an extension of the WoLF principle to give it more universality. Following the hill-climbing algorithm, Q-learning can obtain a mixed strategy while preserving its q values. Because PHC possesses rationality and convergence, it obtains the optimal solution when the other agents choose fixed strategies. The literature has verified that, with a suitable exploration strategy, the q values converge to the optimal value q*, and that an optimal solution is obtained from q* through the greedy strategy u. Although the method is rational and can obtain a mixed strategy, its convergence is not conspicuous.
3. WoLF-PHC:
Bowling and Veloso proposed the WoLF-PHC algorithm with the variable learning rate φ in 2002; it satisfies rationality and convergence at the same time. The two learning parameters φ_lose and φ_win indicate the agent's losing and winning. WoLF-PHC is based on fictitious play: it replaces the unknown equilibrium policy by an average greedy policy of approximate equilibrium.
For a given agent with mixed strategy set u_k(s_k, a_k), which transitions from state s_k to s_{k+1} and executes the exploratory action a_k with reward function r, the q function is updated according to the formulas q_{k+1}(s, a) = q_k(s, a) + α·δ_k·e_k(s, a) and q_{k+1}(s_k, a_k) = q_{k+1}(s_k, a_k) + α·ρ_k, while u(s_k, a_k) is updated by the hill-climbing rule of step 7,
where φ_i is the variable learning rate and φ_lose > φ_win. If the average mixed strategy value is lower than the current strategy value, the agent is winning and φ_win is selected; otherwise φ_lose is selected. The selection rule is the WoLF criterion:
φ = φ_win if Σ_a u(s_k, a)·q(s_k, a) > Σ_a ū(s_k, a)·q(s_k, a), and φ = φ_lose otherwise,
where ū is the average mixed strategy.
After executing action a_k, the mixed strategy table of all actions under state s_k is updated by
ū(s_k, a_i) ← ū(s_k, a_i) + (u(s_k, a_i) − ū(s_k, a_i))/visit(s_k), ∀ a_i ∈ A,
where visit(s_k) is the number of times s_k has been visited from the initial state to the current state.
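As a concrete illustration of the PHC hill-climbing step described above, the following Python sketch shifts probability mass toward the greedy action at rate φ; the clip-and-renormalise treatment of the distribution is an assumption of this sketch, not a verbatim transcription of the patent's formula:

```python
def phc_update(u, q, s_k, actions, phi):
    """One PHC step on the mixed strategy u(s_k, .)."""
    a_star = max(actions, key=lambda a: q[(s_k, a)])  # greedy action
    for a in actions:
        if a == a_star:
            u[(s_k, a)] = min(1.0, u[(s_k, a)] + phi)
        else:
            u[(s_k, a)] = max(0.0, u[(s_k, a)] - phi / (len(actions) - 1))
    total = sum(u[(s_k, a)] for a in actions)  # renormalise after clipping
    for a in actions:
        u[(s_k, a)] /= total
```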
Embodiment:
The present embodiment is set under the overall framework of the China Southern Power Grid, with the Guangdong grid as the main study object. The simulation model is the detailed full dynamic simulation model built for the practical engineering project of the Guangdong Electric Power Dispatching Center; for the detailed model parameters and the simulation design principles, refer to Xi Lei, Yu Tao and Zhang Xiaoshun, "Intelligent power generation control of power systems based on the fast wolf hill-climbing multi-agent learning strategy" (Transactions of China Electrotechnical Society). In this simulation model, the Southern Grid is divided into the four regional grids of Guangdong, Guangxi, Yunnan and Guizhou, and Simulink simulation studies are carried out both with nominal parameters and with a 10% white-noise parameter perturbation added, to evaluate the model performance. The variable learning rate is designed so as to obtain coordinated intelligent generation control, with the multi-agent intelligent generation control providing the average strategy values.
The intelligent power generation control design based on the wolf hill-climbing algorithm is as follows:
1): analyze the system behavior to discretize the state set s. This example divides the states according to the cps assessment criterion of the Guangdong power grid dispatching center: the cps1/cps2 value is divided into the 6 states (−∞, 0), [0, 100%), [100%, 150%), [150%, 180%), [180%, 200%) and [200%, +∞), and ace is divided into 2 states (positive and negative), so that each agent can determine 12 states; a discretization sketch is given after this list. The ace states serve mainly to distinguish the causes of cps index fluctuation;
2): determine the joint-action discrete set a using fuzzification of the action interval; the interval fuzzification has 49 action rules in total, and each rule has 7 discrete actions;
3): when each control cycle starts, collect the real-time operating data of each regional grid: Δf and Δp, where Δf denotes the system frequency deviation and Δp the tie-line power deviation. According to the international assessment method, ace = t_a − t_s − 10b(f_a − f_s) (t_a is the actual tie-line power flow, t_s the scheduled tie-line power flow, b the frequency bias coefficient, f_a the actual system frequency, f_s the scheduled system frequency); cps1 = (2 − cf1) × 100% with cf1 = (1/n)·Σ ace_{ave-1min}·Δf_{ave}/(−10·b_i·ε_1²) (b_i is the frequency bias coefficient of control area i; ε_1 is the root-mean-square control target of the interconnected grid for annual 1-minute average frequency deviation; n is the number of minutes of the assessment period; ace_{ave-1min} is the 1-minute average of ace; Δf_{ave} is the 1-minute average of Δf); and cps2 = (1 − r) × 100% (ε_10 is the root-mean-square control target of the interconnected grid for annual 10-minute average frequency deviation; b_net is the frequency bias coefficient of the whole interconnected grid; ace_{ave-10min} is the 10-minute average of ace). From these, calculate the instantaneous values of ace_i(k) and cps_i(k);
4): determine the current state s according to the instantaneous values ace_i(k) and cps_i(k) of each region, then obtain from the state s and the reward function a short-term reward signal r_i(k) of the regional grid; the reward function is designed as follows:
r_i(s_{k−1}, s_k, a_{k−1}) =
  σ_i − μ_{1i}·Δp_i(k)², if cps1_i(k) ≥ 200;
  −η_{1i}·[|ace_i(k)| − |ace_i(k−1)|] − μ_{1i}·Δp_i(k)², if 100 ≤ cps1_i(k) < 200;
  −η_{2i}·[|cps1_i(k) − 200| − |cps1_i(k−1) − 200|] − μ_{2i}·Δp_i(k)², if cps1_i(k) < 100;
where r_i(s_{k−1}, s_k, a_{k−1}) is the agent's reward for the state transition from s_{k−1} to s_k under the selected action a_{k−1}, ace_i(k) and cps1_i(k) are the instantaneous values of ace and cps1 at the k-th iteration of regional grid i, and σ_i is the historical maximum reward of region i;
5): for all regional grids, calculate the value function error
ρ_k = r(s_k, s_{k+1}, a_k) + 0.9·q_k(s_{k+1}, a_g) − q_k(s_k, a_k)
and the estimated value function error
δ_k = r(s_k, s_{k+1}, a_k) + 0.9·q_k(s_{k+1}, a_g) − q_k(s_k, a_k)
(γ is the discount factor, taken as 0.9; a_g is the greedy action policy);
6): for all regional grids, determine the optimal target value function v^{π*}(s) = max_{a∈A} q(s, a) and the optimal policy π*(s) = arg max_{a∈A} q(s, a) (A is the action set);
7): for all regional grids, update the eligibility trace matrix by e_{k+1}(s, a) ← 0.9 × 0.9 × e_k(s, a); update the q function table by q_{k+1}(s, a) = q_k(s, a) + 0.1·δ_k·e_k(s, a); update the mixed strategy by the two hill-climbing formulas of step 7; update the value function by q_{k+1}(s_k, a_k) = q_{k+1}(s_k, a_k) + 0.1·ρ_k; update the eligibility trace element by
e_{k+1}(s, a) = 0.9 × 0.9 × e_k(s, a) + 1 if (s, a) = (s_k, a_k), and e_{k+1}(s, a) = 0.9 × 0.9 × e_k(s, a) otherwise,
i.e. e(s_k, a_k) ← e(s_k, a_k) + 1; update the variable learning rate φ; and update the average mixed strategy table by
ū(s_k, a_i) ← ū(s_k, a_i) + (u(s_k, a_i) − ū(s_k, a_i))/visit(s_k), ∀ a_i ∈ A;
8): when the next control cycle arrives, return to step 3. A consolidated sketch of one control cycle with this embodiment's parameter values follows below.
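For illustration, a Python sketch of the state discretization of step 1) and of one control-cycle update with the embodiment's parameters (γ = 0.9, λ = 0.9, α = 0.1); the function names and the dict-of-(s, a) containers are assumptions of this sketch:

```python
GAMMA, LAM, ALPHA = 0.9, 0.9, 0.1   # parameter values used in this embodiment

def discretize_state(cps1, ace):
    """Map (cps1, sign of ace) onto the 12 states of step 1):
    six cps1 bands times two ace signs."""
    bands = [0.0, 100.0, 150.0, 180.0, 200.0]   # percent thresholds
    band = sum(cps1 >= b for b in bands)         # band index 0..5
    return band * 2 + (1 if ace >= 0 else 0)     # state index 0..11

def embodiment_step(q, e, r_k, s_k, a_k, s_k1, states, actions):
    """One control-cycle update with the embodiment's parameters."""
    a_g = max(actions, key=lambda a: q[(s_k1, a)])        # greedy action
    rho = r_k + GAMMA * q[(s_k1, a_g)] - q[(s_k, a_k)]    # equals delta here
    e[(s_k, a_k)] += 1.0                                  # trace increment
    for s in states:
        for a in actions:
            q[(s, a)] += ALPHA * rho * e[(s, a)]          # q-table sweep
            e[(s, a)] *= GAMMA * LAM                      # 0.9 x 0.9 decay
    q[(s_k, a_k)] += ALPHA * rho                          # + alpha * rho_k
```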
The core of the present invention lies in the choice of the reward function, the fuzzification of the action interval, and the parameter design. Merging SARSA(λ) and eligibility traces on the basis of WoLF-PHC is the key innovation of this patent. Implementation of this method or related techniques effectively solves stochastic games and the application problems in non-Markovian environments, yielding faster convergence and learning efficiency, and in multi-area, strongly stochastic, interconnected complex grid environments it has high adaptability and robustness, meeting the need for coordinated optimal generation control among multi-region grids.
The control method of the present invention can be fully described as follows:
1): the state discrete set s is determined by dividing the value range of the control performance standards cps1/cps2;
2): determine the joint-action discrete set a according to the action-fuzzification rule;
3): when each control cycle starts, collect the real-time operating data of each grid: the frequency deviation Δf and the power deviation Δp, and calculate the instantaneous values of ace_i(k) and cps_i(k) for each region;
4): determine the current state s, then obtain from the current state s and the reward function a short-term reward signal r_i(k) of regional grid i;
5): obtain the value function errors ρ_k and δ_k from
ρ_k = r(s_k, s_{k+1}, a_k) + γ·q_k(s_{k+1}, a_g) − q_k(s_k, a_k)
and
δ_k = r(s_k, s_{k+1}, a_k) + γ·q_k(s_{k+1}, a_g) − q_k(s_k, a_k);
6): obtain the optimal target value function v^{π*}(s) = max_{a∈A} q(s, a) and the optimal policy π*(s) = arg max_{a∈A} q(s, a);
7): for all regional grids, execute:
update the eligibility trace matrix by e_{k+1}(s, a) ← γλ·e_k(s, a);
update the q function table by q_{k+1}(s, a) = q_k(s, a) + α·δ_k·e_k(s, a);
update the mixed strategy u_k(s_k, a_k);
update the value function by q_{k+1}(s_k, a_k) = q_{k+1}(s_k, a_k) + α·ρ_k;
update the eligibility trace element by e_{k+1}(s, a) = γλ·e_k(s, a) + 1 if (s, a) = (s_k, a_k), and e_{k+1}(s, a) = γλ·e_k(s, a) otherwise, i.e. e(s_k, a_k) ← e(s_k, a_k) + 1;
update the variable learning rate φ;
update the average mixed strategy table;
8): when the next control cycle arrives, return to step 3.
The above embodiment is a preferred implementation of the present invention, but the implementations of the present invention are not limited by the above embodiment; any change, modification, substitution, combination or simplification made without departing from the spirit and principle of the present invention shall be an equivalent replacement and shall be included within the protection scope of the present invention.

Claims (10)

1. An intelligent power generation control method based on the wolf hill-climbing algorithm, characterized in that it comprises the following steps:
Step 1: determine the state discrete set s;
Step 2: determine the joint-action discrete set a;
Step 3: when each control cycle starts, collect the real-time operating data of each grid, the data including the frequency deviation Δf and the power deviation Δp, and calculate for each area the instantaneous value of the area control error ace_i(k) and of the control performance standard cps_i(k);
Step 4: determine the current state s, then obtain from the current state s and the reward function a short-term reward signal r_i(k) of regional grid i;
Step 5: obtain the value function errors ρ_k and δ_k by calculation and estimation;
Step 6: obtain the optimal target value function and policy;
Step 7: for all regional grids j, update the q-function table for all state-action pairs (s, a) and the eligibility trace matrix e_j(s, a); update the mixed strategy u_k(s_k, a_k) under the current state s with the updated q values; then use the mixed strategy u_k(s_k, a_k) to update the value function q_{k+1}(s_k, a_k), the eligibility trace element e(s, a), the variable learning rate φ and the average mixed strategy table;
Step 8: return to step 3.
2. The intelligent power generation control method based on the wolf hill-climbing algorithm according to claim 1, characterized in that the state discrete set s of step 1 is determined by dividing the value range of the control performance standards cps1/cps2.
3. The intelligent power generation control method based on the wolf hill-climbing algorithm according to claim 1, characterized in that in step 2 the interval actions are determined according to an action-fuzzification rule.
4. The intelligent power generation control method based on the wolf hill-climbing algorithm according to claim 1, characterized in that the real-time operating data of step 3 are gathered by the computer and monitoring system.
5. The intelligent power generation control method based on the wolf hill-climbing algorithm according to claim 1, characterized in that in step 3 the instantaneous value of the area control error ace_i(k) of region i is calculated as follows:
ace = t_a − t_s − 10b(f_a − f_s),
where t_a is the actual tie-line power flow, t_s the scheduled tie-line power flow, b the frequency bias coefficient, f_a the actual system frequency and f_s the scheduled system frequency;
the instantaneous value of the control performance standard 1, cps1_i(k), of region i is calculated as follows:
cps1 = (2 − cf1) × 100%,
where cf1 = (1/n)·Σ ace_{ave-1min}·Δf_{ave}/(−10·b_i·ε_1²); b_i is the frequency bias coefficient of control area i; ε_1 is the root-mean-square control target of the interconnected grid for annual 1-minute average frequency deviation; n is the number of minutes of the assessment period; ace_{ave-1min} is the 1-minute average of the area control error ace; Δf_{ave} is the 1-minute average of the frequency deviation Δf;
the instantaneous value of the control performance standard 2, cps2_i(k), of region i is calculated as follows:
cps2 = (1 − r) × 100%,
where r is the proportion of 10-minute periods in which |ace_{ave-10min}| exceeds the bound l_10 = 1.65·ε_10·√((−10b_i)(−10b_net)); ε_10 is the root-mean-square control target of the interconnected grid for annual 10-minute average frequency deviation; b_net is the frequency bias coefficient of the whole interconnected grid; ace_{ave-10min} is the 10-minute average of the area control error ace.
6. The intelligent power generation control method based on the wolf hill-climbing algorithm according to claim 1, characterized in that the short-term reward signal r_i(k) of step 4 is obtained by the following formula:
r_i(s_{k−1}, s_k, a_{k−1}) =
  σ_i − μ_{1i}·Δp_i(k)², if cps1_i(k) ≥ 200;
  −η_{1i}·[|ace_i(k)| − |ace_i(k−1)|] − μ_{1i}·Δp_i(k)², if 100 ≤ cps1_i(k) < 200;
  −η_{2i}·[|cps1_i(k) − 200| − |cps1_i(k−1) − 200|] − μ_{2i}·Δp_i(k)², if cps1_i(k) < 100;
where r_i(s_{k−1}, s_k, a_{k−1}) is the agent's reward for the state transition from s_{k−1} to s_k under the selected action a_{k−1}, ace_i(k) and cps1_i(k) are the instantaneous values of ace and cps1 at the k-th iteration of regional grid i, and σ_i is the historical maximum reward of region i.
7. The intelligent power generation control method based on the wolf hill-climbing algorithm according to claim 1, characterized in that the value function errors ρ_k and δ_k of step 5 are obtained from the formulas:
ρ_k = r(s_k, s_{k+1}, a_k) + γ·q_k(s_{k+1}, a_g) − q_k(s_k, a_k)
and
δ_k = r(s_k, s_{k+1}, a_k) + γ·q_k(s_{k+1}, a_g) − q_k(s_k, a_k),
where r(s_k, s_{k+1}, a_k) is the agent's reward for the transition from s_k to s_{k+1} under the selected action a_k, γ is the discount factor with 0 < γ < 1, and a_g is the greedy action policy.
8. The intelligent power generation control method based on the wolf hill-climbing algorithm according to claim 1, characterized in that in step 6 the optimal target value function v^{π*}(s) and the optimal policy π*(s) are
v^{π*}(s) = max_{a∈A} q(s, a),
π*(s) = arg max_{a∈A} q(s, a),
where A denotes the action set.
9. The intelligent power generation control method based on the wolf hill-climbing algorithm according to claim 1, characterized in that in step 7 the eligibility trace matrix is updated by the formula:
e_{k+1}(s, a) ← γλ·e_k(s, a),
and the q function table by the formula:
q_{k+1}(s, a) = q_k(s, a) + α·δ_k·e_k(s, a),
where e_k(s, a) is the eligibility trace of the k-th iteration under state s and action a, γ is the discount factor with 0 < γ < 1, λ is the trace-decay factor with 0 < λ < 1, and α is the q learning rate with 0 < α < 1.
10. The intelligent power generation control method based on the wolf hill-climbing algorithm according to claim 1, characterized in that the mixed strategy u_k(s_k, a_k) in step 7 is updated according to the standard WoLF-PHC hill-climbing rule:
u_{k+1}(s_k, a) ← u_k(s_k, a) + Δ_{s_k a},
with Δ_{s_k a} = −δ_{s_k a} if a ≠ arg max_{a'} q_k(s_k, a'), and Δ_{s_k a} = Σ_{a'≠a} δ_{s_k a'} otherwise, where δ_{s_k a} = min(u_k(s_k, a), φ_i/(|A|−1)) and φ_i is the variable learning rate;
in step 7, the value function q_{k+1}(s_k, a_k) is updated according to the formula:
q_{k+1}(s_k, a_k) = q_{k+1}(s_k, a_k) + α·ρ_k;
the eligibility trace element is updated according to the formula:
e_{k+1}(s, a) = γλ·e_k(s, a) + 1 if (s, a) = (s_k, a_k), and e_{k+1}(s, a) = γλ·e_k(s, a) otherwise,
i.e. e(s_k, a_k) ← e(s_k, a_k) + 1; the variable learning rate is selected by the WoLF criterion:
φ = φ_win if Σ_a u(s_k, a)·q(s_k, a) > Σ_a ū(s_k, a)·q(s_k, a), and φ = φ_lose otherwise;
and the average mixed strategy table is updated according to the formula:
ū(s_k, a_i) ← ū(s_k, a_i) + (u(s_k, a_i) − ū(s_k, a_i))/visit(s_k), ∀ a_i ∈ A,
where φ_win and φ_lose are two learning parameters representing the agent's winning and losing, and visit(s_k) is the number of times state s_k has been visited from the initial state to the current state.
CN201610866538.XA 2016-09-30 2016-09-30 Intelligent power generation control method based on hill-climbing algorithm Pending CN106372366A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610866538.XA CN106372366A (en) 2016-09-30 2016-09-30 Intelligent power generation control method based on hill-climbing algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610866538.XA CN106372366A (en) 2016-09-30 2016-09-30 Intelligent power generation control method based on hill-climbing algorithm

Publications (1)

Publication Number Publication Date
CN106372366A true CN106372366A (en) 2017-02-01

Family

ID=57898558

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610866538.XA Pending CN106372366A (en) 2016-09-30 2016-09-30 Intelligent power generation control method based on hill-climbing algorithm

Country Status (1)

Country Link
CN (1) CN106372366A (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140172125A1 (en) * 2012-09-29 2014-06-19 Operation Technology, Inc. Dynamic parameter tuning using particle swarm optimization
CN103490413A (en) * 2013-09-27 2014-01-01 华南理工大学 Intelligent electricity generation control method based on intelligent body equalization algorithm
CN103904641A (en) * 2014-03-14 2014-07-02 华南理工大学 Method for controlling intelligent power generation of island micro grid based on correlated equilibrium reinforcement learning
CN104037761A (en) * 2014-06-25 2014-09-10 南方电网科学研究院有限责任公司 AGC power multi-objective random optimization distribution method
CN107045655A (en) * 2016-12-07 2017-08-15 三峡大学 Wolf pack clan strategy process based on the random consistent game of multiple agent and virtual generating clan

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Xi Lei et al.: "Intelligent power generation control of power systems based on the fast wolf hill-climbing multi-agent learning strategy", Transactions of China Electrotechnical Society (《电工技术学报》) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106899026A (en) * 2017-03-24 2017-06-27 三峡大学 Intelligent power generation control method based on the multiple agent intensified learning with time warp thought
CN107589672A (en) * 2017-09-27 2018-01-16 三峡大学 The intelligent power generation control method of isolated island intelligent power distribution virtual wolf pack control strategy off the net
CN110994620A (en) * 2019-11-16 2020-04-10 国网浙江省电力有限公司台州供电公司 Q-Learning algorithm-based power grid power flow intelligent adjustment method
CN111612162A (en) * 2020-06-02 2020-09-01 中国人民解放军军事科学院国防科技创新研究院 Reinforced learning method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN112615379B (en) Power grid multi-section power control method based on distributed multi-agent reinforcement learning
CN103490413B (en) A kind of intelligent power generation control method based on intelligent body equalization algorithm
CN106899026A (en) Intelligent power generation control method based on the multiple agent intensified learning with time warp thought
CN106372366A (en) Intelligent power generation control method based on hill-climbing algorithm
CN107045655A (en) Wolf pack clan strategy process based on the random consistent game of multiple agent and virtual generating clan
CN109726503A (en) Missing data complementing method and device
Li et al. Probabilistic charging power forecast of EVCS: Reinforcement learning assisted deep learning approach
CN103683337A (en) Interconnected power system CPS instruction dynamic allocation and optimization method
CN104537428B (en) One kind meter and the probabilistic economical operation appraisal procedure of wind power integration
US20230281459A1 (en) Method for calibrating parameters of hydrology forecasting model based on deep reinforcement learning
Wu et al. Power system flow adjustment and sample generation based on deep reinforcement learning
AU2021106780A4 (en) Virtual power plant self-optimisation load track control method
CN106532691A (en) Adaptive dynamic programming-based frequency compound control method of single-region power system
Konstantakopoulos et al. Smart building energy efficiency via social game: a robust utility learning framework for closing–the–loop
CN105787650A (en) Simulation calculation method for Nash equilibrium point of electricity market including multiple load agents
CN116454920A (en) Power distribution network frequency modulation method, device, equipment and storage medium
CN107589672A (en) The intelligent power generation control method of isolated island intelligent power distribution virtual wolf pack control strategy off the net
CN105914752A (en) Pilot node selection method based on clustering by fast search and density peaks
CN104731709A (en) Software defect predicting method based on JCUDASA_BP algorithm
CN117172097A (en) Power distribution network dispatching operation method based on cloud edge cooperation and multi-agent deep learning
Zamani-Gargari et al. Application of particle swarm optimization algorithm in power system problems
CN116933619A (en) Digital twin distribution network fault scene generation method and system based on reinforcement learning
CN104537224B (en) Multi-state System Reliability analysis method and system based on adaptive learning algorithm
CN111428903A (en) Interruptible load optimization method based on deep reinforcement learning
CN116470511A (en) Circuit power flow control method based on deep reinforcement learning

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20170201)