CN107045655A

CN107045655A - Wolf pack clan strategy process based on the random consistent game of multiple agent and virtual generating clan

Info

Publication number: CN107045655A
Application number: CN201611117291.8A
Authority: CN
Inventors: 席磊; 李玉丹; 杨苹; 许志荣; 柳浪; 陈建峰
Original assignee: China Three Gorges University CTGU
Current assignee: China Three Gorges University CTGU
Priority date: 2016-12-07
Filing date: 2016-12-07
Publication date: 2017-08-15

Abstract

Wolf pack clan strategy based on the random consistent game of multiple agent and virtual generating clan, including step：S1, determine state discrete collection S；S2, determine teamwork discrete set A；S3, when each controlling cycle starts, gather the real-time running data of each power network, the real-time running data includes frequency departure △ f and power deviation, calculate regional control error ACE_i(k) instantaneous value and control performance standard CPS_i(k) instantaneous value；S4, in current state S, certain regional power grid i obtains a short-term reward function signal R_i(k)；S5, pass through to calculate and obtain value function error p with estimation_k、δ_k；S6, optimal objective value function and strategy are asked for by function.This method combines harmonious (MAS CC) two frameworks of the multi-agent system of multi intelligent agent game theory (MAS SG) and isolog, the problem of solving the coordination optimization of virtual generating clan, has the advantages that to improve closed-loop system system, improves utilization of new energy resources rate, reduces carbon emission, fast convergence.

Description

Wolf pack clan strategy based on the random consistent game of multiple agent and virtual generating clan Method

Technical field

It is more particularly to a kind of consistent at random based on multiple agent the present invention relates to Economic Dispatch technical field Game and the wolf pack clan strategy process of virtual generating clan, it is adaptable to the dynamic multi-objective optimization of distributing economic load dispatching point Match somebody with somebody.

Background technology

AGC can be generally divided into two steps：A), the tracking of the total generated outputs of AGC, b), by optimized algorithm total hair Electrical power distribution is to each AGC unit.In fact, the AGC general powers that PI controllers have been widely used for IDN coordinate control. In order to further improve AGC adaptability and control performance, existing document propose it is a kind of be used to exchanging microgrid based on online The fuzzy evolutionary algorithm of particle swarm optimization algorithm (particle swarm optimization, PSO).Bacterium is looked for food to optimize and calculated Method (bacteria foraging optimization, BFO), PSO, genetic algorithm (genetic algorithm, GA) and biography The gradient algorithm of system is all applied to optimize control parameters all in microgrid.On the other hand,《The interconnected network learnt based on Q Dynamic optimal CPS is controlled》Have studied intensified learning is used to realize interconnected network SGC, so as to improve AGC dynamic control performances.So And, the research method of above-mentioned document is all centerized fusion, it is necessary to substantial amounts of distant place information, therefore dynamic response is slow, control Performance is not ideal enough.

Distributing correlated equilibrium Q (λ) methods (decentralized based on multiple agent that existing document is proposed Correlated equilibrium Q (λ)-learning, DCEQ (λ)) SGC complexity is solved as optimal policy at random Dynamic characteristic and optimistic coordinated control problem, learn with Q, and Q (λ) study, R (λ) study is compared with PI control algolithms with more excellent Control performance.

But can further be improved in view of its control performance, and when intelligent body number increase, DCEQ (λ) algorithm In the search MA equilibrium solution times in the increase of geometry number, the extensive use in more massive network system of its method is limited. Bowling＆Veloso developed hill climbing algorithm (the win or learn fast of " win " or " Fast Learning " in 2002 Policy hill-climbing, WoLF-PHC)；In study, each intelligent body is using mixed strategy and only preserves the Q values of itself Table.So, on the one hand, it avoids the exploration for needing to solve in general Q study and utilizes this contradictory problems；On the other hand, It can solve the asynchronous decision problem of MA systems.Therefore, based on wolf-phc, eligibility trace and SARSA (λ), it is proposed that a kind of Q The mutation algorithm of (λ) study, i.e., the distributing based on multiple agent is won or Fast Learning hill climbing method (decentralized Win or learn fast policy hill-climbing (λ), DWoLF-PHC (λ), hereinafter referred to as wolf climbs the mountain).The algorithm Utilize the learning rate of changeThe change of environment is perceived in MA, accommodation itself strategy encourages algorithmic statement with this To optimal solution and it ensure that the reasonability of algorithm.With WoLF characteristics, i.e. Win or Learning Fast.It is sharp in algorithm Equilibrium is instead of with average mixed strategy.But the above method is that the tracking that general power is instructed is studied, without Dynamic optimization distribution is carried out to AGC power instructions.Also, when intelligent body number continues to increase, it may appear that many solutions are topic, are caused System is unstable.Therefore need to explore new method, to obtain distributing optimistic coordinated control.

The content of the invention

To overcome the shortcoming and defect of existing Q learning algorithms, the harmonious property of distribution type control system, this hair are solved It is bright to propose a kind of wolf pack clan strategy process based on the random consistent game of multiple agent and virtual generating clan.This method knot Harmonious two frameworks of multi-agent system of multi intelligent agent game theory and isolog are closed, mark decay factor has been considered λ, discount factor γ, Q learning rate α, learning rate changingDeng the influence to system convergence effect；Also contemplate communication delay, noise and The influence that change in topology is dispatched to distributing, further expands the tactful scope of application.Engineering practice can be preferably applicable In nonideal communication environment, and have more preferable optimum results.

The technical solution adopted in the present invention is：

Wolf pack clan strategy process based on the random consistent game of multiple agent and virtual generating clan, including following step Suddenly：

Step S1, determine state discrete collection S.

Step S2, determine teamwork discrete set A.

Step S3, when each controlling cycle starts, gather the real-time running data of each power network, the real time execution number Error ACE is controlled according to regional including frequency departure △ f and power deviation △ P, is calculated_i(k) instantaneous value and control performance Standard CPS_i(k) instantaneous value.

Step S4, in current state S, certain regional power grid i obtains a short-term reward function signal R_i(k)。

Step S5, pass through to calculate and obtain value function error p with estimation_k、δ_k。

Step S6, optimal objective value function and strategy are asked for by function.

Step S7, to all regional power grid j, update institute it is stateful-action is to (s, Q functions form and eligibility trace square a) Battle array e_j(s, a), and by the mixed strategy U under the Q values renewal current state S updated_k(s_k,a_k), then by mixed strategy U_k(s_k,a_k) Update value function Q_k+1(s_k,a_k), eligibility trace element e (s, a), learning rate changing φ and average mixed strategy table.

Step S8, determine Laplacian Matrix L, obtain clan power Δ Pi, and climbing slope is obtained by clan's power.

Step S9, leader and the virtual uniformity variable that followed by are updated.

Step S10, each power of the assembling unit △ P of solution_iwIf the power of the assembling unit is out-of-limit, step S9 is jumped to.

Step S11, when reaching boundary condition, calculate generated output △ P_iwAnd t_iwWith renewal row stochastic matrix element.

Step S12, calculating power deviation △ P_error-i, judge whether to meet, next step calculating carried out if meeting, if not Satisfaction then jumps to step 9.

Step S13, return to step S3.

The state discrete collection S of the step S1, passes through the division of control performance standard CPS and area control error ACE values To determine.

The teamwork discrete set A of step S2 expression formula is：

A=A₁×A₂×…×A_i×…×A_nWherein, A_iFor intelligent body i output discrete movement collection, n is intelligent body Number.

The real-time running data of the step S3 is gathered using computer and monitoring system.

In the step S3, the area control error ACE of the region i_i(k) instantaneous value calculating method is as follows：

ACE=T_a-T_s-10B(F_a-F_s),

Wherein, T_aFor the actual trend value of interconnection, T_sExpect trend value for interconnection, B is frequency bias coefficient, F_aTo be System actual frequency values, F_sFor system expected frequency value.

The CPS of the control performance standard 1 of the region i_i(k) instantaneous value calculating method is as follows：

CPS1=(2-CF1) × 100%,

Wherein,B_iFor control area i frequency bias coefficient；ε₁For interconnection electricity Control targe value of the net to annual 1 minute frequency averaging deviation root mean square；N is the number of minutes of the examination period；ACE_AVE-1minFor Average values of the area control error ACE in 1 minute；△f_AVEFor average values of the frequency departure △ f in 1 minute.

The CPS of the control performance standard 2 of the region i_i(k) instantaneous value calculating method is as follows：

CPS2=(1-R) × 100%,

Wherein,

In formula, ε₁₀For control targe value of the interconnected network to annual 10 minutes frequency averaging deviation root mean square；B_netTo be whole The frequency bias coefficient of interconnected network；ACE_AVE-10minFor average values of the area control error ACE in 10 minutes.

The short-term reward function signal R of the step S4_i(k) obtained by following formula, formula is as follows：

Wherein, ACE (k) and △ P in formula_iw(k) instantaneous value and kth of kth step iteration area control error are represented respectively Walk the real output of w-th of unit in iteration；μ and (1- μ) represent area control error and the weights of carbon emission respectively, The μ values in each region are identical, and μ=0.5 is set to herein；D_iwIt is the carbon intensity coefficient of w units, unit is kg/kWh；WithThe respectively bound of w unit capacities；Thermoelectricity generating set efficiency is considered, when unit adjustable capacity is big When 600MW, D_j=0.87, when unit rated capacity, which is less than or equal to 600MW, is more than 300MW, take D_j=0.89, when unit holds Amount is less than or equal to take D during 300MW_j=0.99.The D of fuel oil consump-tion, Gas Generator Set and Hydropower Unit in each VGT_jSet respectively For 0.7,0.5,0.

The value function error p of the step S5_k、δ_kBy formula p_k=R (s_k,s_k+1,a_k)+γQ_k(s_k+1,a_g)-Q_k(s_k,a_k) And δ_k=R (s_k,s_k+1,a_k)+γQ_k(s_k+1,a_g)-Q_k(s_k,a_k)

Wherein, R (s_k,s_k+1,a_k) it is in selected action a_kLower state is from s_kTo s_k+1Intelligent body reward function, γ for folding The factor is detained, γ span is 0<γ<1, a_gFor greedy action policy.

In the step S6, optimal objective value functionAnd strategyFor

In formula, A is behavior aggregate.

In the step S7, update eligibility trace matrix and pass through formula：

e_k+1(s,a)←γλe_k(s a), updates Q function forms, according to formula Q_k+1(s, a)=Q_k(s,a)+αδ_ke_k(s,a)

Wherein, e_k(s a) walks the eligibility trace of iteration, γ is discount factor, γ value model for kth under acting a in state s Enclose for 0<γ<1, λ is mark decay factor, and λ span is 0<λ<1, α is Q learning rates, and α sets scope to be 0<α<1.

Mixed strategy U in the step S7_k(s_k,a_k) updated according to following formula：

In formula, φ_iFor learning rate changing.

In the step S7, value function Q is updated_k+1(s_k,a_k), according to formula：

Q_k+1(s_k,a_k)=Q_k+1(s_k,a_k)+αp_k

Update eligibility trace element e (s_k,a_k)←e(s_k,a_k)+1, according to formula：

Update learning rate changingAccording to formula：

Average mixed strategy table is updated, according to formula：

In formula,WithTwo learning parameters are used for representing the win of intelligent body and defeated, visit (s_k) it is from original state The s undergone to current state_kNumber of times.

Step S8, determines Laplace formula L=[l_ij] ∈ Rn × n, according to formula

In formula, constant b_ij(b_ij>=0) represent weight factor between intelligent body

Climbing slope is calculated, according to formula：

In formula, UR_iwAnd DR_iwIt is the bound of climbing slope respectively.

Step S9, is updated to leader and the virtual uniformity variable that followed by, according to formula：

The former is that leader's uniformity variable is updated, and the latter is that the virtual uniformity variable that followed by is carried out more Newly, in formula, in i-th of VGT, mi is unit sum,It is μ i in random row matrix, formula>0 represents i-th The Dynamic gene of individual VGT power deviations, △ P_error-iRepresent the deviation of i-th of VGT general powers instruction and all aggregate capacities.

Further, power of the assembling unit △ P are calculated_iwIf the power of the assembling unit is out-of-limit, power deviation is calculated, is judged whether full Sufficient condition, then obtain the power of the assembling unit, carries out repeatedly calculating k=k+1 next time, deviation is unsatisfactory for condition, then since being calculated uniformity Above step is repeated to calculate；When reaching boundary condition, generated output △ P are calculated_iwAnd t_iw, according to formula:

Row stochastic matrix element is updated, according to formula

In formula, L=[l_ij]∈R^n×nIt is Laplacian Matrix, constant b_ij(b_ij>=0) represent weight factor between intelligent body

D=[d_ij]∈R^n×nRepresent row stochastic matrix,It is i-th of VGT weighted adjacent matrix

Further, the deviation △ P of aggregate capacity_error-iCalculation formula：

If Δ P_i>0, thenOtherwise

Further, judge whether power deviation meets condition, meet then progress next iteration calculating and take k=k+1；No Meet, the step of jumping to calculating uniformity.

A kind of wolf pack clan strategy process based on the random consistent game of multiple agent and virtual generating clan of the present invention, has Beneficial effect is as follows：

1)：The present invention combines the multiple agent system of multi-agent system multi intelligent agent game theory (MAS-SG) and isolog Harmonious (MAS-CC) the two big frame system of system, the problem of solving the coordination optimization of virtual generating clan.

2)：The present invention solves distributing correlated equilibrium Q (λ) methods based on multiple agent in intelligent body number increase Extensive use of its method in more massive network system is limited, is improved in existing hill climbing algorithm, the calculation Method utilizes the learning rate changedThe change of environment is perceived in MA, accommodation itself strategy encourages algorithm to receive with this Hold back optimal solution and ensure that the reasonability of algorithm.

3)：The present invention solve tradition centralization AGC can not meet new energy continuous access and intelligent grid " i.e. insert Use " demand, the algorithm using virtual consistent variable solve due to power it is out-of-limit caused by change in topology and AGC units i.e. Plug-and-play problem.

Brief description of the drawings

Fig. 1 is AGC MAS control frameworks.

Fig. 2 is VTC frame diagrams.

Fig. 3 is that wolf pack hunts strategic process figure.

Fig. 4 is LOAD FREQUENCY Controlling model figure described in embodiment.

Embodiment

Wolf pack clan strategy process based on the random consistent game of multiple agent and virtual generating clan, complete description is such as Under：

1), analyze the behaviour of systems determination state discrete collection S, specifically can determine state by CPS 1 and ACE values division Discrete set S.

2) teamwork discrete set A, wherein A=A, are determined₁×A₂×…×A_i×…×A_n, A_iFor intelligent body i output Discrete movement collection, n is intelligent body number.

3) real-time running data of regional power network, is gathered when each controlling cycle starts：△ f, △ P, and calculate The ACE of regional_iAnd CPS (k)_i(k) instantaneous value R_i(k), R_i(k) the regional power grid i kth that is designed as walks ACE's and CPS1 The linear combination of difference value and power adjustment value.

4), using in selected action a_kLower state is from s_kTo s_k+1Intelligent body reward function R (s_k,s_k+1,a_k), value folding Detain factor gamma, its value scope 0<γ<1, by ρ_k=R (s_k,s_k+1,a_k)+γQ_k(s_k+1,a_g)-Q_k(s_k,a_k) and δ_k=R (s_k,s_k+1, a_k)+γQ_k(s_k+1,a_g)-Q_k(s_k,a_g) the Q function errors p that intelligent body is walked in kth is asked for respectively_kWith the error evaluation δ of Q functions_k。

5), for each state action to (s a), is performed：

1. eligibility trace matrix e is updated_k+1(s, a)=λ × γ e_k(s,a)；

2. Q functions Q is updated_k+1(s, a)=Q_k(s,a)+αδ_ke_k(s,a)；

Wherein, λ, γ, α are respectively the qualification decay factor of control system, discount factor, Q learning rates, their value model It is trapped among [0,1], first step of execution updates eligibility trace matrix e with λ and γ_k(s a), updates the mistake asked required by Q functions Difference assesses δ_k, learning rate α and eligibility trace e_k(s,a)。

6) for region j, perform

1. step 5 obtains value function Q_k+1(s, a)=Q_k(s,a)+αδ_ke_k(s,a)

2. formula is utilizedSolve mixed strategyWhereinIt is learning rate changing, | A_i| it is behavior aggregate Element number.

③：According to take satisfaction require Q learning parameters α and step 4 required by Q functional value errors ρ_k, update value function Q_k+1(s_k, a_k)=Q_k+1(s_k, a_k)+αp_k；

4. eligibility trace element e (s are updated_k,a_k)=e (s_k,a_k)+1；

5. learning rate changing is selectedAccording to

In formula,WithTwo learning parameters be used for representing the win of intelligent body with it is defeated.

6. the mixing Average Strategy U (s required by step 6 are utilized_k,a_k), update average mixed strategy table：

7. the number of times undergone from original state to current state is updated：

visit(s_k)=visit (s_k)+1

For i VTG, clan power △ P are first obtained_iw(i=1,2,3 ..., n)

7) climbing slope, is solved

Wherein, UR_iwAnd DR_iwIt is the bound of climbing slope respectively.

8), leader and the virtual uniformity variable that followed by are updated, according to formula：

The former is that leader's uniformity variable is updated, and the latter is that the virtual uniformity variable that followed by is carried out more Newly, in formula, in i-th of VGT, m_iIt is unit sum,It is μ i in random row matrix, formula>0 represents i-th The Dynamic gene of individual VGT power deviations, △ P_error-iRepresent the deviation of i-th of VGT general powers instruction and all aggregate capacities.

9), when reaching boundary condition, generated output Δ P is calculated_iwAnd t_iw, according to formula：

In formula, UR_iwAnd DR_iwIt is the bound of climbing slope respectively.

10) following machine matrix, is updated, according to formula：

D=[d_ij]∈R^n×nRepresent row stochastic matrix,It is i-th of VGT weighted adjacent matrix.

11) power deviation and judgment bias, are calculated, according to formula

10) more, | P_error-r|<ε_i

Power of the assembling unit △ P obtained by step 10_iw, calculate power deviation △ P_error-i, whether judgment bias meets condition, full Foot obtains the power of the assembling unit.It is unsatisfactory for, jumps to step 8.

12) power of the assembling unit △ P, are obtained_iw, next iteration k=k+1 is carried out, step 1 is jumped to.

The application that wolf pack hunts strategy is not limited by centralization calculating and single centralized controller power instruction distribution System.If in fact, some intelligent body failures, other intelligent bodies can continue to carry out information interchange and realize new uniformity. Due to generally there is more than one communication channel between intelligent body, when certain passage breaks down, AGC performances can still be kept It is optimal.This information sharing depended between each intelligent body, as shown in Figure 3.Some related concepts are as follows：

1., manor：Regional power grid in one independent cut set, refers generally in province's net and three lines of defence Active Splitting system The isolated island region transmission and distribution network of matching.Typically there is large-scale plant-grid connection in the power network of manor, therefore with microgrid and active distribution Distinguish.

2., clan：Only one of which clan in one manor, clan is all true generatings for participating in frequency modulation in the power network of manor Unit and virtual synchronous generator group (such as energy-storage system and interruptible load system).

3., head：Dispatching terminal in one clan only one of which head, i.e., whole clan.Head is responsible for saving net scheduling End (higher level) and other clan's dispatching terminals (other clan leaders) are linked up, contacted with being cooperated, and instruction is issued in this clan Parent in each family.

4., family：A generating set group with similar generating control characteristic in clan, such as thermal motor group, Gas engine group etc..One clan is by multiple family compositions.

5., parent：There is leader's generation-control unit of stronger dispatching in family.Parent can carry out actively searching Rope, independently executes the instruction of complexity.

6., member：One independent generation-control unit, can only imitate the behavior of parent, perform simple instruction.

7., reserves：The reserve force just set out when needing to surround prey at crucial moment (herein refers to water-storage electricity Stand), if that is, load disturbance exceedes the 50% of preset value, hydroenergy storage station then brings into operation.It is with " energy storage wolf pack man The form in front yard " occurs.

Embodiment：

The present embodiment is that under the general frame of south electric network, using Guangdong Power Grid as main study subject, simulation model is The detailed full dynamic simulation model that Guangdong Center of Electric Dispatching and Transforming's practical engineering project is built, detailed model parameter and emulation Design principle refers to Yu Tao, Zhou Bin, what Chen Jiarong was delivered《The interconnected network dynamic optimal CPS controls learnt based on Q》(in State's electrical engineering journal), south electric network is divided into Guangdong, Guangxi, Guizhou and four, Yunnan regional power grid in the simulation model, wherein Being subject to the sampling time in Guangdong Power Grid and other each province's power networks, no more than 1500MW, (correspondence Guangdong Power Grid is maximum for 15min, amplitude Single failure-direct current monopole locking) limited-bandwidth white noise load disturbance, to south electric network each province LOAD FREQUENCY respond be Number adds white noise parameter perturbation, and simulation study is modeled using Simulink.Each regional power grid AGC Control is in synthesis Other regional power grid ACE instantaneous values and take and each seek optimal joint action policy under conditions of strategy.

Step 1), analyze the behaviour of systems with to state set s discretizations:This example refers to according to Guangdong Electric Power control centre CPS The criteria for classifying is marked, CPSl values are divided into 6 states:(- ∞, 0), [0,100%), [100%, 150%), [150%, 180%), [180%, 200%), [200% ,+∞), then ACE is divided into positive and negative 2 states, thus each intelligent body can be true Surely there are 12 states.ACE state causes the reason for CPS indexs are fluctuated primarily to distinguishing.

Step 2), determine teamwork discrete set A, the behavior aggregate of the i-th regional power grid is A_i=[- 50, -20, -10, -5, 0,5,10,20,50] MW, teamwork set of strategies is, A=A₁×A₂×…×A_i×…×A_n, A is that the output of controller is moved Make, i.e. AGC power adjustment instruction.Control step-length to use AGC controlling cycles, take 4s.

Step 3), when each controlling cycle starts gather regional power network real-time running data:△ f, △ P, its Middle △ f represent system frequency deviation, and △ P represent dominant eigenvalues deviation according to international evaluation method ACE=T_a-T_s-10B (F_a-F_s)(T_a, T_sRespectively the actual trend value of interconnection is with expecting trend value；B is frequency bias coefficient；F_a, F_sRespectively system Actual frequency values and expected frequency value),(B_iFor control area i frequency bias coefficient； ε₁For control targe value of the interconnected network to annual 1 minute frequency averaging deviation root mean square；N is the number of minutes of the examination period), CPS2=(1-R) × 100%,(ε₁₀It is interconnected network to annual 10 minutes frequencies The control targe value of average deviation root mean square；B_netFor the frequency bias coefficient of whole interconnected network；ACE_AVE-10minControlled for region Average values of the error ACE processed in 10 minutes),With formula CPS2=(1-R) × 100% calculates the ACE of regional_iAnd CPS (k)_i(k) instant value.

Step 4), according to the ACE of regional_iAnd CPS (k)_i(k) instant value determines current state s, then stateful s and prize Encourage the award value R immediately that function obtains regional power grid_i(k), reward function design is as follows：

s.t.△P_iw ^min≤△P_iw≤△P_iw ^max

ACE (k) and △ P in formula_iw(k) instantaneous value and kth step iteration of kth step iteration area control error are represented respectively In w-th of unit real output；μ and (1- μ) represent area control error and the weights of carbon emission respectively, each The μ values in region are identical, and μ=0.5 is set to herein；D_iwIt is the carbon intensity coefficient of w units, unit is kg/kWh； WithThe respectively bound of w unit capacities；Thermoelectricity generating set efficiency is considered, when unit adjustable capacity is more than During 600MW, D_j=0.87, when unit rated capacity, which is less than or equal to 600MW, is more than 300MW, take D_j=0.89, work as unit capacity D is taken during less than or equal to 300MW_j=0.99.The D of fuel oil consump-tion, Gas Generator Set and Hydropower Unit in each VGT_jIt is respectively set to 0.7,0.5,0。

5), for each state action to (s a), is performed：

1. eligibility trace matrix e is updated_k+1(s, a)=0.9 × 0.9e_k(s,a)；

2. Q functions Q is updated_k+1(s, a)=Q_k(s,a)+0.5δ_ke_k(s,a)；

Wherein, λ, γ, α are respectively the qualification decay factor of control system, discount factor, Q learning rates, and their value is 0.9th, 0.9,0.5, first step of execution updates eligibility trace matrix e with λ and γ_k(s a), updates what is asked required by Q functions Error evaluation δ_k, learning rate α and eligibility trace e_k(s,a)。

6) for region j, perform

1. step 5 obtains value function Q_k+1(s, a)=Q_k(s,a)+0.5δ_ke_k(s,a)

2. formula is utilizedSolve mixed strategyWhereinIt is learning rate changing, | A_i| it is behavior aggregate Element number, takes 11 here.

③：According to take satisfaction require Q learning parameters α and step 4 required by Q functional value errors ρ_k, update value function Q_k+1(s_k, a_k)=Q_k+1(s_k, a_k)+0.5p_k；

4. eligibility trace element e (s are updated_k,a_k)=e (s_k,a_k)+1；

5. learning rate changing is selectedAccording to

visit(s_k)=visit (s_k)+1

For i VTG, clan power △ P are first obtained_iw(i=1,2,3 ..., n)

7) climbing slope, is solved

Wherein, UR_iwAnd DR_iwIt is the bound of climbing slope respectively.

The former is that leader's uniformity variable is updated, and the latter is that the virtual uniformity variable that followed by is carried out more Newly, in formula, UR_iwAnd DR_iwIt is the bound of climbing slope respectively.

10) following machine matrix, is updated, according to formula：

11) power deviation and judgment bias, are calculated, according to formula

10) more, | P_error-r|<ε_i

Core of the present invention is solution, the Mei Gezhi in the designing of reward function, optimal average mixed strategy and learning rate changing Renewal, the virtually proposition of generating clan (VTC) and the combination of consistency algorithm of energy body Q values.Wherein virtual generating clan carries Go out with uniformity calculate combination be key innovations, the implementation of this method and correlation technique, realize save net, distribution and Optimal control between microgrid, the problem of solution occurs solve when intelligent body number is numerous more and is carried out to AGC power instructions The problem of dynamic optimization is distributed, this solves the consistent game of Stochastic Game that is mixed based on isomorphism isomery multiple agent simultaneously Basic Science Problem.

The present invention proposes virtual this concept of generating clan (VGT), the multi-agent system stochastic game theory combined (MAS-SG) and harmonious the two frameworks of property (MAS-CC) of multi-agent system, to solve to instruct dynamic optimization to general power Control and distribution：Using MAS-SG frameworks, used method is that the distributing based on multiple agent is won or Fast Learning Hill climbing method (decentralized win or learn fast policy hill-climbing (λ), DWoLF-PHC (λ)), the complicated dynamic game for solving the multiple agent of isomery is discussed and the problem of decision theory, realizes the optimum control to AGC； MAS-CC frameworks are employed, used method is harmonious control algolithm (collaborative consensus Algorithm, CCA), solve quick distribution power, optimization Cooperation controlling.

Above-described embodiment is preferably embodiment, but embodiments of the present invention are not by above-described embodiment of the invention Limitation, other any Spirit Essences without departing from the present invention and the change made under principle, modification, replacement, combine, simplification, Equivalent substitute mode is should be, is included within protection scope of the present invention.

Claims

1. the wolf pack clan strategy process based on the random consistent game of multiple agent and virtual generating clan, it is characterised in that including Following steps：

Step S1, determine state discrete collection S；

Step S2, determine teamwork discrete set A；

Step S3, when each controlling cycle starts, gather the real-time running data of each power network, the real-time running data bag Frequency departure △ f and power deviation △ P are included, regional control error ACE is calculated_i(k) instantaneous value and control performance standard CPS_i(k) instantaneous value；

Step S4, in current state S, certain regional power grid i obtains a short-term reward function signal R_i(k)；

Step S5, pass through to calculate and obtain value function error p with estimation_k、δ_k；

Step S6, optimal objective value function and strategy are asked for by function；

Step S7, to all regional power grid j, update institute it is stateful-action is to (s, Q functions form and eligibility trace matrix e a)_j (s, a), and by the mixed strategy U under the Q values renewal current state S updated_k(s_k,a_k), then by mixed strategy U_k(s_k,a_k) update Value function Q_k+1(s_k,a_k), eligibility trace element e (s, a), learning rate changing φ and average mixed strategy table；

Step S8, determine Laplacian Matrix L, obtain clan power △ P_i, and climbing slope is obtained by clan's power；

Step S9, leader and the virtual uniformity variable that followed by are updated；

Step S10, each power of the assembling unit △ P of solution_iwIf the power of the assembling unit is out-of-limit, step S9 is jumped to；

Step S11, when reaching boundary condition, calculate generated output △ P_iwAnd t_iwAnd update row stochastic matrix element；

Step S12, calculating power deviation △ P_error-i, judge whether to meet, next step calculating carried out if meeting, if being unsatisfactory for Then jump to step step 9；

Step S13, return to step S3.

2. the strategy side of wolf pack clan according to claim 1 based on multiple agent random unanimously game and virtual generating clan Method, it is characterised in that：The teamwork discrete set A of step S2 expression formula is：

3. the strategy side of wolf pack clan according to claim 1 based on multiple agent random unanimously game and virtual generating clan Method, it is characterised in that：The real-time running data of the step S3 is gathered using computer and monitoring system；

The area control error ACE of the region i_i(k) instantaneous value calculating method is as follows：

ACE=T_a-T_s-10B(F_a-F_s),

Wherein, T_aFor the actual trend value of interconnection, T_sExpect trend value for interconnection, B is frequency bias coefficient, F_aIt is real for system Border frequency values, F_sFor system expected frequency value；

CPS1=(2-CF1) × 100%,

Wherein,B_iFor control area i frequency bias coefficient；ε₁It is interconnected network to complete The control targe value of year 1 minute frequency averaging deviation root mean square；N is the number of minutes of the examination period；ACE_AVE-1minControlled for region Average values of the error ACE processed in 1 minute；△f_AVEFor average values of the frequency departure △ f in 1 minute；

CPS2=(1-R) × 100%,

Wherein,

In formula, ε₁₀For control targe value of the interconnected network to annual 10 minutes frequency averaging deviation root mean square；B_netFor whole interconnection The frequency bias coefficient of power network；ACE_AVE-10minFor average values of the area control error ACE in 10 minutes.

4. the strategy side of wolf pack clan according to claim 1 based on multiple agent random unanimously game and virtual generating clan Method, it is characterised in that：The short-term reward function signal R of the step S4_i(k) obtained by following formula, formula is as follows：

Wherein, ACE (k) and △ P in formula_iw(k) instantaneous value and kth step for representing kth step iteration area control error respectively change The real output of w-th of unit in generation；μ and (1- μ) represent area control error and the weights of carbon emission respectively, each The μ values in region are identical, and μ=0.5 is set to herein；D_iwIt is the carbon intensity coefficient of w units, unit is kg/kWh； WithThe respectively bound of w unit capacities；Thermoelectricity generating set efficiency is considered, when unit adjustable capacity is more than During 600MW, D_j=0.87, when unit rated capacity, which is less than or equal to 600MW, is more than 300MW, take D_j=0.89, work as unit capacity D is taken during less than or equal to 300MW_j=0.99；The D of fuel oil consump-tion, Gas Generator Set and Hydropower Unit in each VGT_jIt is respectively set to 0.7,0.5,0。

5. the strategy side of wolf pack clan according to claim 1 based on multiple agent random unanimously game and virtual generating clan Method, it is characterised in that：The value function error p of the step S5_k、δ_kBy formula：

p_k=R (s_k,s_k+1,a_k)+γQ_k(s_k+1,a_g)-Q_k(s_k,a_k) and δ_k=R (s_k,s_k+1,a_k)+γQ_k(s_k+1,a_g)-Q_k(s_k, a_k)

Wherein, R (s_k,s_k+1,a_k) it is in selected action a_kLower state is from s_kTo s_k+1Intelligent body reward function, γ be discount because Son, γ span is 0<γ<1, a_gFor greedy action policy.

6. the strategy side of wolf pack clan according to claim 1 based on multiple agent random unanimously game and virtual generating clan Method, it is characterised in that：In the step S6, optimal objective value functionWith tactful π^*(s) it is：

In formula, A is behavior aggregate.

7. the strategy side of wolf pack clan according to claim 1 based on multiple agent random unanimously game and virtual generating clan Method, it is characterised in that：In the step S7, update eligibility trace matrix and pass through formula：

e_k+1(s,a)←γλe_k(s a), updates Q function forms, according to formula Q_k+1(s, a)=Q_k(s,a)+αδ_ke_k(s,a)；

Wherein, e_k(s, a) walks the eligibility trace of iteration for kth under acting a in state s, and γ is discount factor, and γ span is 0 <γ<1, λ is mark decay factor, and λ span is 0<λ<1, α is Q learning rates, and α sets scope to be 0<α<1.

8. the strategy side of wolf pack clan according to claim 1 based on multiple agent random unanimously game and virtual generating clan Method, it is characterised in that：Mixed strategy U in the step S7_k(s_k,a_k) updated according to following formula：

In formula, φ_iFor learning rate changing.

Q_k+1(s_k,a_k)=Q_k+1(s_k,a_k)+αp_k

Update learning rate changingAccording to formula：

Average mixed strategy table is updated, according to formula：

In formula,WithTwo learning parameters are used for representing the win of intelligent body and defeated, visit (s_k) be from original state to work as The s that preceding state is undergone_kNumber of times.

9. the strategy side of wolf pack clan according to claim 1 based on multiple agent random unanimously game and virtual generating clan Method, it is characterised in that：Step S8, determines Laplace formula L=[l_ij]∈R^n×n, according to formula

In formula, constant b_ij(b_ij>=weight factor between intelligent body 0) is represented, climbing slope is calculated, according to formula：

In formula,It is the climbing power of unit, UR_iwAnd D_iwIt is the bound of climbing slope respectively.

10. the wolf pack clan according to claim 1 based on the random consistent game of multiple agent and virtual generating clan is tactful Method, it is characterised in that：Step S9, is updated to leader and the virtual uniformity variable that followed by, according to formula：

The former is that leader's uniformity variable is updated, and the latter is that the virtual uniformity variable that followed by is updated. In formula, in i-th of VGT, m_iIt is unit sum,It is μ in random row matrix, formula_i>0 represents i-th The Dynamic gene of VGT power deviations, △ P_error-iRepresent the deviation of i-th of VGT general powers instruction and all aggregate capacities；

Calculate power of the assembling unit △ P_iwIf the power of the assembling unit is out-of-limit, power deviation is calculated, judges whether to meet condition, then obtains Since the power of the assembling unit, carry out repeatedly calculating k=k+1 next time, deviation is unsatisfactory for condition, then above step meter repeated calculating uniformity Calculate；When reaching boundary condition, generated output △ P are calculated_iwAnd t_iw, according to formula:

Row stochastic matrix element is updated, according to formula：

In formula, L=[l_ij]∈R^n×nIt is Laplacian Matrix, constant b_ij(b_ij>=0) represent weight factor D=between intelligent body [d_ij]∈R^n×nRepresent row stochastic matrix,It is i-th of VGT weighted adjacent matrix

The deviation △ P of aggregate capacity_error-iCalculation formula：

If Δ P_i>0, thenOtherwise

Judge whether power deviation meets condition, meet then progress next iteration calculating and take k=k+1；It is unsatisfactory for, jumps to meter The step of calculating uniformity.