CN116054285A

CN116054285A - Transmission and distribution frequency modulation resource cooperative control method based on federal reinforcement learning algorithm

Info

Publication number: CN116054285A
Application number: CN202211728739.5A
Authority: CN
Inventors: 陈然; 许汉平; 周蠡; 蔡杰; 贺兰菲; 周英博; 李吕满; 张赵阳; 廖晓红; 熊一; 孙利平; 熊川羽
Original assignee: Economic and Technological Research Institute of State Grid Hubei Electric Power Co Ltd
Current assignee: Economic and Technological Research Institute of State Grid Hubei Electric Power Co Ltd
Priority date: 2022-12-30
Filing date: 2022-12-30
Publication date: 2023-05-02

Abstract

A transmission and distribution frequency modulation resource cooperative control method based on a federal reinforcement learning algorithm comprises the following steps: dividing a regional power grid into a main net area and a plurality of net distribution areas; setting an agent in a dispatching center of each zone, and establishing a corresponding DQN neural network model for each agent; each intelligent agent uses the data of the corresponding patch to carry out local training on the corresponding DQN neural network model, encrypts the information of the DQN neural network model after the local training, and uploads the encrypted information to the aggregation center; the aggregation center carries out gradient average processing on the encryption information and returns the encryption information to each intelligent agent, each intelligent agent carries out subsequent training on the locally trained DQN neural network model by utilizing the information after gradient average processing to obtain a trained DQN neural network model, and a frequency modulation instruction of each unit which is received and scheduled in the regional power grid is obtained through the trained DQN neural network model. The design can ensure safe and efficient operation of the regional power grid and privacy safety of frequency modulation users.

Description

Transmission and distribution frequency modulation resource cooperative control method based on federal reinforcement learning algorithm

Technical Field

The invention belongs to the field of automatic power generation control of power systems, and particularly relates to a transmission and distribution frequency modulation resource cooperative control method based on a federal reinforcement learning algorithm.

Background

The automatic power control (automatic power control, APC) is extended to the adjustable load side continuously on the basis of the traditional generator set, covers the original AGC technical connotation of the generator set, and also confirms the frequency modulation capability of flexible resources. Most of the flexible resources are accessed through a distribution network, and the development of communication and other technologies is carried out, so that the distribution network is gradually changed into a local power network with self-balancing capability from an original unidirectional power receiving network, and the relationship between the main power distribution network and the distribution network is also changed into a mutual supporting bidirectional interaction relationship from an original principal-subordinate attachment relationship. The traditional frequency modulation resources such as thermal power and hydropower are mostly connected to the main network side, the topology of the power grid at the side is relatively simple in structure compared with a distribution network, and the distributed power source is mostly connected to the distribution network side, and the distributed power source receives the power after being scheduled to increase or reduce the power, so that the influence on the operation of the power distribution network is not negligible, and the main function of the power distribution network is to provide reliable electric energy for users all the time. In this context, how to combine these power characteristics with the different resources of the environment to participate in APC is a challenge in exploring the development of new power systems.

The current closed loop control process of the regional power grid APC is mainly divided into 2 processes: 1) Collecting the frequency deviation and the tie line power deviation of a power grid, calculating a real-time regional control deviation ACE, and obtaining a total generated power instruction through a PI controller; 2) And distributing the instruction to each APC unit by using a related power distribution method. At present, the total regulating power command is mainly distributed according to the adjustable capacity of the unit, but the strategy cannot meet the optimal control requirement of the system. Meanwhile, the traditional centralized control has large calculated amount, centralized communication and poor reliability, and cannot adapt to the active power distribution network structure with flexible and changeable structure, so that the control mode of the system is gradually changed from centralized control to distributed control, but the distributed control of each frequency modulation unit installation intelligent body is difficult to realize the integral optimization of an autonomous area due to the large dispersion characteristic of a distributed power supply. In addition, the increasing number of distributed power sources has led to an increased trend towards multiple principals, and multi-principal privacy has also been a threat. Based on the existing problems of distributed control, a flexible power optimal allocation strategy is needed to ensure safe and efficient operation of regional power grids and privacy security of frequency modulation users.

Disclosure of Invention

The invention aims to overcome the problems that the existing control method in the prior art is difficult to meet the optimal control requirement of a system, cannot adapt to an active power distribution network structure with flexible and changeable structures and the privacy security of frequency modulation users faces threat, and provides a transmission and distribution frequency modulation resource cooperative control method based on a federal reinforcement learning algorithm, which can meet the optimal control requirement of the system, adapt to the active power distribution network structure with flexible and changeable structures and ensure the safe and efficient operation of regional power grids and the privacy security of the frequency modulation users.

In order to achieve the above object, the technical solution of the present invention is:

a cooperative control method for frequency modulation resources of transmission and distribution based on a federal reinforcement learning algorithm comprises the following steps:

s1, dividing a regional power grid into a main net area and a plurality of net distribution areas;

s2, setting an agent in a dispatching center of each zone, and establishing a corresponding DQN neural network model for each agent;

s3, each agent respectively uses local data of a corresponding patch to perform local training on the corresponding DQN neural network model, performs homomorphic encryption on the information of the DQN neural network model after the local training, and uploads the encrypted information to the aggregation center;

S4, the aggregation center carries out gradient average processing on all the encrypted information, sends the information subjected to gradient average processing to each intelligent agent, receives the information subjected to gradient average processing, carries out subsequent training on the corresponding local trained DQN neural network model according to the information subjected to gradient average processing, and obtains a trained DQN neural network model, and the frequency modulation instruction of each unit which is subjected to scheduling is obtained through the trained DQN neural network model.

In the step S3, when each agent uses the local data of the corresponding patch to perform local training on the DQN neural network model, each agent state space, action space and rewarding function are set according to the markov decision process;

the state space for setting the z-number agent specifically comprises:

the size of a total frequency adjusting instruction for determining the total deviation of frequency response in the frequency allocation process is taken as the state space of the z-number intelligent agent;

the state of the z-number agent at the time t is S _z,t ；

The action space for setting the z-number intelligent agent specifically comprises:

set up action space A that z number agent can decision _z All control behaviors of the z-number agent are performed in the action space A _z Selecting;

control behavior a of z-number intelligent agent at t moment _z,t Can be expressed as:

in the formula (1):

active output of the o-th thermal power unit controlled by the z-number intelligent agent at the time t; />

The active force of the mth energy storage device controlled by the z-number intelligent agent at the time t; />

The active output of the nth wind turbine generator set controlled by the z-number intelligent agent at the time t; />

The active force of the jth electric automobile group controlled by the z-number intelligent agent at the t moment;

the setting of the rewarding function of the z-number agent specifically comprises the following steps:

setting rewards of the environment on the control behaviors of the z-number intelligent agent, aiming at minimizing deviation of the adjustment power instruction value and the power response value, and constructing a rewarding function of the z-number intelligent agent:

in the formulae (2) to (3), R _z,t The rewarding function of the z-number intelligent agent at the time t; q is the number of control periods; q is the number of APC units in the patch corresponding to the z-number intelligent agent; i is the ith APC unit in the patch corresponding to the z-number agent; t is the t-th discrete control period; deltaP _i ^G The power command value is adjusted for the input of the ith APC unit in the zone corresponding to the z-number intelligent agent; deltaP _i ^R The power response value of the ith APC unit in the patch corresponding to the z-number intelligent agent;

the cost function for the objective function minF is obtained by discounted cumulative summation:

In the formula (4), the amino acid sequence of the compound,

to control the behavior a _z All jackpots generated are averaged; gamma ray ^t' ∈[0，1]，γ ^t' Is a discount coefficient; r is R _z,t' Is the accumulation of bonus functions corresponding to a plurality of consecutive behaviors of the z-number agent.

In the step S3, the local training of the corresponding DQN neural network model by the z-number agent using the local data of the corresponding patch specifically includes:

s31, initializing current network parameters of a corresponding DQN neural network model by a z-number agent, and copying a target network with the same structure;

s32, training the DQN neural network model by the z-number agent according to the state data of 96 time periods in the day of the corresponding patch, and updating the parameters of the target network.

In the step S32, training the DQN neural network model by the z-agent with the state data of 96 time periods within a day corresponding to the patch includes:

s321, selecting state data of a time period from state data of 96 time periods in the day as the current state S of the z-number intelligent agent _t ；

S322, current state S based on z number agent _t Trial-error is performed by adopting an epsilon-greedy strategy, namely, the control behavior a is selected by using a random strategy with probability epsilon _t Selecting current optimal control behavior with probability 1-epsilon

a _t 、/>

Wherein:

S323, calculating the magnitude r of the rewarding function after executing the control behavior a according to the selected control behavior a and the current network in the DQN neural network model _t And the Q value is calculated by the following function:

Q(s _t ,a _t )＝Q(s _t ,a _t )+η[r _t +μmaxQ(s _t+1 ,a _t+1 )-Q(s _t ,a _t )] (6)；

in the formulas (5) and (6), Q(s) _t ,a _t ) Is the current Q value; max Q(s) _t+1 ,a _t+1 ) Is the target Q value; η is the learning rate; mu is a reward attenuation coefficient;

s324, according to the selected control behavior a, acquiring the next state S returned by the environment after the z-number intelligent agent executes the selected control behavior a _t+1 Obtaining an empirical sample (s _t ，a，r _t ，s _t+1 ) And the empirical sample (s _t ，a，r _t ，s _t+1 ) Storing the experience playback pool;

s325, updating the current state of the z-number agent to be the next state returned by the environment, and repeating the steps S322-S324 until the experience playback pool is full;

s326, after the experience playback pool is full, extracting omega experience samples from the experience playback pool for calculation, and updating the loss function:

in the formula (7): f (F) _z As a loss function; r is (r) _i,z A reward function for the z-number agent; q(s) _z,i ,a _z,i ) The Q value of the current network;

the target Q value corresponding to the experience sample.

In the step S3, the addition homomorphic encryption is performed on the information of the locally trained DQN neural network model, and the uploading aggregation center of the encrypted information specifically includes:

S34, each intelligent agent encrypts a corresponding loss function in the locally trained DQN neural network model by using the Paillier full homomorphic encryption public key K to obtain an encrypted loss function;

s35, each agent transmits the encrypted loss function to the aggregation center.

In step S4, the aggregation center performs gradient average processing on all the encrypted information, and sends the information after gradient average processing to each agent, each agent receives the information after gradient average processing, and performs subsequent training on the corresponding locally trained DQN neural network model according to the information after gradient average processing, which specifically includes:

s41, the aggregation center calculates and obtains a comprehensive loss function according to the encrypted loss function sent by each intelligent agent

In formula (8):

representing the summation of multiple encrypted loss functions, R _y,z Bonus function for agent z +.>

The Q value of the target network corresponding to the z-number intelligent agent; q(s) _z,i ,a _z,i ) The current Q value corresponding to the z-number intelligent agent; η is the learning rate; mu is a reward attenuation coefficient; y is the total number of agents;

s42, the aggregation center synthesizes the loss function

Transmitting to each agent according to the current network and the comprehensive loss function in the corresponding local trained DQN neural network model >

Calculating gradient information;

s43, each agent adds a safety mask on the gradient information, and transmits the gradient information with the safety mask to the aggregation center;

s44, after receiving the gradient information added with the security mask, the aggregation center releases homomorphic encryption of the gradient information, and returns a result after releasing homomorphic encryption to the corresponding intelligent agent;

s45, each agent receives the result after the full homomorphic encryption is released, and removes the subnet mask in the result to obtain gradient information without encryption, and each agent updates the corresponding parameter theta of the current network in the local trained DQN neural network model through the gradient information without encryption _z,t 。

In the step S45, the corresponding local trained parameters θ of the current network in the DQN neural network model are updated _z,t The formula of (2) is:

in the formula (9): f is a loss function, θ _z,t Updated current network parameters for z-number agent, θ _z,t-1 The current network parameters before the z-number agent is updated.

The control behavior in the action space accords with the power characteristic constraint condition and the system balance constraint condition.

The power supply characteristic constraint condition specifically includes:

thermal power generating unit operation constraint:

in the formula (10), the amino acid sequence of the compound,

the upper limit and the lower limit of the output of the ith thermal power unit are respectively set; / >

The climbing rate of the ith thermal power generating unit; />

The output of the ith thermal power unit at the moment t; />

The output of the ith thermal power unit at the t-1 moment;

energy storage device operation constraints:

in the formula (11), the amino acid sequence of the compound,

capacity of the mth energy storage deviceA constraint range; />

The capacity of the mth energy storage device at the t moment; />

The charging power of the mth energy storage device at the t moment; />

A charging power constraint range for the mth energy storage device; />

The discharge power of the mth energy storage device at the t moment; />

And->

The discharge power constraint range of the mth energy storage device; />

The battery capacity of the mth energy storage device at the time t+1; />

The self-discharge efficiency of the mth energy storage device; />

Charging efficiency of the mth energy storage device; />

The discharge efficiency of the mth energy storage device;

running constraint of distributed wind farm units:

in the formula (12), the amino acid sequence of the compound,

the lower limit of the output power of the nth wind driven generator; />

The upper limit of the output power of the nth wind driven generator; />

The output power of the nth wind driven generator at the t moment;

electric automobile group state constraint:

/>

in the formulae (13) to (16),

and->

The SOC constraint range of the jth electric automobile; />

SOC of the j-th electric automobile; />

Outputting a constraint range of power increment for an M-th electric vehicle charging station; / >

The output power increment of the charging station of the M-th electric vehicle; />

A period of time for accessing a charging station for a single electric vehicle; />

The upper limit of the charge and discharge power of the jth electric automobile at the t moment; />

The lower limit of the charging and discharging power of the jth electric automobile at the t moment; />

And->

Is influenced by the number j of vehicles in the charging station, the SOC capacity of a single electric vehicle and the charge and discharge state factors of the single electric vehicle; />

The SOC of the j-th electric automobile at the t moment; />

Rated charging power for the j-th electric automobile;

rated discharge power of the j-th electric automobile; />

The output power increment of the charging pile of the M-th electric automobile at the t moment is set;/>

the upper limit of the output power increment of the charging pile of the M-th electric automobile at the moment t; />

The lower limit of the output power increment of the charging pile of the M-th electric automobile at the moment t; />

The power of the j electric automobile at the t moment.

The system balance constraint condition specifically comprises:

in the formulae (17) to (19),

respectively representing the t moment as the active output of the ith thermal power unit, the mth energy storage device, the nth wind motor group and the jth electric automobile; n (N) _g 、N _b 、N _w 、N _e Respectively representing the number of thermal power units, energy storage equipment, wind power units and electric automobiles; />

Load disturbance at time t is represented; o1 (l) =b is a set of branches with a head-end node being node b; o2 (l) =b is the set of branches with end node being node b; p (P) _l,t And Q _l,t The active power and the reactive power of the branch circuit l at the time t are respectively;r _l and x _l The resistance and reactance of branch l, U ₀ The voltage amplitude of the relaxation node at the time t; u (U) _b,t The voltage amplitude of the node b at the time t; />

Active power of a generator set connected with the node b; />

Reactive power of a generator set connected with the node b; />

Active power for a load connected to node b; />

Reactive power for a load connected to node b; p (P) _l,max And P _l,min The upper limit and the lower limit of the active power of the branch circuit l are respectively; q (Q) _l,max And Q _l,min The upper limit and the lower limit of reactive power of the branch I are respectively; u (U) _b,max And U _b,min The upper and lower limits of the voltage at node b, respectively.

Compared with the prior art, the invention has the beneficial effects that:

1. according to the transmission and distribution frequency modulation resource cooperative control method based on the federal reinforcement learning algorithm, a regional power grid is divided into a main net zone and a plurality of distribution net zones, a piece of intelligent agent is arranged in a dispatching center of each piece of the regional power grid, the corresponding piece of intelligent agent is used for controlling, the combined frequency modulation of a traditional unit represented by the main net side and multiple types of generator sets of distributed power sources on the distribution net side is realized, the frequency modulation instruction distribution of the dispatching unit is realized by utilizing information interaction among the piece of intelligent agent, and the integral optimization of an autonomous region is realized. Therefore, in the design, the main distribution network structure of the regional power grid is fully utilized to divide the areas and set the intelligent bodies, the frequency modulation instruction distribution of the dispatching unit is realized by utilizing the information interaction among the intelligent bodies, and the integral optimization of the autonomous region is realized.

2. According to the transmission and distribution frequency modulation resource cooperative control method based on the federal reinforcement learning algorithm, the problem of cooperation among multiple intelligent agents is solved by using the federal reinforcement learning distributed algorithm, when each DQN neural network is trained, 96 point states in the day of the corresponding distribution network are adopted for offline training, the online decision time is effectively shortened by using the offline training, and the real-time performance of instruction execution is further improved. Therefore, in the design, the online decision time is effectively shortened by utilizing offline training, and the real-time performance of instruction execution is improved.

3. In the method, when adding homomorphic encryption is carried out on information of a local model of a DQN neural network and uploading the encrypted information to an aggregation center, each agent respectively encrypts a corresponding loss function by using a Paillier homomorphic encryption public key to obtain an encrypted loss function and transmits the encrypted loss function to the aggregation center, the aggregation center calculates the comprehensive loss function according to the encrypted loss function transmitted by each agent and transmits the comprehensive loss function to each agent, each agent respectively calculates gradient information of a current network in the local model of the DQN neural network corresponding to the comprehensive loss function, then each agent adds a security mask on the gradient information and transmits the gradient information added with the security mask to the aggregation center, the aggregation center receives the gradient information added with the security mask and then releases the homomorphic encryption of the gradient information and returns a result of the release of the homomorphic encryption to the corresponding agent, each agent updates a current parameter of the DQN neural network through the result of the release of the encryption and transmits the data in the local model of the DQN neural network, and the method is only used for the mutual encryption of the data is prevented from being leaked by the data in the method, and the mutual encryption risk of the data is prevented. Therefore, only the parameters of the model are transmitted and processed in the design, and the model parameters are encrypted by using the fully homomorphic encryption public key, so that the risk of data leakage in transmission and storage is avoided, and the privacy security of the frequency modulation user is ensured.

4. According to the transmission and distribution frequency modulation resource cooperative control method based on the federation reinforcement learning algorithm, an DQN training model is adopted for a neural network training framework of the federation reinforcement learning, and an optimal strategy is obtained by optimizing a state-action pair value function matrix Q (s, a) of iterative computation, so that the sum of expected discount returns is maximum, and the DQN training model can be well integrated into a multi-source cooperative frequency control framework of a main distribution network, and is more suitable for solving the distributed optimization problem of dynamic distribution of APC power. Therefore, the DQN training model is adopted in the design, so that the method is more suitable for solving the distributed optimization problem of dynamic distribution of the APC power.

Drawings

Fig. 1 is a frame diagram of a federal reinforcement learning algorithm-based transmission and distribution frequency modulation resource cooperative control system.

FIG. 2 is a schematic diagram of a federal reinforcement learning framework based on a DQN training network.

Detailed Description

The invention is described in further detail below with reference to the accompanying drawings and detailed description.

Referring to fig. 1 to 2, a cooperative control method for frequency modulation resources of a transmission and distribution system based on a federal reinforcement learning algorithm, the control method comprises the following steps:

the dispatching centers of each patch are provided with an agent, the dispatching center of each patch is provided with an DQN neural network model, the agent positioned in the dispatching center of the same patch corresponds to the DQN neural network model, and the agent only trains, controls and operates the DQN neural network model corresponding to the agent.

S3, each agent respectively uses local data of a corresponding patch to perform local training on a corresponding DQN neural network model, performs addition homomorphic encryption on the information of the DQN neural network after the local training, and uploads the encrypted information to a polymerization center;

after the aggregation center obtains the total frequency modulation instruction, a training task, namely the frequency modulation instruction distribution of each unit, is issued, and after the aggregation center issues the training task, all the agents begin to execute the same task, namely, each agent utilizes the local data of each area to perform local training on the DQN neural network model.

S4, the aggregation center carries out gradient average processing on all the encrypted information, sends the information subjected to gradient average processing to each intelligent agent, receives the information subjected to gradient average processing, carries out subsequent training on the locally trained DQN neural network according to the information subjected to gradient average processing, and obtains a corresponding trained DQN neural network model, and the intelligent agent obtains a frequency modulation instruction of each unit which is subjected to scheduling in a corresponding patch area through the corresponding trained DQN neural network model.

the defining the state space of the agent specifically comprises:

the state space for setting the z-number agent specifically comprises:

the amplitude of the frequency variation is divided into eight intervals:

{[∞,-0.2),[-0.2,-0.15),[-0.15,-0.10),[-0.10,0.003),[0.03,0.10),[0.10,0.15),[0.15,0.2),[0.2,+∞)}；

the state of the z-number agent at the time t is S _z,t ，S _z,t ＝{S ₁ ,S ₂ ,S ₃ ,S ₄ ,S ₅ ,S ₆ ,S ₇ ，S ₈ S, where S ₁ 、S ₈ Respectively representing states corresponding to the minimum value and the maximum value of the total frequency adjustment instruction of the system under a certain disturbance type;

the action space of the definition agent specifically comprises:

in the formula (1):

The active force of the mth energy storage device controlled by the z-number intelligent agent at the time t; / >

the control behavior of the agent needs to meet constraints of the action space. Constraints of the action space include the following two classes: taking the power characteristic constraint of the dynamic response transmission process of the unit into consideration; the system balance constraint of the whole stable operation of the power system is considered, wherein the constraint condition of the system balance constraint mainly considers the difference between the joint frequency modulation of the main distribution network resources and the conventional multi-source collaborative frequency modulation.

The power supply characteristic constraint condition specifically includes:

thermal power generating unit operation constraint:

in the formula (10), the amino acid sequence of the compound,

the upper limit and the lower limit of the output of the ith thermal power unit are respectively set; />

The climbing rate of the ith thermal power generating unit; />

The output of the ith thermal power unit at the moment t; />

The output of the ith thermal power unit at the t-1 moment;

energy storage device operation constraints:

in the formula (11), the amino acid sequence of the compound,

the minimum capacity of the mth energy storage device; />

Maximum capacity of the mth energy storage device; />

The battery capacity of the mth energy storage device at the t moment; />

The charging power of the mth energy storage device at the t moment;

a charging power constraint range for the mth energy storage device; / >

The discharge power of the mth energy storage device at the t moment; />

And->

The discharge power constraint range of the mth energy storage device; />

The battery capacity of the mth energy storage device at the time t+1; />

The self-discharge efficiency of the mth energy storage device; />

Charging efficiency of the mth energy storage device; />

The discharge efficiency of the mth energy storage device;

running constraint of distributed wind farm units:

in the formula (12), the amino acid sequence of the compound,

the lower limit of the output power of the nth wind driven generator; />

The upper limit of the output power of the nth wind driven generator; />

The output power of the nth wind driven generator at the t moment;

electric automobile group state constraint:

/>

in the formulae (13) to (16),

and->

The SOC constraint range of the jth electric automobile; />

SOC of the j-th electric automobile; />

Outputting a constraint range of power increment for an M-th electric vehicle charging station; />

And->

Is influenced by the number j of vehicles in the charging station, the SOC capacity of a single electric vehicle and the charge and discharge state factors of the single electric vehicle; / >

The SOC of the j-th electric automobile at the t moment; />

Rated charging power for the j-th electric automobile;

rated discharge power of the j-th electric automobile; />

The output power increment of the charging pile of the M-th electric automobile at the t moment is set; />

The power of the j electric automobile at the t moment.

The system balance constraint condition specifically comprises:

wherein, the formula (17) is a system power balance constraint, and the formulas (18) and (19) are distribution network operation related constraints;

in the formulae (17) to (19),

Load disturbance at time t is represented; o1 (l) =b is a set of branches with a head-end node being node b; o2 (l) =b is the set of branches with end node being node b; p (P) _l,t And Q _l,t The active power and the reactive power of the branch circuit l at the time t are respectively; r is (r) _l And x _l The resistance and reactance of branch l, U ₀ The voltage amplitude of the relaxation node at the time t; u (U) _b,t The voltage amplitude of the node b at the time t; u (U) _b+1,t The voltage amplitude of the node b at the time t+1; />

Active power of a generator set connected with the node b; />

Reactive power of a generator set connected with the node b; />

Active power for a load connected to node b; />

setting rewards of environment on control behaviors of the z-number intelligent agent, minimizing deviation of the adjustment power command value and the power response value by changing the control behaviors of the z-number intelligent agent, namely, aiming at minimizing deviation of the adjustment power command value and the power response value, and constructing an objective function minF and a rewarding function R of the z-number intelligent agent _z,t ：

The bonus function is:

in the formulae (2) to (3), R _z,t The rewarding function of the z-number intelligent agent at the time t; q is the number of control periods; q is the number of APC units in the patch corresponding to the z-number intelligent agent; i is the ith APC unit in the patch corresponding to the z-number agent; t is the t-th discrete control period; deltaP _i ^G The power command value is adjusted for the input of the ith APC unit in the zone corresponding to the z-number intelligent agent; deltaP _i ^R In the region corresponding to the z-number intelligent agentThe power response value of the ith APC unit;

the cost function for the objective function min F is obtained by discounted cumulative summation:

in the formula (4), the amino acid sequence of the compound,

for z-number intelligent agent in control action a _z A function of making a corresponding reward for the control action; />

To control the behavior a _z All jackpots generated are averaged; gamma ray ^t' ∈[0，1]，γ ^t' Is a discount coefficient; r is R _z，t' Is the accumulation of bonus functions corresponding to a plurality of consecutive behaviors of the z-number agent.

In the step S3, the local training of the DQN neural network model by the z-number agent using the local data of the corresponding patch specifically includes:

s31, initializing a current network parameter theta of a corresponding DQN neural network model by an agent _1,t 、θ _2,t ...θ _z,t And copy a target network with the same structure as the current network

Dividing every 15 minutes in 24 hours into a time period, adding 96 time periods, acquiring state data of a patch corresponding to the z-number intelligent agent in the 96 time periods, namely, the state data of 96 time periods in the day, and training a corresponding DQN neural network model by using the data. The intelligent agent trains the DQN neural network model for a plurality of times, and immediately copies the current network parameters to the target network after training the DQN neural network model once by each intelligent agent, and updates the parameters of the target network.

s321, the z-number intelligent agent acquires state data of 96 time periods in the day of the corresponding power grid sheet area, and selects the state data of one time period in the 96 point states in the day as the current state S of the z-number intelligent agent _t ；

a _t 、/>

Wherein:

equation (5) represents selecting an optimal Q value as a current Q value;

s323, calculating the magnitude r of the rewarding function after executing the control behavior a according to the selected control behavior a and the current network in the DQN neural network model _t And the corresponding Q value is calculated by the following function:

in the formulas (5) and (6), Q(s) _t ,a _t ) The Q value of the current network; maxQ(s) _t+1 ,a _t+1 ) The Q value of the target network; η is the learning rate; m is a reward attenuation coefficient; r is (r) _t ∈{r _1,t ,r _m,t ,...,r _n,t }，{r _1,t ,r _m,t ,...,r _n,t The z number agent rewards function set;

in the formula (7): f (F) _z As a loss function; r is (r) _i,z Is a reward function; q(s) _z,i ,a _z,i ) The Q value of the current network corresponding to the experience sample;

the Q value of the target network corresponding to the experience sample; s is(s) _z,i ,s _z,i+1 ∈S _z,i The state of the current action and the state of the target network action belong to a state space set of the intelligent agent z; a, a _i ,a _z,i+1 ∈A _z,i Indicating that both the current action and the target network action belong to the set of action spaces of agent z.

In the step S3, the addition homomorphic encryption is performed on the information of the DQN neural network after the local training, and the uploading aggregation center of the encrypted information specifically includes:

s34, each agent encrypts the corresponding loss function in the local trained DQN neural network model by using the Paillier full homomorphic encryption public key K to obtain the encrypted loss function

Representing the fully homomorphic encrypted result; / >

S35, each agent encrypts the loss function

To the aggregation center.

In the step S4, the aggregation center performs gradient average processing on all the encrypted information, and sends the information after gradient average processing to each agent specifically includes:

the aggregation center calculates and obtains a comprehensive loss function according to the encrypted loss function sent by each agent

And the comprehensive loss function->

Sending the data to each intelligent agent:

in formula (8):

The Q value of the target network corresponding to the z-number intelligent agent; q(s) _z,i ,a _z,i ) The current network Q value corresponding to the z-number intelligent agent; η is the learning rate; m is a reward attenuation coefficient; y is the total number of agents;

each agent receives the information after gradient average treatment, and carries out subsequent training on the corresponding local trained DQN neural network according to the information after gradient average treatment, and specifically comprises the following steps:

receiving loss function of each agent

And calculating the current network relative comprehensive loss function in the corresponding local trained DQN neural network model>

Gradient information of (2);

each intelligent agent adds a safety mask on the gradient information and transmits the gradient information added with the safety mask to the aggregation center;

The aggregation center receives the gradient information added with the security mask, then releases homomorphic encryption of the gradient information, and returns the result after releasing homomorphic encryption to the corresponding intelligent agent;

each agent receives the result after the full homomorphic encryption is released, and removes the subnet mask in the result to obtain gradient information without encryption, and each agent updates the corresponding parameter theta of the current network in the local trained DQN neural network model through the gradient information without encryption _z,t 。

Updating the corresponding local trained parameters theta of the current network in the DQN neural network model _z,t The formula of (2) is:

Federal reinforcement learning follows a Markov decision process (markov decision process, MDP) with DQN as the training network neural model, and the MDP followed by each agent can be expressed as a tuple (z, s) _t ，a _t ，r，s _t+1 ) Wherein z is the agent number; s is(s) _t When it is the agent tA state of engraving; a, a _t Control actions executed for the time t of the intelligent agent; r is the state s of the intelligent agent _t Executing action a _t A reward obtained later; s is(s) _t+1 For the intelligent agent in state s _t Executing control action a _t And then transitions to the next time state. During federal reinforcement learning, as shown in fig. 2, an initial state is selected randomly, then a control action is selected based on the initial state, after the control action is selected, the agent will execute the control action in the environment, and then the environment will return to the next time state s _t+1 And the prize r obtained, in which case the quadruple (z, s _t ，a _t ，r，s _t+1 ) And storing into an experience pool. The next time state s will then be _t+1 Regarded as the current state s _t Repeating the above steps until the experience pool is full. And then, carrying out gradient average on error gradient functions in Q network training on a plurality of agents through an aggregation center, and returning information after gradient average to follow-up training of each agent by the aggregation center to guide, so that the plurality of agents train the same task and carry out information interaction.

The principle of the invention is explained as follows:

the intelligent agent comprises a main network intelligent agent and a distribution network intelligent agent, wherein the patch intelligent agent corresponding to the main network area is the main network intelligent agent, and the patch intelligent agent corresponding to the distribution network area is the distribution network intelligent agent, wherein the main network intelligent agent has the function of an aggregation center in federal reinforcement learning. As shown in fig. 1, a framework diagram of a coordinated control system of frequency modulation resources of transmission and distribution based on a federal reinforcement learning algorithm divides a regional power grid according to structures of a main network and a plurality of distribution networks, and sets a regional intelligent agent in each main network and a distribution network dispatching center. According to the design, on the basis of the basic structure, main function guidance and connection power supply characteristics of the main distribution network, the cooperative control problem of the frequency modulation resources of transmission and distribution is optimized, in the control transfer from the existing centralized control to the distributed mode, the fact that a calculation and information interaction platform is transferred from an original individual and main network dispatching center to a network side represented by a power distribution network side is considered, communication calculation pressure of the main network serving as a high-level dispatching center during dispatching can be relieved, and active control capability of the distribution network dispatching center under the characteristics of localization and activation of a future active power distribution network is fully exerted. Secondly, because the distributed power supply participates in the frequency modulation unit mostly from enterprises and users, the part of people are extremely sensitive to the problem of privacy safety, the federal reinforcement learning algorithm is utilized to solve the problem of cooperation among multiple intelligent agents, and on the premise of ensuring the privacy safety of the users, the online decision time is shortened by utilizing offline training, so that the requirement of distributed execution on real-time decision is met.

The APC unit is a generator unit which automatically tracks a power dispatching instruction in a specified output adjustment range and adjusts power generation/power utilization power in real time according to a certain adjustment rate so as to meet the requirements of active balance, frequency stability and tie line power control of a power system.

Example 1:

s3, each agent respectively uses local data of a corresponding patch to perform local training on the corresponding DQN neural network model, performs addition homomorphic encryption on the information of the DQN neural network model after the local training, and uploads the encrypted information to the aggregation center;

s4, the aggregation center carries out gradient average processing on all the encrypted information, sends the information subjected to gradient average processing to each intelligent agent, receives the information subjected to gradient average processing, carries out subsequent training on the corresponding local trained DQN neural network model according to the information subjected to gradient average processing, obtains a trained DQN neural network model, and obtains a frequency modulation instruction of each unit which is subjected to scheduling in a corresponding patch area through the trained DQN neural network model.

the state space for setting the z-number agent specifically comprises:

the state of the z-number agent at the time t is S _z,t ；

in the formula (1):

in the formula (4), the amino acid sequence of the compound,

a _t 、/>

Wherein:

in the formulas (5) and (6), Q(s) _t ,a _t ) Is the current Q value; maxQ(s) _t+1 ,a _t+1 ) Is the target Q value; η is the learning rate; mu is a reward attenuation coefficient;

s324, according to the selected control behavior a, acquiring the z-number agent to execute the selectedTaking the next state s returned by the environment after the control action a _t+1 Obtaining an empirical sample (s _t ，a，r _t ，s _t+1 ) And the empirical sample (s _t ，a，r _t ，s _t+1 ) Storing the experience playback pool;

the target Q value corresponding to the experience sample.

s41, the polymerization center is based on eachThe encrypted loss function sent by the intelligent agent is calculated to obtain the comprehensive loss function

In formula (8):

s42, the aggregation center synthesizes the loss function

Transmitting to each intelligent agent, each intelligent agent respectively calculating the relative comprehensive loss function of the current network in the corresponding local trained DQN neural network model>

Gradient information of (2);

s45, each agent receives the result after the full homomorphic encryption is released, and removes the subnet mask in the result to obtain gradient information without encryption, each agent passes throughUpdating the corresponding local trained parameter θ of the current network in the DQN neural network model without encrypted gradient information _z,t 。

Example 2:

example 2 is substantially the same as example 1 except that:

The power supply characteristic constraint condition specifically includes:

thermal power generating unit operation constraint:

in the formula (10), the amino acid sequence of the compound,

The climbing rate of the ith thermal power generating unit; />

The output of the ith thermal power unit at the moment t; />

I-th thermal power machine at t-1 momentThe output of the group;

energy storage device operation constraints:

in the formula (11), the amino acid sequence of the compound,

the capacity constraint range of the mth energy storage device; />

The capacity of the mth energy storage device at the t moment; />

The charging power of the mth energy storage device at the t moment; />

A charging power constraint range for the mth energy storage device; />

The discharge power of the mth energy storage device at the t moment; />

And->

The discharge power constraint range of the mth energy storage device; />

The battery capacity of the mth energy storage device at the time t+1; />

The self-discharge efficiency of the mth energy storage device; / >

Charging efficiency of the mth energy storage device; />

The discharge efficiency of the mth energy storage device;

running constraint of distributed wind farm units:

in the formula (12), the amino acid sequence of the compound,

the lower limit of the output power of the nth wind driven generator; />

The upper limit of the output power of the nth wind driven generator; />

The output power of the nth wind driven generator at the t moment;

electric automobile group state constraint:

in the formulae (13) to (16),

and->

The SOC constraint range of the jth electric automobile; />

SOC of the j-th electric automobile; />

And->

The SOC of the j-th electric automobile at the t moment; />

Rated charging power for the j-th electric automobile;

rated discharge power of the j-th electric automobile; />

The output power increment of the charging pile of the M-th electric automobile at the t moment is set; / >

The power of the j electric automobile at the t moment.

The system balance constraint condition specifically comprises:

in the formulae (17) to (19),

respectively representing the t moment as the active output of the ith thermal power unit, the mth energy storage device, the nth wind motor group and the jth electric automobile; n (N) _g 、N _b 、N _w 、N _e Respectively representing the number of thermal power units, energy storage equipment, wind power units and electric automobiles; p (P) _t ^L Load disturbance at time t is represented; o1 (l) =b is a set of branches with a head-end node being node b; o2 (l) =b is the set of branches with end node being node b; p (P) _l,t And Q _l,t The active power and the reactive power of the branch circuit l at the time t are respectively; r is (r) _l And x _l The resistance and reactance of branch l, U ₀ The voltage amplitude of the relaxation node at the time t; u (U) _b,t The voltage amplitude of the node b at the time t; />

Active power of a generator set connected with the node b; />

Reactive power of a generator set connected with the node b; />

Active power for a load connected to node b; />

The above description is merely of preferred embodiments of the present invention, and the scope of the present invention is not limited to the above embodiments, but all equivalent modifications or variations according to the present disclosure will be within the scope of the claims.

Claims

1. A transmission and distribution frequency modulation resource cooperative control method based on a federal reinforcement learning algorithm is characterized by comprising the following steps of:

the control method comprises the following steps:

2. The federal reinforcement learning algorithm-based transmission and distribution frequency modulation resource cooperative control method according to claim 1, wherein the method is characterized by comprising the following steps of:

the state space for setting the z-number agent specifically comprises:

the state of the z-number agent at the time t is S _z,t ；

set up action space A that z number agent can decision _z All of the z-agentControl actions all from action space A _z Selecting;

in the formula (1):

in the formula (4), the amino acid sequence of the compound,

to control the behavior a _z All jackpots generated are averaged; gamma ray ^t′ ∈[0，1]，γ ^t′ Is a discount coefficient; r is R _z,t' Is the accumulation of bonus functions corresponding to a plurality of consecutive behaviors of the z-number agent.

3. The federal reinforcement learning algorithm-based transmission and distribution frequency modulation resource cooperative control method according to claim 2, wherein the method is characterized by comprising the following steps of:

S31, initializing current network parameters of a corresponding DQN neural network model by a z-number agent, and copying a target network with the same structure as the current network;

4. The method for cooperatively controlling the frequency modulation resources of the transmission and distribution system based on the federal reinforcement learning algorithm according to claim 3, wherein the method comprises the following steps:

a _t 、/>

Wherein:

s323, calculating the magnitude r of the rewarding function after executing the control behavior a according to the selected control behavior a _t And the Q value is calculated by the following function:

in the formulas (5) and (6), Q(s) _t ,a _t ) The Q value of the current network; maxQ(s) _t+1 ,a _t+1 ) The Q value of the target network; η is the learning rate; mu is a reward attenuation coefficient;

/>

is the target network Q value.

5. The federal reinforcement learning algorithm-based transmission and distribution frequency modulation resource cooperative control method according to claim 1, wherein the method is characterized by comprising the following steps of:

in the step S3, homomorphic encryption is performed on the information of the locally trained DQN neural network model, and the uploading aggregation center of the encrypted information specifically includes:

6. The federal reinforcement learning algorithm-based transmission and distribution frequency modulation resource cooperative control method according to claim 5, wherein the method is characterized by comprising the following steps:

In formula (8):

The Q value of the target network corresponding to the z-number intelligent agent; q(s) _z,i ,a _z,i ) The current network Q value corresponding to the z-number intelligent agent; η is the learning rate; mu is a reward attenuation coefficient; y is the total number of agents;

s42, the aggregation center synthesizes the loss function

Calculating gradient information;

7. The federal reinforcement learning algorithm-based transmission and distribution frequency modulation resource cooperative control method according to claim 6, wherein the method is characterized by comprising the following steps:

8. The federal reinforcement learning algorithm-based transmission and distribution frequency modulation resource cooperative control method according to claim 2, wherein the method is characterized by comprising the following steps of:

The control behavior in the action space accords with the power supply characteristic constraint condition and the system balance constraint condition.

9. The federal reinforcement learning algorithm-based transmission and distribution frequency modulation resource cooperative control method according to claim 8, wherein the method is characterized by comprising the following steps:

the power supply characteristic constraint condition specifically includes:

thermal power generating unit operation constraint:

in the formula (10), the amino acid sequence of the compound,

The climbing rate of the ith thermal power generating unit; />

The output of the ith thermal power unit at the moment t; />

The output of the ith thermal power unit at the t-1 moment;

energy storage device operation constraints:

in the formula (11), the amino acid sequence of the compound,

the capacity constraint range of the mth energy storage device; />

The capacity of the mth energy storage device at the t moment; />

The charging power of the mth energy storage device at the t moment; />

A charging power constraint range for the mth energy storage device; />

The discharge power of the mth energy storage device at the t moment; />

And->

The discharge power constraint range of the mth energy storage device; />

The battery capacity of the mth energy storage device at the time t+1; />

The self-discharge efficiency of the mth energy storage device; />

Charging efficiency of the mth energy storage device; />

The discharge efficiency of the mth energy storage device;

running constraint of distributed wind farm units:

In the formula (12), the amino acid sequence of the compound,

the lower limit of the output power of the nth wind driven generator; />

The upper limit of the output power of the nth wind driven generator; />

The output power of the nth wind driven generator at the t moment;

electric automobile group state constraint:

/>

in the formulae (13) to (16),

and->

The SOC constraint range of the jth electric automobile; />

SOC of the j-th electric automobile; />

The upper limit of the charging and discharging power of the jth electric automobile at the t moment; />

And->

The SOC of the j-th electric automobile at the t moment; />

Rated charging power for the j-th electric automobile; />

Rated discharge power of the j-th electric automobile; />

The lower limit of the output power increment of the charging pile of the M-th electric automobile at the moment t; / >

The power of the j electric automobile at the t moment.

10. The federal reinforcement learning algorithm-based transmission and distribution frequency modulation resource cooperative control method according to claim 8, wherein the method is characterized by comprising the following steps:

the system balance constraint condition specifically comprises:

/>

in the formulae (17) to (19),

respectively representing the t moment as the active output of the ith thermal power unit, the mth energy storage device, the nth wind motor group and the jth electric automobile; n (N) _g 、N _b 、N _w 、N _e Respectively representing the number of thermal power units, energy storage equipment, wind power units and electric automobiles; p (P) _t ^L Load disturbance at time t is represented; o1 (l) =b is a set of branches with a head-end node being node b; o2 (l) =b is the set of branches with end node being node b; p (P) _l,t And Q _l,t The active power and the reactive power of the branch circuit l at the time t are respectively; r is (r) _l And x _l Resistors of the branches l, respectivelyAnd reactance, U ₀ The voltage amplitude of the relaxation node at the time t; u (U) _b,t The voltage amplitude of the node b at the time t; />

Active power of a generator set connected with the node b; />

Reactive power of a generator set connected with the node b; />

Active power for a load connected to node b; />

Reactive power for a load connected to node b; p (P) _l,max And P _l,min The upper limit and the lower limit of the active power of the branch circuit l are respectively; q (Q) _l,max And Q _l,min The upper limit and the lower limit of reactive power of the branch I are respectively; u (U) _b,max And U _b,min The upper and lower limits of the voltage at node b, respectively. />