CN116054285A - Transmission and distribution frequency modulation resource cooperative control method based on federal reinforcement learning algorithm
- Publication number: CN116054285A
- Application number: CN202211728739.5A
- Authority: CN (China)
- Legal status: Pending
Classifications
-
- H—ELECTRICITY
- H02—GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
- H02J—CIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
- H02J3/00—Circuit arrangements for ac mains or ac distribution networks
- H02J3/38—Arrangements for parallely feeding a single network by two or more generators, converters or transformers
- H02J3/46—Controlling of the sharing of output between the generators, converters, or transformers
- H02J3/48—Controlling the sharing of the in-phase component
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
- G06F21/6245—Protecting personal data, e.g. for financial or medical purposes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
- G06Q10/063—Operations research, analysis or management
- G06Q10/0631—Resource planning, allocation, distributing or scheduling for enterprises or organisations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/06—Energy or water supply
-
- H—ELECTRICITY
- H02—GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
- H02J—CIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
- H02J3/00—Circuit arrangements for ac mains or ac distribution networks
- H02J3/24—Arrangements for preventing or reducing oscillations of power in networks
- H02J3/241—The oscillation concerning frequency
-
- H—ELECTRICITY
- H02—GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
- H02J—CIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
- H02J3/00—Circuit arrangements for ac mains or ac distribution networks
- H02J3/28—Arrangements for balancing of the load in a network by storage of energy
-
- H—ELECTRICITY
- H02—GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
- H02J—CIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
- H02J2203/00—Indexing scheme relating to details of circuit arrangements for AC mains or AC distribution networks
- H02J2203/10—Power transmission or distribution systems management focussing at grid-level, e.g. load flow analysis, node profile computation, meshed network optimisation, active network management or spinning reserve management
-
- H—ELECTRICITY
- H02—GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
- H02J—CIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
- H02J2203/00—Indexing scheme relating to details of circuit arrangements for AC mains or AC distribution networks
- H02J2203/20—Simulating, e.g. planning, reliability check, modelling or computer assisted design [CAD]
-
- H—ELECTRICITY
- H02—GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
- H02J—CIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
- H02J2300/00—Systems for supplying or distributing electric power characterised by decentralized, dispersed, or local generation
- H02J2300/20—The dispersed energy generation being of renewable origin
- H02J2300/28—The renewable source being wind energy
Abstract
A transmission and distribution frequency modulation resource cooperative control method based on a federal reinforcement learning algorithm comprises the following steps: dividing a regional power grid into a main-grid zone and a plurality of distribution-grid zones; setting an agent in the dispatching center of each zone, and establishing a corresponding DQN neural network model for each agent; each agent locally trains its DQN neural network model on the data of its own zone, encrypts the information of the locally trained model, and uploads the encrypted information to an aggregation center; the aggregation center performs gradient averaging on the encrypted information and returns the result to each agent, each agent continues training its locally trained DQN neural network model with the gradient-averaged information to obtain a trained model, and the frequency modulation instruction of each dispatched unit in the regional power grid is obtained through the trained DQN neural network model. The design ensures the safe and efficient operation of the regional power grid and the privacy of frequency modulation users.
Description
Technical Field
The invention belongs to the field of automatic power generation control of power systems, and particularly relates to a transmission and distribution frequency modulation resource cooperative control method based on a federal reinforcement learning algorithm.
Background
Automatic power control (APC) continuously extends from the traditional generator set to the adjustable load side: it covers the original AGC function of generator sets and also recognizes the frequency regulation capability of flexible resources. Most flexible resources are connected through the distribution network. With the development of communication and related technologies, the distribution network has gradually changed from the original unidirectional power-receiving network into a local power network with self-balancing capability, and the relationship between the main grid and the distribution network has changed from the original master-subordinate attachment into a mutually supporting, bidirectional interaction. Traditional frequency regulation resources such as thermal and hydro power are mostly connected on the main-grid side, whose grid topology is relatively simple compared with the distribution network, while distributed power sources are mostly connected on the distribution-network side; when a distributed source is dispatched to increase or decrease its power, its influence on distribution network operation is not negligible, since the primary function of the distribution network is to provide reliable electric energy to users at all times. In this context, how to combine resources with different power characteristics and operating environments to participate in APC is a challenge in the development of new power systems.
The current closed-loop control of regional-grid APC mainly comprises two processes: 1) collecting the grid frequency deviation and tie-line power deviation, calculating the real-time area control error (ACE), and obtaining the total generation power instruction through a PI controller; 2) distributing this instruction to each APC unit with a suitable power allocation method. At present the total regulation power command is mainly allocated according to the adjustable capacity of each unit, but this strategy cannot meet the optimal control requirement of the system. Meanwhile, traditional centralized control involves a heavy computational burden, concentrated communication and poor reliability, and cannot adapt to the flexible, changeable structure of an active distribution network, so the control mode of the system is gradually shifting from centralized to distributed control. However, distributed control with an intelligent agent installed on each frequency regulation unit struggles to achieve overall optimization of an autonomous area, because distributed power sources are highly dispersed. In addition, the growing number of distributed power sources has increased the number of participating stakeholders, and multi-stakeholder privacy is under threat. Given these problems of distributed control, a flexible optimal power allocation strategy is needed to ensure the safe and efficient operation of regional power grids and the privacy of frequency modulation users.
Disclosure of Invention
The invention aims to overcome the problems that existing control methods can hardly meet the optimal control requirement of the system, cannot adapt to the flexible and changeable structure of an active distribution network, and leave the privacy of frequency modulation users under threat. It provides a transmission and distribution frequency modulation resource cooperative control method based on a federal reinforcement learning algorithm which meets the optimal control requirement of the system, adapts to the flexible and changeable structure of the active distribution network, and ensures the safe and efficient operation of regional power grids and the privacy of frequency modulation users.
In order to achieve the above object, the technical solution of the present invention is:
A transmission and distribution frequency modulation resource cooperative control method based on a federal reinforcement learning algorithm comprises the following steps:
S1, dividing a regional power grid into a main-grid zone and a plurality of distribution-grid zones;
S2, setting an agent in the dispatching center of each zone, and establishing a corresponding DQN neural network model for each agent;
S3, each agent locally trains its corresponding DQN neural network model with the local data of its zone, performs additively homomorphic encryption on the information of the locally trained DQN neural network model, and uploads the encrypted information to the aggregation center;
S4, the aggregation center performs gradient averaging on all the encrypted information and sends the gradient-averaged information to each agent; each agent receives the gradient-averaged information, continues training its locally trained DQN neural network model accordingly to obtain a trained DQN neural network model, and obtains the frequency modulation instruction of each dispatched unit through the trained model.
In step S3, when each agent locally trains its DQN neural network model with the local data of its zone, the state space, action space and reward function of each agent are set according to the Markov decision process;

setting the state space of agent z specifically comprises:

the magnitude of the total frequency adjustment instruction, which determines the total frequency response deviation in the frequency allocation process, is taken as the state space of agent z;

the state of agent z at time t is $S_{z,t}$;

setting the action space of agent z specifically comprises:

setting the action space $A_z$ over which agent z can decide, all control behaviors of agent z being selected from the action space $A_z$;

the control behavior $a_{z,t}$ of agent z at time t can be expressed as:

$$a_{z,t}=\left\{\Delta P^{G}_{o,t},\ \Delta P^{B}_{m,t},\ \Delta P^{W}_{n,t},\ \Delta P^{E}_{j,t}\right\}\tag{1}$$

in formula (1): $\Delta P^{G}_{o,t}$ is the active output at time t of the o-th thermal power unit controlled by agent z; $\Delta P^{B}_{m,t}$ is the active output at time t of the m-th energy storage device controlled by agent z; $\Delta P^{W}_{n,t}$ is the active output at time t of the n-th wind turbine unit controlled by agent z; $\Delta P^{E}_{j,t}$ is the active output at time t of the j-th electric vehicle group controlled by agent z;

setting the reward function of agent z specifically comprises:

setting the reward given by the environment for the control behavior of agent z, with the goal of minimizing the deviation between the adjustment power command values and the power response values, and constructing the reward function of agent z:

$$\min F=\sum_{t=1}^{T}\sum_{i=1}^{q}\left|\Delta P_{i}^{G}-\Delta P_{i}^{R}\right|\tag{2}$$

$$R_{z,t}=-\sum_{i=1}^{q}\left|\Delta P_{i}^{G}-\Delta P_{i}^{R}\right|\tag{3}$$

in formulas (2) to (3): $R_{z,t}$ is the reward function of agent z at time t; $T$ is the number of control periods; $q$ is the number of APC units in the zone corresponding to agent z; $i$ is the i-th APC unit in the zone corresponding to agent z; $t$ is the t-th discrete control period; $\Delta P_{i}^{G}$ is the adjustment power command value input to the i-th APC unit in the zone corresponding to agent z; $\Delta P_{i}^{R}$ is the power response value of the i-th APC unit in the zone corresponding to agent z;

the value function for the objective function $\min F$ is obtained by discounted cumulative summation:

$$Q^{\pi}\!\left(S_{z,t},a_{z}\right)=\mathbb{E}\!\left[\sum_{t'=0}^{\infty}\gamma^{t'}R_{z,t'}\right]\tag{4}$$

in formula (4): $Q^{\pi}(S_{z,t},a_{z})$ is the expectation over all cumulative rewards generated by control behavior $a_z$; $\gamma^{t'}\in[0,1]$, where $\gamma$ is the discount coefficient; $\sum_{t'}\gamma^{t'}R_{z,t'}$ is the accumulation of the reward functions corresponding to a number of consecutive behaviors of agent z.
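The reward of formula (3) and the discounted summation of formula (4) can be checked with a minimal numeric sketch; the values, the zone size and the 0.95 discount coefficient below are invented for illustration:

```python
import numpy as np

def reward(cmd: np.ndarray, resp: np.ndarray) -> float:
    # R_{z,t} of formula (3): negative accumulated deviation between the
    # adjustment power command values and the power response values
    return -float(np.sum(np.abs(cmd - resp)))

def discounted_return(rewards, gamma: float = 0.95) -> float:
    # discounted cumulative summation of formula (4), evaluated backwards
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# toy usage: three control periods, two APC units in the zone
cmds  = [np.array([1.0, 0.5]), np.array([0.8, 0.4]), np.array([0.2, 0.1])]
resps = [np.array([0.9, 0.5]), np.array([0.7, 0.5]), np.array([0.2, 0.1])]
print(discounted_return([reward(c, r) for c, r in zip(cmds, resps)]))  # ~ -0.29
```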
In step S3, the local training of the corresponding DQN neural network model by agent z with the local data of its zone specifically comprises:
S31, agent z initializes the current network parameters of its DQN neural network model and copies a target network with the same structure;
S32, agent z trains the DQN neural network model with the state data of the 96 intra-day time periods of its zone and updates the parameters of the target network.
In step S32, the training of the DQN neural network model by agent z with the state data of the 96 intra-day time periods of its zone comprises:

S321, selecting the state data of one time period from the state data of the 96 intra-day time periods as the current state $s_t$ of agent z;

S322, based on the current state $s_t$ of agent z, performing trial and error with an ε-greedy strategy, that is, selecting the control behavior $a_t$ by a random strategy with probability ε, and selecting the currently optimal control behavior with probability 1-ε:

$$a_{t}=\arg\max_{a}Q\!\left(s_{t},a\right)\tag{5}$$

S323, according to the selected control behavior $a_t$ and the current network in the DQN neural network model, calculating the value $r_t$ of the reward function after executing $a_t$, and updating the Q value by the following function:

$$Q\!\left(s_{t},a_{t}\right)\leftarrow Q\!\left(s_{t},a_{t}\right)+\eta\left[r_{t}+\mu\max_{a_{t+1}}Q\!\left(s_{t+1},a_{t+1}\right)-Q\!\left(s_{t},a_{t}\right)\right]\tag{6}$$

in formulas (5) and (6): $Q(s_t,a_t)$ is the current Q value; $\max Q(s_{t+1},a_{t+1})$ is the target Q value; $\eta$ is the learning rate; $\mu$ is the reward attenuation coefficient;

S324, after agent z executes the selected control behavior $a_t$, acquiring the next state $s_{t+1}$ returned by the environment, obtaining an experience sample $(s_t,a_t,r_t,s_{t+1})$ and storing it in the experience replay pool;

S325, updating the current state of agent z to the next state returned by the environment, and repeating steps S322 to S324 until the experience replay pool is full;

S326, after the experience replay pool is full, extracting Ω experience samples from it for calculation and updating the loss function:

$$F_{z}=\frac{1}{\Omega}\sum_{i=1}^{\Omega}\left[r_{i,z}+\mu\max_{a_{z,i+1}}Q'\!\left(s_{z,i+1},a_{z,i+1}\right)-Q\!\left(s_{z,i},a_{z,i}\right)\right]^{2}\tag{7}$$

in formula (7): $F_z$ is the loss function; $r_{i,z}$ is the reward function of agent z; $Q(s_{z,i},a_{z,i})$ is the Q value of the current network; $Q'(s_{z,i+1},a_{z,i+1})$ is the target Q value corresponding to the experience sample.
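Steps S321 to S326 amount to a standard DQN training loop. A condensed PyTorch sketch follows; the network sizes, hyperparameter values and the random toy transitions are illustrative assumptions, not the patent's settings:

```python
import random
from collections import deque
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS = 8, 5               # assumed: 8 one-hot frequency states, 5 discrete dispatch actions
EPS, ETA, MU, OMEGA = 0.1, 1e-3, 0.9, 32  # epsilon, learning rate, reward attenuation, batch size

q_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
target_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
target_net.load_state_dict(q_net.state_dict())     # S31: copy a target network with the same structure
opt = torch.optim.SGD(q_net.parameters(), lr=ETA)
replay = deque(maxlen=960)                         # experience replay pool

def select_action(s: torch.Tensor) -> int:         # S322: epsilon-greedy trial and error, formula (5)
    if random.random() < EPS:
        return random.randrange(N_ACTIONS)
    with torch.no_grad():
        return int(q_net(s).argmax())

# S321/S324/S325: walk through states and fill the pool (random toy transitions here)
for _ in range(96):
    s = torch.eye(STATE_DIM)[random.randrange(STATE_DIM)]
    a = select_action(s)
    r = -random.random()                           # stand-in for the deviation reward of formula (3)
    s2 = torch.eye(STATE_DIM)[random.randrange(STATE_DIM)]
    replay.append((s, a, r, s2))

# S326: draw Omega samples and minimise the squared TD error of formula (7)
batch = random.sample(list(replay), OMEGA)
s  = torch.stack([b[0] for b in batch])
a  = torch.tensor([b[1] for b in batch])
r  = torch.tensor([b[2] for b in batch])
s2 = torch.stack([b[3] for b in batch])
q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)        # Q(s_{z,i}, a_{z,i}) of the current network
with torch.no_grad():
    q_target = r + MU * target_net(s2).max(1).values     # r + mu * max Q' of the target network
loss = ((q_target - q) ** 2).mean()
opt.zero_grad(); loss.backward(); opt.step()
target_net.load_state_dict(q_net.state_dict())           # S32: refresh the target-network parameters
```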
In step S3, performing additively homomorphic encryption on the information of the locally trained DQN neural network model and uploading the encrypted information to the aggregation center specifically comprises:
S34, each agent encrypts the corresponding loss function in its locally trained DQN neural network model with the Paillier additively homomorphic encryption public key K to obtain an encrypted loss function;
s35, each agent transmits the encrypted loss function to the aggregation center.
In step S4, the aggregation center performs gradient averaging on all the encrypted information and sends the gradient-averaged information to each agent; each agent receives the gradient-averaged information and continues training its locally trained DQN neural network model accordingly. This specifically comprises:

S41, the aggregation center calculates a comprehensive loss function $[[F]]$ from the encrypted loss functions sent by the agents:

$$[[F]]=\frac{1}{Y}\sum_{z=1}^{Y}\left[\left[\left(R_{y,z}+\mu\max_{a_{z,i+1}}Q'\!\left(s_{z,i+1},a_{z,i+1}\right)-Q\!\left(s_{z,i},a_{z,i}\right)\right)^{2}\right]\right]\tag{8}$$

in formula (8): $[[\,\cdot\,]]$ denotes the summation of the encrypted loss functions; $R_{y,z}$ is the reward function of agent z; $Q'(s_{z,i+1},a_{z,i+1})$ is the Q value of the target network corresponding to agent z; $Q(s_{z,i},a_{z,i})$ is the current Q value corresponding to agent z; $\eta$ is the learning rate; $\mu$ is the reward attenuation coefficient; $Y$ is the total number of agents;

S42, the aggregation center transmits the comprehensive loss function $[[F]]$ to each agent, and each agent calculates gradient information according to the current network in its locally trained DQN neural network model and the comprehensive loss function $[[F]]$;

S43, each agent adds a security mask to the gradient information and transmits the masked gradient information to the aggregation center;

S44, after receiving the masked gradient information, the aggregation center removes the homomorphic encryption of the gradient information and returns the decrypted result to the corresponding agent;

S45, each agent receives the decrypted result, removes the security mask from it to obtain unencrypted gradient information, and updates the parameters $\theta_{z,t}$ of the current network in its locally trained DQN neural network model with this gradient information.
In step S45, the formula for updating the parameters $\theta_{z,t}$ of the current network in the locally trained DQN neural network model is:

$$\theta_{z,t}=\theta_{z,t-1}-\eta\frac{\partial F}{\partial\theta_{z,t-1}}\tag{9}$$

in formula (9): $F$ is the loss function; $\theta_{z,t}$ are the updated current network parameters of agent z; $\theta_{z,t-1}$ are the current network parameters of agent z before the update.
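A plaintext stand-in for the masked-gradient exchange of S43 to S45 and the update of formula (9); numpy random masks replace the homomorphic layer purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_gradient(grad: np.ndarray):
    # S43: the agent adds a random security mask before sending the gradient
    mask = rng.normal(size=grad.shape)
    return grad + mask, mask

def update(theta: np.ndarray, grad: np.ndarray, eta: float = 1e-3) -> np.ndarray:
    # formula (9): theta_{z,t} = theta_{z,t-1} - eta * dF/dtheta
    return theta - eta * grad

theta = rng.normal(size=4)              # current network parameters theta_{z,t-1}
grad = rng.normal(size=4)               # stand-in for the gradient of [[F]] w.r.t. theta
masked, mask = mask_gradient(grad)
returned = masked                       # S44: the center strips the encryption; the mask still hides the value
theta = update(theta, returned - mask)  # S45: the agent removes its own mask, then updates
print(theta)
```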
The control behaviors in the action space must satisfy the power-source characteristic constraints and the system balance constraints.
The power-source characteristic constraints specifically include:
thermal power unit operation constraints:

$$\begin{cases}P^{G}_{i,\min}\le P^{G}_{i,t}\le P^{G}_{i,\max}\\ \left|P^{G}_{i,t}-P^{G}_{i,t-1}\right|\le R^{G}_{i}\end{cases}\tag{10}$$

in formula (10): $P^{G}_{i,\max}$ and $P^{G}_{i,\min}$ are the upper and lower output limits of the i-th thermal power unit; $R^{G}_{i}$ is the ramping rate of the i-th thermal power unit; $P^{G}_{i,t}$ is the output of the i-th thermal power unit at time t; $P^{G}_{i,t-1}$ is the output of the i-th thermal power unit at time t-1;
energy storage device operation constraints:

$$\begin{cases}E_{m,\min}\le E_{m,t}\le E_{m,\max}\\ P^{c}_{m,\min}\le P^{c}_{m,t}\le P^{c}_{m,\max}\\ P^{d}_{m,\min}\le P^{d}_{m,t}\le P^{d}_{m,\max}\\ E_{m,t+1}=\left(1-\sigma_{m}\right)E_{m,t}+\left(\eta^{c}_{m}P^{c}_{m,t}-P^{d}_{m,t}/\eta^{d}_{m}\right)\Delta t\end{cases}\tag{11}$$

in formula (11): $E_{m,\min}$ and $E_{m,\max}$ bound the capacity constraint range of the m-th energy storage device; $E_{m,t}$ is the capacity of the m-th energy storage device at time t; $P^{c}_{m,t}$ is its charging power at time t, with $P^{c}_{m,\min}$ and $P^{c}_{m,\max}$ the charging power constraint range; $P^{d}_{m,t}$ is its discharging power at time t, with $P^{d}_{m,\min}$ and $P^{d}_{m,\max}$ the discharging power constraint range; $E_{m,t+1}$ is the battery capacity of the m-th energy storage device at time t+1; $\sigma_{m}$ is its self-discharge rate; $\eta^{c}_{m}$ is its charging efficiency; $\eta^{d}_{m}$ is its discharging efficiency;
distributed wind farm unit operation constraints:

$$P^{W}_{n,\min}\le P^{W}_{n,t}\le P^{W}_{n,\max}\tag{12}$$

in formula (12): $P^{W}_{n,\min}$ is the lower limit of the output power of the n-th wind turbine; $P^{W}_{n,\max}$ is the upper limit of the output power of the n-th wind turbine; $P^{W}_{n,t}$ is the output power of the n-th wind turbine at time t;
electric vehicle group state constraints:

$$S^{E}_{j,\min}\le S^{E}_{j,t}\le S^{E}_{j,\max}\tag{13}$$

$$P^{E}_{j,\min,t}\le P^{E}_{j,t}\le P^{E}_{j,\max,t}\tag{14}$$

$$\Delta P^{M}_{\min,t}\le\Delta P^{M}_{t}\le\Delta P^{M}_{\max,t}\tag{15}$$

$$\Delta P^{M}_{t}=\sum_{j}P^{E}_{j,t},\quad t\in T^{E}_{j}\tag{16}$$

in formulas (13) to (16): $S^{E}_{j,\min}$ and $S^{E}_{j,\max}$ bound the SOC constraint range of the j-th electric vehicle; $S^{E}_{j,t}$ is the SOC of the j-th electric vehicle at time t; $\Delta P^{M}_{\min,t}$ and $\Delta P^{M}_{\max,t}$ bound the output power increment constraint range of the M-th electric vehicle charging station at time t; $\Delta P^{M}_{t}$ is the output power increment of the M-th electric vehicle charging station at time t; $T^{E}_{j}$ is the period during which a single electric vehicle is connected to the charging station; $P^{E}_{j,\max,t}$ and $P^{E}_{j,\min,t}$ are the upper and lower limits of the charging and discharging power of the j-th electric vehicle at time t, influenced by the number of vehicles j in the charging station, the SOC capacity of each single electric vehicle and its charge/discharge state; $P^{Ec}_{j}$ is the rated charging power of the j-th electric vehicle; $P^{Ed}_{j}$ is the rated discharging power of the j-th electric vehicle; $P^{E}_{j,t}$ is the power of the j-th electric vehicle at time t.
The system balance constraints specifically comprise:

$$\sum_{i=1}^{N_{g}}\Delta P^{G}_{i,t}+\sum_{m=1}^{N_{b}}\Delta P^{B}_{m,t}+\sum_{n=1}^{N_{w}}\Delta P^{W}_{n,t}+\sum_{j=1}^{N_{e}}\Delta P^{E}_{j,t}=\Delta P^{L}_{t}\tag{17}$$

$$\begin{cases}\displaystyle\sum_{O1(l)=b}P_{l,t}-\sum_{O2(l)=b}P_{l,t}=P^{G}_{b,t}-P^{L}_{b,t}\\ \displaystyle\sum_{O1(l)=b}Q_{l,t}-\sum_{O2(l)=b}Q_{l,t}=Q^{G}_{b,t}-Q^{L}_{b,t}\\ U_{b+1,t}=U_{b,t}-\dfrac{r_{l}P_{l,t}+x_{l}Q_{l,t}}{U_{0}}\end{cases}\tag{18}$$

$$\begin{cases}P_{l,\min}\le P_{l,t}\le P_{l,\max}\\ Q_{l,\min}\le Q_{l,t}\le Q_{l,\max}\\ U_{b,\min}\le U_{b,t}\le U_{b,\max}\end{cases}\tag{19}$$

in formulas (17) to (19): $\Delta P^{G}_{i,t}$, $\Delta P^{B}_{m,t}$, $\Delta P^{W}_{n,t}$ and $\Delta P^{E}_{j,t}$ are the active outputs at time t of the i-th thermal power unit, the m-th energy storage device, the n-th wind turbine unit and the j-th electric vehicle, respectively; $N_g$, $N_b$, $N_w$ and $N_e$ are the numbers of thermal power units, energy storage devices, wind turbine units and electric vehicles; $\Delta P^{L}_{t}$ is the load disturbance at time t; $O1(l)=b$ is the set of branches whose head node is node b; $O2(l)=b$ is the set of branches whose end node is node b; $P_{l,t}$ and $Q_{l,t}$ are the active and reactive power of branch l at time t; $r_l$ and $x_l$ are the resistance and reactance of branch l; $U_0$ is the voltage magnitude of the slack node at time t; $U_{b,t}$ is the voltage magnitude of node b at time t; $P^{G}_{b,t}$ and $Q^{G}_{b,t}$ are the active and reactive power of the generator set connected to node b; $P^{L}_{b,t}$ and $Q^{L}_{b,t}$ are the active and reactive power of the load connected to node b; $P_{l,\max}$ and $P_{l,\min}$ are the upper and lower limits of the active power of branch l; $Q_{l,\max}$ and $Q_{l,\min}$ are the upper and lower limits of the reactive power of branch l; $U_{b,\max}$ and $U_{b,\min}$ are the upper and lower voltage limits of node b.
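The box, ramping and capacity constraints above translate directly into a feasibility projection that an agent can apply before issuing a dispatch command. A small sketch, assuming formulas (10) and (11) take the standard forms reconstructed above:

```python
def feasible_thermal(p_prev: float, p_cmd: float,
                     p_min: float, p_max: float, ramp: float) -> float:
    # project a thermal-unit command onto the output box and
    # ramping limits of formula (10)
    lo = max(p_min, p_prev - ramp)
    hi = min(p_max, p_prev + ramp)
    return min(max(p_cmd, lo), hi)

def storage_next(e: float, p_ch: float, p_dis: float, dt: float = 0.25,
                 sigma: float = 0.001, eta_c: float = 0.95, eta_d: float = 0.95) -> float:
    # capacity recursion of formula (11): self-discharge plus charging
    # and discharging with efficiencies, over one 15-minute period
    return (1 - sigma) * e + (eta_c * p_ch - p_dis / eta_d) * dt

print(feasible_thermal(p_prev=100.0, p_cmd=160.0,
                       p_min=50.0, p_max=200.0, ramp=30.0))  # 130.0, ramp-limited
```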
Compared with the prior art, the invention has the beneficial effects that:
1. In the transmission and distribution frequency modulation resource cooperative control method based on the federal reinforcement learning algorithm, the regional power grid is divided into a main-grid zone and a plurality of distribution-grid zones, a zone agent is set in the dispatching center of each zone and controls its corresponding zone, joint frequency regulation of the traditional units represented on the main-grid side and the multiple types of distributed generator sets on the distribution-network side is realized, the frequency modulation instructions of the dispatched units are allocated through information interaction among the zone agents, and overall optimization of the autonomous region is achieved. The design thus makes full use of the main/distribution network structure of the regional power grid to divide zones and place agents, and uses the information interaction among the agents to allocate the frequency modulation instructions of the dispatched units and to realize overall optimization of the autonomous region.
2. In the method, the distributed federal reinforcement learning algorithm solves the cooperation problem among multiple agents; when each DQN neural network is trained, the 96 intra-day point states of the corresponding distribution network are used for offline training, which effectively shortens the online decision time and thereby improves the real-time performance of instruction execution.
3. In the method, when additively homomorphic encryption is performed on the information of the local DQN neural network model and the encrypted information is uploaded to the aggregation center, each agent encrypts its corresponding loss function with the Paillier homomorphic encryption public key and transmits the encrypted loss function to the aggregation center; the aggregation center calculates the comprehensive loss function from the encrypted loss functions sent by the agents and sends it to each agent; each agent calculates the gradient information of the current network in its local DQN neural network model with respect to the comprehensive loss function, adds a security mask to the gradient information, and transmits the masked gradient information to the aggregation center; the aggregation center removes the homomorphic encryption from the received masked gradient information and returns the result to the corresponding agent; and each agent updates the current network parameters of its DQN neural network model with the decrypted, unmasked gradient information. Throughout this process only model parameters are transmitted and processed, and they are encrypted with the homomorphic public key, so the risk of data leakage during transmission and storage is avoided and the privacy of frequency modulation users is guaranteed.
4. In the method, the DQN training model is adopted as the neural network training framework of federal reinforcement learning, and the optimal strategy is obtained by iteratively optimizing the state-action value function matrix Q(s, a) so that the sum of expected discounted returns is maximized; the DQN training model therefore integrates well into the multi-source cooperative frequency control framework of the main/distribution network and is well suited to the distributed optimization problem of dynamic APC power allocation.
Drawings
Fig. 1 is a framework diagram of the transmission and distribution frequency modulation resource cooperative control system based on the federal reinforcement learning algorithm.
FIG. 2 is a schematic diagram of a federal reinforcement learning framework based on a DQN training network.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings and detailed description.
Referring to figs. 1 and 2, a transmission and distribution frequency modulation resource cooperative control method based on a federal reinforcement learning algorithm comprises the following steps:
S1, dividing a regional power grid into a main-grid zone and a plurality of distribution-grid zones;
S2, setting an agent in the dispatching center of each zone, and establishing a corresponding DQN neural network model for each agent;
an agent is set in the dispatching center of each zone and a DQN neural network model is established there; the agent located in the dispatching center of a given zone corresponds to that DQN neural network model, and each agent only trains, controls and operates its own corresponding DQN neural network model;
S3, each agent locally trains its corresponding DQN neural network model with the local data of its zone, performs additively homomorphic encryption on the information of the locally trained DQN neural network, and uploads the encrypted information to the aggregation center;
after the aggregation center obtains the total frequency modulation instruction, it issues a training task, namely the allocation of the frequency modulation instructions to the units; once the task is issued, all agents begin to execute the same task, that is, each agent locally trains its DQN neural network model with the local data of its own zone;
S4, the aggregation center performs gradient averaging on all the encrypted information and sends the gradient-averaged information to each agent; each agent receives the gradient-averaged information, continues training its locally trained DQN neural network accordingly to obtain a trained DQN neural network model, and obtains through it the frequency modulation instruction of each dispatched unit in its zone.
In step S3, when each agent locally trains its DQN neural network model with the local data of its zone, the state space, action space and reward function of each agent are set according to the Markov decision process;

setting the state space of agent z specifically comprises:

the magnitude of the total frequency adjustment instruction, which determines the total frequency response deviation in the frequency allocation process, is taken as the state space of agent z;

the amplitude of the frequency variation is divided into eight intervals:

{(-∞, -0.2), [-0.2, -0.15), [-0.15, -0.10), [-0.10, 0.03), [0.03, 0.10), [0.10, 0.15), [0.15, 0.2), [0.2, +∞)};

the state of agent z at time t is $S_{z,t}$, with $S_{z,t}\in\{S_1,S_2,S_3,S_4,S_5,S_6,S_7,S_8\}$, where $S_1$ and $S_8$ respectively denote the states corresponding to the minimum and maximum values of the total frequency adjustment instruction of the system under a given disturbance type;

setting the action space of agent z specifically comprises:

setting the action space $A_z$ over which agent z can decide, all control behaviors of agent z being selected from the action space $A_z$;

the control behavior $a_{z,t}$ of agent z at time t can be expressed as:

$$a_{z,t}=\left\{\Delta P^{G}_{o,t},\ \Delta P^{B}_{m,t},\ \Delta P^{W}_{n,t},\ \Delta P^{E}_{j,t}\right\}\tag{1}$$

in formula (1): $\Delta P^{G}_{o,t}$ is the active output at time t of the o-th thermal power unit controlled by agent z; $\Delta P^{B}_{m,t}$ is the active output at time t of the m-th energy storage device controlled by agent z; $\Delta P^{W}_{n,t}$ is the active output at time t of the n-th wind turbine unit controlled by agent z; $\Delta P^{E}_{j,t}$ is the active output at time t of the j-th electric vehicle group controlled by agent z;
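A sketch of the state lookup implied by the eight-interval discretization above (with the interval edges as corrected; the 1-based state indices are an assumption):

```python
import bisect

# breakpoints of the eight frequency-deviation intervals
EDGES = [-0.2, -0.15, -0.10, 0.03, 0.10, 0.15, 0.2]

def state_index(delta_f: float) -> int:
    # map the total frequency-adjustment deviation onto S_1 ... S_8
    # (left-closed intervals, matching the listing above)
    return bisect.bisect_right(EDGES, delta_f) + 1

assert state_index(-0.3) == 1 and state_index(0.5) == 8
print(state_index(-0.12))  # 3, i.e. S_3 for the interval [-0.15, -0.10)
```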
the control behavior of the agent needs to meet constraints of the action space. Constraints of the action space include the following two classes: taking the power characteristic constraint of the dynamic response transmission process of the unit into consideration; the system balance constraint of the whole stable operation of the power system is considered, wherein the constraint condition of the system balance constraint mainly considers the difference between the joint frequency modulation of the main distribution network resources and the conventional multi-source collaborative frequency modulation.
The power supply characteristic constraint condition specifically includes:
thermal power unit operation constraints:

$$\begin{cases}P^{G}_{i,\min}\le P^{G}_{i,t}\le P^{G}_{i,\max}\\ \left|P^{G}_{i,t}-P^{G}_{i,t-1}\right|\le R^{G}_{i}\end{cases}\tag{10}$$

in formula (10): $P^{G}_{i,\max}$ and $P^{G}_{i,\min}$ are the upper and lower output limits of the i-th thermal power unit; $R^{G}_{i}$ is the ramping rate of the i-th thermal power unit; $P^{G}_{i,t}$ is the output of the i-th thermal power unit at time t; $P^{G}_{i,t-1}$ is the output of the i-th thermal power unit at time t-1;
energy storage device operation constraints:

$$\begin{cases}E_{m,\min}\le E_{m,t}\le E_{m,\max}\\ P^{c}_{m,\min}\le P^{c}_{m,t}\le P^{c}_{m,\max}\\ P^{d}_{m,\min}\le P^{d}_{m,t}\le P^{d}_{m,\max}\\ E_{m,t+1}=\left(1-\sigma_{m}\right)E_{m,t}+\left(\eta^{c}_{m}P^{c}_{m,t}-P^{d}_{m,t}/\eta^{d}_{m}\right)\Delta t\end{cases}\tag{11}$$

in formula (11): $E_{m,\min}$ is the minimum capacity of the m-th energy storage device; $E_{m,\max}$ is the maximum capacity of the m-th energy storage device; $E_{m,t}$ is the battery capacity of the m-th energy storage device at time t; $P^{c}_{m,t}$ is its charging power at time t, with $P^{c}_{m,\min}$ and $P^{c}_{m,\max}$ the charging power constraint range; $P^{d}_{m,t}$ is its discharging power at time t, with $P^{d}_{m,\min}$ and $P^{d}_{m,\max}$ the discharging power constraint range; $E_{m,t+1}$ is the battery capacity of the m-th energy storage device at time t+1; $\sigma_{m}$ is its self-discharge rate; $\eta^{c}_{m}$ is its charging efficiency; $\eta^{d}_{m}$ is its discharging efficiency;
distributed wind farm unit operation constraints:

$$P^{W}_{n,\min}\le P^{W}_{n,t}\le P^{W}_{n,\max}\tag{12}$$

in formula (12): $P^{W}_{n,\min}$ is the lower limit of the output power of the n-th wind turbine; $P^{W}_{n,\max}$ is the upper limit of the output power of the n-th wind turbine; $P^{W}_{n,t}$ is the output power of the n-th wind turbine at time t;
electric vehicle group state constraints:

$$S^{E}_{j,\min}\le S^{E}_{j,t}\le S^{E}_{j,\max}\tag{13}$$

$$P^{E}_{j,\min,t}\le P^{E}_{j,t}\le P^{E}_{j,\max,t}\tag{14}$$

$$\Delta P^{M}_{\min,t}\le\Delta P^{M}_{t}\le\Delta P^{M}_{\max,t}\tag{15}$$

$$\Delta P^{M}_{t}=\sum_{j}P^{E}_{j,t},\quad t\in T^{E}_{j}\tag{16}$$

in formulas (13) to (16): $S^{E}_{j,\min}$ and $S^{E}_{j,\max}$ bound the SOC constraint range of the j-th electric vehicle; $S^{E}_{j,t}$ is the SOC of the j-th electric vehicle at time t; $\Delta P^{M}_{\min,t}$ and $\Delta P^{M}_{\max,t}$ bound the output power increment constraint range of the M-th electric vehicle charging station at time t; $\Delta P^{M}_{t}$ is the output power increment of the M-th electric vehicle charging station at time t; $T^{E}_{j}$ is the period during which a single electric vehicle is connected to the charging station; $P^{E}_{j,\max,t}$ and $P^{E}_{j,\min,t}$ are the upper and lower limits of the charging and discharging power of the j-th electric vehicle at time t, influenced by the number of vehicles j in the charging station, the SOC capacity of each single electric vehicle and its charge/discharge state; $P^{Ec}_{j}$ is the rated charging power of the j-th electric vehicle; $P^{Ed}_{j}$ is the rated discharging power of the j-th electric vehicle; $P^{E}_{j,t}$ is the power of the j-th electric vehicle at time t.
The system balance constraints specifically comprise:

$$\sum_{i=1}^{N_{g}}\Delta P^{G}_{i,t}+\sum_{m=1}^{N_{b}}\Delta P^{B}_{m,t}+\sum_{n=1}^{N_{w}}\Delta P^{W}_{n,t}+\sum_{j=1}^{N_{e}}\Delta P^{E}_{j,t}=\Delta P^{L}_{t}\tag{17}$$

$$\begin{cases}\displaystyle\sum_{O1(l)=b}P_{l,t}-\sum_{O2(l)=b}P_{l,t}=P^{G}_{b,t}-P^{L}_{b,t}\\ \displaystyle\sum_{O1(l)=b}Q_{l,t}-\sum_{O2(l)=b}Q_{l,t}=Q^{G}_{b,t}-Q^{L}_{b,t}\\ U_{b+1,t}=U_{b,t}-\dfrac{r_{l}P_{l,t}+x_{l}Q_{l,t}}{U_{0}}\end{cases}\tag{18}$$

$$\begin{cases}P_{l,\min}\le P_{l,t}\le P_{l,\max}\\ Q_{l,\min}\le Q_{l,t}\le Q_{l,\max}\\ U_{b,\min}\le U_{b,t}\le U_{b,\max}\end{cases}\tag{19}$$

wherein formula (17) is the system power balance constraint, and formulas (18) and (19) are the distribution network operation constraints;

in formulas (17) to (19): $\Delta P^{G}_{i,t}$, $\Delta P^{B}_{m,t}$, $\Delta P^{W}_{n,t}$ and $\Delta P^{E}_{j,t}$ are the active outputs at time t of the i-th thermal power unit, the m-th energy storage device, the n-th wind turbine unit and the j-th electric vehicle, respectively; $N_g$, $N_b$, $N_w$ and $N_e$ are the numbers of thermal power units, energy storage devices, wind turbine units and electric vehicles; $\Delta P^{L}_{t}$ is the load disturbance at time t; $O1(l)=b$ is the set of branches whose head node is node b; $O2(l)=b$ is the set of branches whose end node is node b; $P_{l,t}$ and $Q_{l,t}$ are the active and reactive power of branch l at time t; $r_l$ and $x_l$ are the resistance and reactance of branch l; $U_0$ is the voltage magnitude of the slack node at time t; $U_{b,t}$ is the voltage magnitude of node b at time t; $U_{b+1,t}$ is the voltage magnitude of node b+1 at time t; $P^{G}_{b,t}$ and $Q^{G}_{b,t}$ are the active and reactive power of the generator set connected to node b; $P^{L}_{b,t}$ and $Q^{L}_{b,t}$ are the active and reactive power of the load connected to node b; $P_{l,\max}$ and $P_{l,\min}$ are the upper and lower limits of the active power of branch l; $Q_{l,\max}$ and $Q_{l,\min}$ are the upper and lower limits of the reactive power of branch l; $U_{b,\max}$ and $U_{b,\min}$ are the upper and lower voltage limits of node b.

Setting the reward function of agent z specifically comprises:

setting the reward given by the environment for the control behavior of agent z, and minimizing the deviation between the adjustment power command value and the power response value by changing the control behavior of agent z, that is, taking the minimization of this deviation as the goal, and constructing the objective function $\min F$ and the reward function $R_{z,t}$ of agent z:

$$\min F=\sum_{t=1}^{T}\sum_{i=1}^{q}\left|\Delta P_{i}^{G}-\Delta P_{i}^{R}\right|\tag{2}$$

the reward function is:

$$R_{z,t}=-\sum_{i=1}^{q}\left|\Delta P_{i}^{G}-\Delta P_{i}^{R}\right|\tag{3}$$

in formulas (2) to (3): $R_{z,t}$ is the reward function of agent z at time t; $T$ is the number of control periods; $q$ is the number of APC units in the zone corresponding to agent z; $i$ is the i-th APC unit in the zone corresponding to agent z; $t$ is the t-th discrete control period; $\Delta P_{i}^{G}$ is the adjustment power command value input to the i-th APC unit in the zone corresponding to agent z; $\Delta P_{i}^{R}$ is the power response value of the i-th APC unit in the zone corresponding to agent z;

the value function for the objective function $\min F$ is obtained by discounted cumulative summation:

$$Q^{\pi}\!\left(S_{z,t},a_{z}\right)=\mathbb{E}\!\left[\sum_{t'=0}^{\infty}\gamma^{t'}R_{z,t'}\right]\tag{4}$$

in formula (4): $Q^{\pi}(S_{z,t},a_{z})$ is the function by which agent z receives the corresponding reward for control behavior $a_z$, i.e., the expectation over all cumulative rewards generated by $a_z$; $\gamma^{t'}\in[0,1]$, where $\gamma$ is the discount coefficient; $\sum_{t'}\gamma^{t'}R_{z,t'}$ is the accumulation of the reward functions corresponding to a number of consecutive behaviors of agent z.
In step S3, the local training of the DQN neural network model by agent z with the local data of its zone specifically comprises:
S31, the agents initialize the current network parameters $\theta_{1,t},\theta_{2,t},\ldots,\theta_{z,t}$ of their DQN neural network models, and each copies a target network with the same structure as its current network;
S32, agent z trains the DQN neural network model with the state data of the 96 intra-day time periods of its zone and updates the parameters of the target network.
Every 15 minutes of the 24-hour day is taken as one time period, 96 time periods in total; the state data of the zone corresponding to agent z over these 96 periods, i.e., the intra-day 96-period state data, are acquired and used to train the corresponding DQN neural network model. The agents train the DQN neural network model many times, and each time an agent finishes one training round it immediately copies the current network parameters to the target network, thereby updating the target network parameters.
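A small helper showing the assumed 96-period split of the dispatch day into 15-minute control periods:

```python
from datetime import datetime, timedelta

def day_periods(day: datetime, minutes: int = 15):
    # split one dispatch day into 96 discrete control periods of 15 minutes
    return [day + timedelta(minutes=minutes * k) for k in range(24 * 60 // minutes)]

periods = day_periods(datetime(2022, 12, 30))
assert len(periods) == 96
print(periods[0], periods[-1])  # 00:00 and 23:45 of the same day
```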
In step S32, the training of the DQN neural network model by agent z with the state data of the 96 intra-day time periods of its zone comprises:

S321, agent z acquires the state data of the 96 intra-day time periods of its grid zone, and selects the state data of one period among the 96 intra-day point states as its current state $s_t$;

S322, based on the current state $s_t$ of agent z, trial and error is performed with an ε-greedy strategy, that is, the control behavior $a_t$ is selected by a random strategy with probability ε, and the currently optimal control behavior is selected with probability 1-ε:

$$a_{t}=\arg\max_{a}Q\!\left(s_{t},a\right)\tag{5}$$

formula (5) represents selecting the behavior with the optimal Q value as the current control behavior;

S323, according to the selected control behavior $a_t$ and the current network in the DQN neural network model, the value $r_t$ of the reward function after executing $a_t$ is calculated, and the corresponding Q value is updated by the following function:

$$Q\!\left(s_{t},a_{t}\right)\leftarrow Q\!\left(s_{t},a_{t}\right)+\eta\left[r_{t}+\mu\max_{a_{t+1}}Q\!\left(s_{t+1},a_{t+1}\right)-Q\!\left(s_{t},a_{t}\right)\right]\tag{6}$$

in formulas (5) and (6): $Q(s_t,a_t)$ is the Q value of the current network; $\max Q(s_{t+1},a_{t+1})$ is the Q value of the target network; $\eta$ is the learning rate; $\mu$ is the reward attenuation coefficient; $r_t\in\{r_{1,t},\ldots,r_{m,t},\ldots,r_{n,t}\}$, the set of reward functions of agent z;

S324, after agent z executes the selected control behavior $a_t$, the next state $s_{t+1}$ returned by the environment is acquired, yielding an experience sample $(s_t,a_t,r_t,s_{t+1})$ which is stored in the experience replay pool;

S325, the current state of agent z is updated to the next state returned by the environment, and steps S322 to S324 are repeated until the experience replay pool is full;

S326, after the experience replay pool is full, Ω experience samples are extracted from it for calculation and the loss function is updated:

$$F_{z}=\frac{1}{\Omega}\sum_{i=1}^{\Omega}\left[r_{i,z}+\mu\max_{a_{z,i+1}}Q'\!\left(s_{z,i+1},a_{z,i+1}\right)-Q\!\left(s_{z,i},a_{z,i}\right)\right]^{2}\tag{7}$$

in formula (7): $F_z$ is the loss function; $r_{i,z}$ is the reward function; $Q(s_{z,i},a_{z,i})$ is the Q value of the current network corresponding to the experience sample; $Q'(s_{z,i+1},a_{z,i+1})$ is the Q value of the target network corresponding to the experience sample; $s_{z,i},s_{z,i+1}\in S_{z,i}$ indicates that the state of the current action and the state of the target network action both belong to the state space set of agent z; $a_{z,i},a_{z,i+1}\in A_{z,i}$ indicates that the current action and the target network action both belong to the action space set of agent z.
In step S3, performing additively homomorphic encryption on the information of the locally trained DQN neural network and uploading the encrypted information to the aggregation center specifically comprises:
S34, each agent encrypts the corresponding loss function in its locally trained DQN neural network model with the Paillier additively homomorphic encryption public key K to obtain the encrypted loss function $[[F_z]]$, where $[[\,\cdot\,]]$ denotes the homomorphically encrypted result;
In step S4, the aggregation center performing gradient averaging on all the encrypted information and sending the gradient-averaged information to each agent specifically comprises:

the aggregation center calculates a comprehensive loss function $[[F]]$ from the encrypted loss functions sent by the agents and sends the comprehensive loss function $[[F]]$ to each agent:

$$[[F]]=\frac{1}{Y}\sum_{z=1}^{Y}\left[\left[\left(R_{y,z}+\mu\max_{a_{z,i+1}}Q'\!\left(s_{z,i+1},a_{z,i+1}\right)-Q\!\left(s_{z,i},a_{z,i}\right)\right)^{2}\right]\right]\tag{8}$$

in formula (8): $[[\,\cdot\,]]$ denotes the summation of the encrypted loss functions; $R_{y,z}$ is the reward function of agent z; $Q'(s_{z,i+1},a_{z,i+1})$ is the Q value of the target network corresponding to agent z; $Q(s_{z,i},a_{z,i})$ is the current network Q value corresponding to agent z; $\eta$ is the learning rate; $\mu$ is the reward attenuation coefficient; $Y$ is the total number of agents;

each agent receiving the gradient-averaged information and continuing the training of its locally trained DQN neural network accordingly specifically comprises:

each agent receives the comprehensive loss function $[[F]]$ and calculates the gradient information of the current network in its locally trained DQN neural network model with respect to $[[F]]$;

each agent adds a security mask to the gradient information and transmits the masked gradient information to the aggregation center;

the aggregation center receives the masked gradient information, removes its homomorphic encryption, and returns the decrypted result to the corresponding agent;

each agent receives the decrypted result and removes the security mask from it to obtain unencrypted gradient information, and each agent updates the parameters $\theta_{z,t}$ of the current network in its locally trained DQN neural network model with this gradient information.
The formula for updating the parameters $\theta_{z,t}$ of the current network in the locally trained DQN neural network model is:

$$\theta_{z,t}=\theta_{z,t-1}-\eta\frac{\partial F}{\partial\theta_{z,t-1}}\tag{9}$$

in formula (9): $F$ is the loss function; $\theta_{z,t}$ are the updated current network parameters of agent z; $\theta_{z,t-1}$ are the current network parameters of agent z before the update.
Federal reinforcement learning follows a Markov decision process (MDP) with DQN as the training network model, and the MDP followed by each agent can be expressed as a tuple $(z,s_t,a_t,r,s_{t+1})$, where z is the agent number; $s_t$ is the state of the agent at time t; $a_t$ is the control action executed by the agent at time t; $r$ is the reward obtained after the agent executes action $a_t$ in state $s_t$; and $s_{t+1}$ is the next-time state to which the agent transitions after executing control action $a_t$ in state $s_t$. During federal reinforcement learning, as shown in fig. 2, an initial state is selected randomly and a control action is then selected based on it; after the control action is selected, the agent executes it in the environment, and the environment returns the next-time state $s_{t+1}$ and the obtained reward r, at which point the tuple $(z,s_t,a_t,r,s_{t+1})$ is stored in the experience pool. The next-time state $s_{t+1}$ is then regarded as the current state $s_t$, and the above steps are repeated until the experience pool is full. The aggregation center then performs gradient averaging over the error gradient functions of the Q-network training of the multiple agents and returns the gradient-averaged information to guide the subsequent training of each agent, so that the multiple agents train on the same task and exchange information.
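The round structure described above reduces to local training, upload, averaging at the center, and an identical aggregate update at every agent. A toy plaintext sketch of that loop with scalar "networks" and invented data (the real scheme averages encrypted DQN loss gradients):

```python
import numpy as np

class ZoneAgent:
    # toy zone agent: one scalar parameter theta fitted to local (x, y) samples
    def __init__(self, theta: float, x, y):
        self.theta, self.x, self.y = theta, np.array(x, float), np.array(y, float)

    def grad(self) -> float:
        # gradient of the local squared loss, a stand-in for dF_z/dtheta
        return float(np.mean(2 * (self.theta * self.x - self.y) * self.x))

    def step(self, g: float, eta: float = 0.05) -> None:
        self.theta -= eta * g            # formula (9) with the aggregate gradient

agents = [ZoneAgent(0.0, [1, 2], [2, 4]), ZoneAgent(0.0, [1, 3], [2, 6])]
for _ in range(200):                     # repeated federated rounds
    g_mean = float(np.mean([a.grad() for a in agents]))  # the center's gradient averaging
    for a in agents:                     # every agent applies the same aggregate update
        a.step(g_mean)
print([round(a.theta, 3) for a in agents])  # both converge to the shared optimum 2.0
```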
The principle of the invention is explained as follows:
The agents comprise a main network agent and distribution network agents: the zone agent corresponding to the main-grid zone is the main network agent, and the zone agents corresponding to the distribution-grid zones are the distribution network agents, the main network agent also serving as the aggregation center in federal reinforcement learning. As shown in fig. 1, the framework of the transmission and distribution frequency modulation resource cooperative control system based on the federal reinforcement learning algorithm divides the regional power grid according to the structure of the main network and the several distribution networks, and sets a zone agent in each main network and distribution network dispatching center. On the basis of the basic structure, primary functions and connected power-source characteristics of the main/distribution networks, the design optimizes the cooperative control of transmission and distribution frequency modulation resources. In the transition from the existing centralized control to the distributed mode, the computation and information interaction platform moves from the original individual main-network dispatching center to the network side represented by the distribution network side; this relieves the communication and computation pressure on the main network as the high-level dispatching center and fully exploits the active control capability of the distribution network dispatching center under the localized and activated characteristics of the future active distribution network. Secondly, since the distributed power sources participating as frequency regulation units mostly belong to enterprises and end users who are extremely sensitive to privacy, the federal reinforcement learning algorithm is used to solve the cooperation problem among multiple agents; on the premise of guaranteeing user privacy, offline training shortens the online decision time and meets the real-time decision requirement of distributed execution.
An APC unit is a generator unit that automatically tracks power dispatching instructions within a specified output adjustment range and adjusts its generation/consumption power in real time at a given ramp rate, so as to meet the active power balance, frequency stability, and tie-line power control requirements of the power system.
Example 1:
A transmission and distribution frequency modulation resource cooperative control method based on a federal reinforcement learning algorithm comprises the following steps:
S1, dividing a regional power grid into a main network zone and a plurality of distribution network zones;
S2, setting an agent in the dispatching center of each zone, and establishing a corresponding DQN neural network model for each agent;
S3, each agent locally trains its corresponding DQN neural network model using the local data of its zone, applies additively homomorphic encryption to the information of the locally trained DQN neural network model, and uploads the encrypted information to the aggregation center;
S4, the aggregation center performs gradient averaging on all the encrypted information and sends the gradient-averaged information to each agent; each agent receives the gradient-averaged information and continues training its locally trained DQN neural network model accordingly to obtain a trained DQN neural network model, through which the frequency modulation instruction of each dispatched unit in the corresponding zone is obtained.
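To make the S1-S4 workflow concrete, here is a hedged Python sketch of one federated training round; the `Agent` and `AggregationCenter` objects and their method names are illustrative assumptions, not an interface defined by the patent.

```python
def federated_round(agents, center):
    """One round of steps S3-S4: local training, encrypted upload,
    gradient averaging at the center, then guided local updates."""
    encrypted_msgs = []
    for agent in agents:                       # S3: per-zone local training
        agent.train_locally()                  # uses only the zone's local data
        encrypted_msgs.append(agent.encrypt_model_info())  # homomorphic encryption
    averaged = center.gradient_average(encrypted_msgs)     # S4: aggregation
    for agent in agents:
        agent.continue_training(averaged)      # subsequent training guided by the average
```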
In step S3, when each agent locally trains its DQN neural network model using the local data of its zone, the state space, action space, and reward function of each agent are set according to the Markov decision process;
setting the state space of agent z specifically comprises:
taking the size of the total frequency adjustment instruction, determined by the total frequency response deviation in the frequency allocation process, as the state space of agent z;
the state of agent z at time t is $S_{z,t}$;
setting the action space of agent z specifically comprises:
setting the action space $A_z$ over which agent z can make decisions; all control actions of agent z are selected from the action space $A_z$;
the control action $a_{z,t}$ of agent z at time t can be expressed as:

$a_{z,t} = \left( P^{G}_{o,t},\ P^{B}_{m,t},\ P^{W}_{n,t},\ P^{EV}_{j,t} \right)$ (1)

In formula (1): $P^{G}_{o,t}$ is the active output at time t of the o-th thermal power unit controlled by agent z; $P^{B}_{m,t}$ is the active output at time t of the m-th energy storage device controlled by agent z; $P^{W}_{n,t}$ is the active output at time t of the n-th wind turbine unit controlled by agent z; $P^{EV}_{j,t}$ is the active output at time t of the j-th electric vehicle group controlled by agent z;
setting the reward function of agent z specifically comprises:
setting the reward given by the environment for the control actions of agent z, with the goal of minimizing the deviation between the adjustment power instruction value and the power response value, and constructing the reward function of agent z:

$\min F = \sum_{t=1}^{Q} \sum_{i=1}^{q} \left( \Delta P^{G}_{i} - \Delta P^{R}_{i} \right)^{2}$ (2)

$R_{z,t} = -\sum_{i=1}^{q} \left( \Delta P^{G}_{i} - \Delta P^{R}_{i} \right)^{2}$ (3)

In formulas (2)-(3): $R_{z,t}$ is the reward function of agent z at time t; Q is the number of control periods; q is the number of APC units in the zone corresponding to agent z; i indexes the i-th APC unit in the zone corresponding to agent z; t indexes the t-th discrete control period; $\Delta P^{G}_{i}$ is the input adjustment power instruction value of the i-th APC unit in the zone corresponding to agent z; $\Delta P^{R}_{i}$ is the power response value of the i-th APC unit in the zone corresponding to agent z;
the value function for the objective function min F is obtained by discounted cumulative summation:

$V_{z} = \mathbb{E}\left[ \sum_{t'} \gamma^{t'} R_{z,t'} \right]$ (4)

In formula (4): $\mathbb{E}[\cdot]$ averages all cumulative rewards generated by the control actions $a_z$; $\gamma^{t'} \in [0,1]$ is the discount coefficient; $\sum_{t'} \gamma^{t'} R_{z,t'}$ is the accumulation of the reward functions corresponding to a plurality of consecutive actions of agent z.
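As a concrete illustration, the following Python sketch computes a per-step reward from instruction/response deviations in the spirit of formulas (2)-(3); the squared-deviation form and the array inputs are assumptions made for illustration.

```python
import numpy as np

def reward(delta_p_cmd: np.ndarray, delta_p_resp: np.ndarray) -> float:
    """Negative squared deviation between the adjustment power instruction
    values and the power response values of the q APC units in the zone."""
    return -float(np.sum((delta_p_cmd - delta_p_resp) ** 2))

# usage: three APC units, MW deviations (illustrative numbers)
r_t = reward(np.array([1.0, -0.5, 0.2]), np.array([0.8, -0.4, 0.1]))
```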
In step S3, agent z locally training its DQN neural network model using the local data of its zone specifically comprises:
S31, agent z initializes the current-network parameters of its DQN neural network model and copies a target network with the same structure;
S32, agent z trains the DQN neural network model with the state data of the 96 time periods within a day of its zone, and updates the parameters of the target network.
In step S32, agent z training the DQN neural network model with the state data of the 96 time periods within a day of its zone comprises:
S321, selecting the state data of one time period from the state data of the 96 time periods within the day as the current state $s_t$ of agent z;
S322, based on the current state $s_t$ of agent z, performing trial-and-error with an $\varepsilon$-greedy strategy, i.e., selecting the control action $a_t$ with a random strategy with probability $\varepsilon$, and selecting the currently optimal control action with probability $1-\varepsilon$:

$a_t = \begin{cases} \text{a random action in } A_z, & \text{with probability } \varepsilon \\ \arg\max_{a} Q(s_t, a), & \text{with probability } 1-\varepsilon \end{cases}$ (5)

S323, according to the selected control action a and the current network in the DQN neural network model, calculating the reward $r_t$ obtained after executing control action a, with the Q value updated by the following function:

$Q(s_t, a_t) = Q(s_t, a_t) + \eta \left[ r_t + \mu \max Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]$ (6)

In formulas (5) and (6): $Q(s_t, a_t)$ is the current Q value; $\max Q(s_{t+1}, a_{t+1})$ is the target Q value; $\eta$ is the learning rate; $\mu$ is the reward attenuation coefficient;
S324, according to the selected control action a, obtaining the next state $s_{t+1}$ returned by the environment after agent z executes the selected control action a, obtaining an experience sample $(s_t, a, r_t, s_{t+1})$, and storing the experience sample $(s_t, a, r_t, s_{t+1})$ into the experience replay pool;
S325, updating the current state of agent z to the next state returned by the environment, and repeating steps S322-S324 until the experience replay pool is full;
S326, after the experience replay pool is full, extracting $\omega$ experience samples from the experience replay pool for calculation, and updating the loss function:

$F_z = \frac{1}{\omega} \sum_{i=1}^{\omega} \left[ r_{i,z} + \mu \max_{a'} Q'(s_{z,i+1}, a') - Q(s_{z,i}, a_{z,i}) \right]^{2}$ (7)

In formula (7): $F_z$ is the loss function; $r_{i,z}$ is the reward of agent z for sample i; $Q(s_{z,i}, a_{z,i})$ is the Q value of the current network; $Q'(s_{z,i+1}, a')$ is the target-network Q value corresponding to the experience sample.
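A compact Python sketch of this local training loop (steps S322 and S326) follows, using PyTorch for the Q networks; the batch size, the exploration and attenuation values, and the shape of the experience tuples are illustrative assumptions, not values fixed by the patent.

```python
import random
import torch
import torch.nn as nn

MU = 0.9          # reward attenuation coefficient (assumed value)
EPSILON = 0.1     # exploration probability (assumed value)

def select_action(q_net, s_t, n_actions):
    """S322: epsilon-greedy trial-and-error over the discrete action space."""
    if random.random() < EPSILON:
        return random.randrange(n_actions)            # random strategy
    with torch.no_grad():
        return int(q_net(s_t).argmax())               # currently optimal action

def update_from_pool(q_net, target_net, optimizer, pool, omega=32):
    """S326: draw omega experience samples and minimise the loss of formula (7)."""
    s, a, r, s_next = zip(*random.sample(pool, omega))
    s, s_next = torch.stack(s), torch.stack(s_next)
    a = torch.tensor(a)                               # action indices
    r = torch.tensor(r, dtype=torch.float32)
    target = r + MU * target_net(s_next).max(dim=1).values.detach()
    current = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(current, target)    # F_z of formula (7)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```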
In step S3, applying additively homomorphic encryption to the information of the locally trained DQN neural network model and uploading the encrypted information to the aggregation center specifically comprises:
S34, each agent encrypts the corresponding loss function of its locally trained DQN neural network model using the Paillier additively homomorphic encryption public key K to obtain an encrypted loss function;
S35, each agent transmits the encrypted loss function to the aggregation center.
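As an illustration of S34-S35, the following sketch uses the open-source `phe` (python-paillier) library, which implements Paillier additively homomorphic encryption; the patent does not name a specific library, so this choice and the loss values are assumptions.

```python
from phe import paillier

# Key K generated once and shared with all agents (simplified key setup)
public_key, private_key = paillier.generate_paillier_keypair(n_length=2048)

# Each agent encrypts its local loss F_z before uploading (S34-S35)
local_losses = [0.42, 0.37, 0.51]                     # illustrative F_z values
encrypted = [public_key.encrypt(f) for f in local_losses]

# Paillier is additively homomorphic: ciphertexts can be summed directly,
# so the aggregation center can combine losses without seeing them.
encrypted_sum = sum(encrypted[1:], encrypted[0])
assert abs(private_key.decrypt(encrypted_sum) - sum(local_losses)) < 1e-9
```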
In step S4, the aggregation center performing gradient averaging on all the encrypted information and sending the gradient-averaged information to each agent, and each agent receiving the gradient-averaged information and continuing training its locally trained DQN neural network model accordingly, specifically comprises:
S41, the aggregation center calculates a comprehensive loss function $[[F]]$ from the encrypted loss functions sent by the agents:

$[[F]] = \sum_{z=1}^{Y} [[F_z]] = \sum_{z=1}^{Y} \left[\left[ \frac{1}{\omega} \sum_{i=1}^{\omega} \left( R_{y,z} + \mu \max_{a'} Q'(s_{z,i+1}, a') - Q(s_{z,i}, a_{z,i}) \right)^{2} \right]\right]$ (8)

In formula (8): $[[\cdot]]$ denotes the summation of the multiple encrypted loss functions; $R_{y,z}$ is the reward function of agent z; $Q'(s_{z,i+1}, a')$ is the target-network Q value corresponding to agent z; $Q(s_{z,i}, a_{z,i})$ is the current-network Q value corresponding to agent z; $\eta$ is the learning rate; $\mu$ is the reward attenuation coefficient; Y is the total number of agents;
S42, the aggregation center transmits the comprehensive loss function $[[F]]$ to each agent, and each agent calculates the gradient information of the current network in its locally trained DQN neural network model with respect to the comprehensive loss function $[[F]]$;
S43, each agent adds a security mask to the gradient information and transmits the masked gradient information to the aggregation center;
S44, after receiving the masked gradient information, the aggregation center removes the homomorphic encryption of the gradient information and returns the decrypted result to the corresponding agent;
S45, each agent receives the decrypted result and removes the security mask from the result to obtain unencrypted gradient information, and each agent updates the parameters $\theta_{z,t}$ of the current network in its locally trained DQN neural network model using the unencrypted gradient information.
In step S45, the formula for updating the parameters $\theta_{z,t}$ of the current network in the locally trained DQN neural network model is:

$\theta_{z,t} = \theta_{z,t-1} - \eta\, \nabla_{\theta} F$ (9)

In formula (9): F is the loss function; $\theta_{z,t}$ are the updated current-network parameters of agent z; $\theta_{z,t-1}$ are the current-network parameters of agent z before the update; $\eta$ is the learning rate.
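The following Python sketch illustrates the S41-S45 exchange in plaintext form for readability; a real deployment would keep the Paillier ciphertexts and masks of the previous steps, so the direct summation and the mask handling here are simplifying assumptions.

```python
import numpy as np

def aggregate_losses(uploaded_losses):
    """S41: the aggregation center sums the (here: plaintext stand-in)
    per-agent losses into the comprehensive loss of formula (8)."""
    return sum(uploaded_losses)

def masked_gradient(grad: np.ndarray, rng: np.random.Generator):
    """S43: add a random security mask before sending gradients upstream."""
    mask = rng.normal(size=grad.shape)
    return grad + mask, mask                 # agent keeps the mask to remove later

def update_parameters(theta_prev: np.ndarray, grad: np.ndarray, eta: float):
    """S45: gradient-descent parameter update of formula (9)."""
    return theta_prev - eta * grad
```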
Example 2:
Example 2 is substantially the same as Example 1 except that:
the control actions in the action space conform to the power supply characteristic constraints and the system balance constraints.
The power supply characteristic constraints specifically include:
thermal power unit operation constraints:

$P^{G}_{i,\min} \le P^{G}_{i,t} \le P^{G}_{i,\max}, \quad \left| P^{G}_{i,t} - P^{G}_{i,t-1} \right| \le r^{G}_{i}$ (10)

In formula (10): $P^{G}_{i,\max}$ and $P^{G}_{i,\min}$ are the upper and lower output limits of the i-th thermal power unit; $r^{G}_{i}$ is the ramp rate of the i-th thermal power unit; $P^{G}_{i,t}$ is the output of the i-th thermal power unit at time t; $P^{G}_{i,t-1}$ is the output of the i-th thermal power unit at time t-1;
energy storage device operation constraints:

$E_{m,\min} \le E_{m,t} \le E_{m,\max}, \quad 0 \le P^{ch}_{m,t} \le P^{ch}_{m,\max}, \quad P^{dis}_{m,\min} \le P^{dis}_{m,t} \le P^{dis}_{m,\max}, \quad E_{m,t+1} = (1-\sigma_{m}) E_{m,t} + \left( \eta^{ch}_{m} P^{ch}_{m,t} - P^{dis}_{m,t} / \eta^{dis}_{m} \right) \Delta t$ (11)

In formula (11): $E_{m,\min}$ and $E_{m,\max}$ define the capacity constraint range of the m-th energy storage device; $E_{m,t}$ is the capacity of the m-th energy storage device at time t; $P^{ch}_{m,t}$ is the charging power of the m-th energy storage device at time t; $P^{ch}_{m,\max}$ defines the charging power constraint range of the m-th energy storage device; $P^{dis}_{m,t}$ is the discharging power of the m-th energy storage device at time t; $P^{dis}_{m,\min}$ and $P^{dis}_{m,\max}$ define the discharging power constraint range of the m-th energy storage device; $E_{m,t+1}$ is the battery capacity of the m-th energy storage device at time t+1; $\sigma_{m}$ is the self-discharge rate of the m-th energy storage device; $\eta^{ch}_{m}$ is the charging efficiency of the m-th energy storage device; $\eta^{dis}_{m}$ is the discharging efficiency of the m-th energy storage device;
distributed wind farm unit operation constraints:

$P^{W}_{n,\min} \le P^{W}_{n,t} \le P^{W}_{n,\max}$ (12)

In formula (12): $P^{W}_{n,\min}$ is the lower output power limit of the n-th wind turbine; $P^{W}_{n,\max}$ is the upper output power limit of the n-th wind turbine; $P^{W}_{n,t}$ is the output power of the n-th wind turbine at time t;
electric vehicle group state constraints:

$SOC_{j,\min} \le SOC_{j,t} \le SOC_{j,\max}$ (13)

$P^{EV}_{j,t,\min} \le P^{EV}_{j,t} \le P^{EV}_{j,t,\max}$ (14)

$\Delta P^{M}_{t,\min} \le \Delta P^{M}_{t} \le \Delta P^{M}_{t,\max}$ (15)

$SOC_{j,t+1} = SOC_{j,t} + P^{EV}_{j,t}\, \Delta t^{EV} / E^{EV}_{j}$ (16)

In formulas (13)-(16): $SOC_{j,\min}$ and $SOC_{j,\max}$ define the SOC constraint range of the j-th electric vehicle; $SOC_{j,t}$ is the SOC of the j-th electric vehicle at time t; $\Delta P^{M}_{t,\min}$ and $\Delta P^{M}_{t,\max}$ define the output power increment constraint range of the M-th electric vehicle charging station at time t; $\Delta P^{M}_{t}$ is the output power increment of the M-th electric vehicle charging station at time t; $\Delta t^{EV}$ is the time period for which a single electric vehicle is connected to the charging station; $P^{EV}_{j,t,\max}$ and $P^{EV}_{j,t,\min}$ are the upper and lower charging/discharging power limits of the j-th electric vehicle at time t, which are influenced by the number of vehicles j in the charging station, the SOC capacity of a single electric vehicle, and the charge/discharge state of the single electric vehicle; $E^{EV}_{j}$ is the battery capacity of the j-th electric vehicle; $P^{EV,ch}_{j}$ is the rated charging power of the j-th electric vehicle; $P^{EV,dis}_{j}$ is the rated discharging power of the j-th electric vehicle; $P^{EV}_{j,t}$ is the power of the j-th electric vehicle at time t.
The system balance constraints specifically include:
power balance constraint:

$\sum_{i=1}^{N_g} P^{G}_{i,t} + \sum_{m=1}^{N_b} P^{B}_{m,t} + \sum_{n=1}^{N_w} P^{W}_{n,t} + \sum_{j=1}^{N_e} P^{EV}_{j,t} = P^{L}_{t}$ (17)

branch power flow and node voltage constraints:

$\sum_{l \in O2(b)} P_{l,t} - \sum_{l \in O1(b)} P_{l,t} + P^{G}_{b,t} - P^{L}_{b,t} = 0, \quad \sum_{l \in O2(b)} Q_{l,t} - \sum_{l \in O1(b)} Q_{l,t} + Q^{G}_{b,t} - Q^{L}_{b,t} = 0, \quad U_{b,t} = U_{0} - \dfrac{r_{l} P_{l,t} + x_{l} Q_{l,t}}{U_{0}}$ (18)

operating limit constraints:

$P_{l,\min} \le P_{l,t} \le P_{l,\max}, \quad Q_{l,\min} \le Q_{l,t} \le Q_{l,\max}, \quad U_{b,\min} \le U_{b,t} \le U_{b,\max}$ (19)

In formulas (17)-(19): $P^{G}_{i,t}$, $P^{B}_{m,t}$, $P^{W}_{n,t}$, $P^{EV}_{j,t}$ respectively denote the active outputs at time t of the i-th thermal power unit, the m-th energy storage device, the n-th wind turbine unit, and the j-th electric vehicle; $N_g$, $N_b$, $N_w$, $N_e$ respectively denote the numbers of thermal power units, energy storage devices, wind turbine units, and electric vehicles; $P^{L}_{t}$ denotes the load disturbance at time t; $O1(l)=b$ is the set of branches whose head-end node is node b; $O2(l)=b$ is the set of branches whose end node is node b; $P_{l,t}$ and $Q_{l,t}$ are respectively the active and reactive power of branch l at time t; $r_l$ and $x_l$ are respectively the resistance and reactance of branch l; $U_0$ is the voltage amplitude of the slack node at time t; $U_{b,t}$ is the voltage amplitude of node b at time t; $P^{G}_{b,t}$ is the active power of the generator units connected to node b; $Q^{G}_{b,t}$ is the reactive power of the generator units connected to node b; $P^{L}_{b,t}$ is the active power of the load connected to node b; $Q^{L}_{b,t}$ is the reactive power of the load connected to node b; $P_{l,\max}$ and $P_{l,\min}$ are the upper and lower active power limits of branch l; $Q_{l,\max}$ and $Q_{l,\min}$ are the upper and lower reactive power limits of branch l; $U_{b,\max}$ and $U_{b,\min}$ are the upper and lower voltage limits of node b.
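To show how an agent might screen a candidate control action against these constraints before execution, here is a small hedged Python check for the thermal unit limits of formula (10); the numeric limits are illustrative assumptions.

```python
def thermal_action_feasible(p_t: float, p_prev: float,
                            p_min: float, p_max: float, ramp: float) -> bool:
    """Check the output limits and ramp-rate constraint of formula (10)."""
    return p_min <= p_t <= p_max and abs(p_t - p_prev) <= ramp

# usage with illustrative values: a 100-300 MW unit with a 50 MW/period ramp
ok = thermal_action_feasible(p_t=220.0, p_prev=200.0,
                             p_min=100.0, p_max=300.0, ramp=50.0)
```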
The above description is merely of preferred embodiments of the present invention, and the scope of the present invention is not limited to the above embodiments, but all equivalent modifications or variations according to the present disclosure will be within the scope of the claims.
Claims (10)
1. A transmission and distribution frequency modulation resource cooperative control method based on a federal reinforcement learning algorithm, characterized in that the control method comprises the following steps:
S1, dividing a regional power grid into a main network zone and a plurality of distribution network zones;
S2, setting an agent in the dispatching center of each zone, and establishing a corresponding DQN neural network model for each agent;
S3, each agent locally trains its corresponding DQN neural network model using the local data of its zone, applies homomorphic encryption to the information of the locally trained DQN neural network model, and uploads the encrypted information to the aggregation center;
S4, the aggregation center performs gradient averaging on all the encrypted information and sends the gradient-averaged information to each agent; each agent receives the gradient-averaged information and continues training its locally trained DQN neural network model accordingly to obtain a trained DQN neural network model, through which the frequency modulation instruction of each dispatched unit is obtained.
2. The transmission and distribution frequency modulation resource cooperative control method based on the federal reinforcement learning algorithm according to claim 1, characterized in that:
in step S3, when each agent locally trains its DQN neural network model using the local data of its zone, the state space, action space, and reward function of each agent are set according to the Markov decision process;
setting the state space of agent z specifically comprises:
taking the size of the total frequency adjustment instruction, determined by the total frequency response deviation in the frequency allocation process, as the state space of agent z;
the state of agent z at time t is $S_{z,t}$;
setting the action space of agent z specifically comprises:
setting the action space $A_z$ over which agent z can make decisions; all control actions of agent z are selected from the action space $A_z$;
the control action $a_{z,t}$ of agent z at time t can be expressed as:

$a_{z,t} = \left( P^{G}_{o,t},\ P^{B}_{m,t},\ P^{W}_{n,t},\ P^{EV}_{j,t} \right)$ (1)

In formula (1): $P^{G}_{o,t}$ is the active output at time t of the o-th thermal power unit controlled by agent z; $P^{B}_{m,t}$ is the active output at time t of the m-th energy storage device controlled by agent z; $P^{W}_{n,t}$ is the active output at time t of the n-th wind turbine unit controlled by agent z; $P^{EV}_{j,t}$ is the active output at time t of the j-th electric vehicle group controlled by agent z;
setting the reward function of agent z specifically comprises:
setting the reward given by the environment for the control actions of agent z, with the goal of minimizing the deviation between the adjustment power instruction value and the power response value, and constructing the reward function of agent z:

$\min F = \sum_{t=1}^{Q} \sum_{i=1}^{q} \left( \Delta P^{G}_{i} - \Delta P^{R}_{i} \right)^{2}$ (2)

$R_{z,t} = -\sum_{i=1}^{q} \left( \Delta P^{G}_{i} - \Delta P^{R}_{i} \right)^{2}$ (3)

In formulas (2)-(3): $R_{z,t}$ is the reward function of agent z at time t; Q is the number of control periods; q is the number of APC units in the zone corresponding to agent z; i indexes the i-th APC unit in the zone corresponding to agent z; t indexes the t-th discrete control period; $\Delta P^{G}_{i}$ is the input adjustment power instruction value of the i-th APC unit in the zone corresponding to agent z; $\Delta P^{R}_{i}$ is the power response value of the i-th APC unit in the zone corresponding to agent z;
the value function for the objective function min F is obtained by discounted cumulative summation:

$V_{z} = \mathbb{E}\left[ \sum_{t'} \gamma^{t'} R_{z,t'} \right]$ (4)

In formula (4): $\mathbb{E}[\cdot]$ averages all cumulative rewards generated by the control actions $a_z$; $\gamma^{t'} \in [0,1]$ is the discount coefficient; $\sum_{t'} \gamma^{t'} R_{z,t'}$ is the accumulation of the reward functions corresponding to a plurality of consecutive actions of agent z.
3. The transmission and distribution frequency modulation resource cooperative control method based on the federal reinforcement learning algorithm according to claim 2, characterized in that:
in step S3, agent z locally training its corresponding DQN neural network model using the local data of its zone specifically comprises:
S31, agent z initializes the current-network parameters of its DQN neural network model and copies a target network with the same structure as the current network;
S32, agent z trains the DQN neural network model with the state data of the 96 time periods within a day of its zone, and updates the parameters of the target network.
4. The transmission and distribution frequency modulation resource cooperative control method based on the federal reinforcement learning algorithm according to claim 3, characterized in that:
in step S32, agent z training the DQN neural network model with the state data of the 96 time periods within a day of its zone comprises:
S321, selecting the state data of one time period from the state data of the 96 time periods within the day as the current state $s_t$ of agent z;
S322, based on the current state $s_t$ of agent z, performing trial-and-error with an $\varepsilon$-greedy strategy, i.e., selecting the control action $a_t$ with a random strategy with probability $\varepsilon$, and selecting the currently optimal control action with probability $1-\varepsilon$:

$a_t = \begin{cases} \text{a random action in } A_z, & \text{with probability } \varepsilon \\ \arg\max_{a} Q(s_t, a), & \text{with probability } 1-\varepsilon \end{cases}$ (5)

S323, according to the selected control action a, calculating the reward $r_t$ obtained after executing control action a, with the Q value updated by the following function:

$Q(s_t, a_t) = Q(s_t, a_t) + \eta \left[ r_t + \mu \max Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]$ (6)

In formulas (5) and (6): $Q(s_t, a_t)$ is the Q value of the current network; $\max Q(s_{t+1}, a_{t+1})$ is the Q value of the target network; $\eta$ is the learning rate; $\mu$ is the reward attenuation coefficient;
S324, according to the selected control action a, obtaining the next state $s_{t+1}$ returned by the environment after agent z executes the selected control action a, obtaining an experience sample $(s_t, a, r_t, s_{t+1})$, and storing the experience sample $(s_t, a, r_t, s_{t+1})$ into the experience replay pool;
S325, updating the current state of agent z to the next state returned by the environment, and repeating steps S322-S324 until the experience replay pool is full;
S326, after the experience replay pool is full, extracting $\omega$ experience samples from the experience replay pool for calculation, and updating the loss function:

$F_z = \frac{1}{\omega} \sum_{i=1}^{\omega} \left[ r_{i,z} + \mu \max_{a'} Q'(s_{z,i+1}, a') - Q(s_{z,i}, a_{z,i}) \right]^{2}$ (7)
5. The transmission and distribution frequency modulation resource cooperative control method based on the federal reinforcement learning algorithm according to claim 1, characterized in that:
in step S3, applying homomorphic encryption to the information of the locally trained DQN neural network model and uploading the encrypted information to the aggregation center specifically comprises:
S34, each agent encrypts the corresponding loss function of its locally trained DQN neural network model using the Paillier additively homomorphic encryption public key K to obtain an encrypted loss function;
S35, each agent transmits the encrypted loss function to the aggregation center.
6. The transmission and distribution frequency modulation resource cooperative control method based on the federal reinforcement learning algorithm according to claim 5, characterized in that:
in step S4, the aggregation center performing gradient averaging on all the encrypted information and sending the gradient-averaged information to each agent, and each agent receiving the gradient-averaged information and continuing training its locally trained DQN neural network model accordingly, specifically comprises:
S41, the aggregation center calculates a comprehensive loss function $[[F]]$ from the encrypted loss functions sent by the agents:

$[[F]] = \sum_{z=1}^{Y} [[F_z]] = \sum_{z=1}^{Y} \left[\left[ \frac{1}{\omega} \sum_{i=1}^{\omega} \left( R_{y,z} + \mu \max_{a'} Q'(s_{z,i+1}, a') - Q(s_{z,i}, a_{z,i}) \right)^{2} \right]\right]$ (8)

In formula (8): $[[\cdot]]$ denotes the summation of the multiple encrypted loss functions; $R_{y,z}$ is the reward function of agent z; $Q'(s_{z,i+1}, a')$ is the target-network Q value corresponding to agent z; $Q(s_{z,i}, a_{z,i})$ is the current-network Q value corresponding to agent z; $\eta$ is the learning rate; $\mu$ is the reward attenuation coefficient; Y is the total number of agents;
S42, the aggregation center transmits the comprehensive loss function $[[F]]$ to each agent, and each agent calculates the gradient information of the current network in its locally trained DQN neural network model with respect to the comprehensive loss function $[[F]]$;
S43, each agent adds a security mask to the gradient information and transmits the masked gradient information to the aggregation center;
S44, after receiving the masked gradient information, the aggregation center removes the homomorphic encryption of the gradient information and returns the decrypted result to the corresponding agent;
S45, each agent receives the decrypted result and removes the security mask from the result to obtain unencrypted gradient information, and each agent updates the parameters $\theta_{z,t}$ of the current network in its locally trained DQN neural network model using the unencrypted gradient information.
7. The transmission and distribution frequency modulation resource cooperative control method based on the federal reinforcement learning algorithm according to claim 6, characterized in that:
in step S45, the formula for updating the parameters $\theta_{z,t}$ of the current network in the locally trained DQN neural network model is:

$\theta_{z,t} = \theta_{z,t-1} - \eta\, \nabla_{\theta} F$ (9)

In formula (9): F is the loss function; $\theta_{z,t}$ are the updated current-network parameters of agent z; $\theta_{z,t-1}$ are the current-network parameters of agent z before the update; $\eta$ is the learning rate.
8. The transmission and distribution frequency modulation resource cooperative control method based on the federal reinforcement learning algorithm according to claim 2, characterized in that:
the control actions in the action space conform to the power supply characteristic constraints and the system balance constraints.
9. The transmission and distribution frequency modulation resource cooperative control method based on the federal reinforcement learning algorithm according to claim 8, characterized in that:
the power supply characteristic constraints specifically include:
thermal power unit operation constraints:

$P^{G}_{i,\min} \le P^{G}_{i,t} \le P^{G}_{i,\max}, \quad \left| P^{G}_{i,t} - P^{G}_{i,t-1} \right| \le r^{G}_{i}$ (10)

In formula (10): $P^{G}_{i,\max}$ and $P^{G}_{i,\min}$ are the upper and lower output limits of the i-th thermal power unit; $r^{G}_{i}$ is the ramp rate of the i-th thermal power unit; $P^{G}_{i,t}$ is the output of the i-th thermal power unit at time t; $P^{G}_{i,t-1}$ is the output of the i-th thermal power unit at time t-1;
energy storage device operation constraints:

$E_{m,\min} \le E_{m,t} \le E_{m,\max}, \quad 0 \le P^{ch}_{m,t} \le P^{ch}_{m,\max}, \quad P^{dis}_{m,\min} \le P^{dis}_{m,t} \le P^{dis}_{m,\max}, \quad E_{m,t+1} = (1-\sigma_{m}) E_{m,t} + \left( \eta^{ch}_{m} P^{ch}_{m,t} - P^{dis}_{m,t} / \eta^{dis}_{m} \right) \Delta t$ (11)

In formula (11): $E_{m,\min}$ and $E_{m,\max}$ define the capacity constraint range of the m-th energy storage device; $E_{m,t}$ is the capacity of the m-th energy storage device at time t; $P^{ch}_{m,t}$ is the charging power of the m-th energy storage device at time t; $P^{ch}_{m,\max}$ defines the charging power constraint range of the m-th energy storage device; $P^{dis}_{m,t}$ is the discharging power of the m-th energy storage device at time t; $P^{dis}_{m,\min}$ and $P^{dis}_{m,\max}$ define the discharging power constraint range of the m-th energy storage device; $E_{m,t+1}$ is the battery capacity of the m-th energy storage device at time t+1; $\sigma_{m}$ is the self-discharge rate of the m-th energy storage device; $\eta^{ch}_{m}$ is the charging efficiency of the m-th energy storage device; $\eta^{dis}_{m}$ is the discharging efficiency of the m-th energy storage device;
distributed wind farm unit operation constraints:

$P^{W}_{n,\min} \le P^{W}_{n,t} \le P^{W}_{n,\max}$ (12)

In formula (12): $P^{W}_{n,\min}$ is the lower output power limit of the n-th wind turbine; $P^{W}_{n,\max}$ is the upper output power limit of the n-th wind turbine; $P^{W}_{n,t}$ is the output power of the n-th wind turbine at time t;
electric vehicle group state constraints:

$SOC_{j,\min} \le SOC_{j,t} \le SOC_{j,\max}$ (13)

$P^{EV}_{j,t,\min} \le P^{EV}_{j,t} \le P^{EV}_{j,t,\max}$ (14)

$\Delta P^{M}_{t,\min} \le \Delta P^{M}_{t} \le \Delta P^{M}_{t,\max}$ (15)

$SOC_{j,t+1} = SOC_{j,t} + P^{EV}_{j,t}\, \Delta t^{EV} / E^{EV}_{j}$ (16)

In formulas (13)-(16): $SOC_{j,\min}$ and $SOC_{j,\max}$ define the SOC constraint range of the j-th electric vehicle; $SOC_{j,t}$ is the SOC of the j-th electric vehicle at time t; $\Delta P^{M}_{t,\min}$ and $\Delta P^{M}_{t,\max}$ define the output power increment constraint range of the M-th electric vehicle charging station at time t; $\Delta P^{M}_{t}$ is the output power increment of the M-th electric vehicle charging station at time t; $\Delta t^{EV}$ is the time period for which a single electric vehicle is connected to the charging station; $P^{EV}_{j,t,\max}$ and $P^{EV}_{j,t,\min}$ are the upper and lower charging/discharging power limits of the j-th electric vehicle at time t, which are influenced by the number of vehicles j in the charging station, the SOC capacity of a single electric vehicle, and the charge/discharge state of the single electric vehicle; $E^{EV}_{j}$ is the battery capacity of the j-th electric vehicle; $P^{EV,ch}_{j}$ is the rated charging power of the j-th electric vehicle; $P^{EV,dis}_{j}$ is the rated discharging power of the j-th electric vehicle; $P^{EV}_{j,t}$ is the power of the j-th electric vehicle at time t.
10. The transmission and distribution frequency modulation resource cooperative control method based on the federal reinforcement learning algorithm according to claim 8, characterized in that:
the system balance constraints specifically include:
power balance constraint:

$\sum_{i=1}^{N_g} P^{G}_{i,t} + \sum_{m=1}^{N_b} P^{B}_{m,t} + \sum_{n=1}^{N_w} P^{W}_{n,t} + \sum_{j=1}^{N_e} P^{EV}_{j,t} = P^{L}_{t}$ (17)

branch power flow and node voltage constraints:

$\sum_{l \in O2(b)} P_{l,t} - \sum_{l \in O1(b)} P_{l,t} + P^{G}_{b,t} - P^{L}_{b,t} = 0, \quad \sum_{l \in O2(b)} Q_{l,t} - \sum_{l \in O1(b)} Q_{l,t} + Q^{G}_{b,t} - Q^{L}_{b,t} = 0, \quad U_{b,t} = U_{0} - \dfrac{r_{l} P_{l,t} + x_{l} Q_{l,t}}{U_{0}}$ (18)

operating limit constraints:

$P_{l,\min} \le P_{l,t} \le P_{l,\max}, \quad Q_{l,\min} \le Q_{l,t} \le Q_{l,\max}, \quad U_{b,\min} \le U_{b,t} \le U_{b,\max}$ (19)

In formulas (17)-(19): $P^{G}_{i,t}$, $P^{B}_{m,t}$, $P^{W}_{n,t}$, $P^{EV}_{j,t}$ respectively denote the active outputs at time t of the i-th thermal power unit, the m-th energy storage device, the n-th wind turbine unit, and the j-th electric vehicle; $N_g$, $N_b$, $N_w$, $N_e$ respectively denote the numbers of thermal power units, energy storage devices, wind turbine units, and electric vehicles; $P^{L}_{t}$ denotes the load disturbance at time t; $O1(l)=b$ is the set of branches whose head-end node is node b; $O2(l)=b$ is the set of branches whose end node is node b; $P_{l,t}$ and $Q_{l,t}$ are respectively the active and reactive power of branch l at time t; $r_l$ and $x_l$ are respectively the resistance and reactance of branch l; $U_0$ is the voltage amplitude of the slack node at time t; $U_{b,t}$ is the voltage amplitude of node b at time t; $P^{G}_{b,t}$ is the active power of the generator units connected to node b; $Q^{G}_{b,t}$ is the reactive power of the generator units connected to node b; $P^{L}_{b,t}$ is the active power of the load connected to node b; $Q^{L}_{b,t}$ is the reactive power of the load connected to node b; $P_{l,\max}$ and $P_{l,\min}$ are the upper and lower active power limits of branch l; $Q_{l,\max}$ and $Q_{l,\min}$ are the upper and lower reactive power limits of branch l; $U_{b,\max}$ and $U_{b,\min}$ are the upper and lower voltage limits of node b.