CN116054285A - Transmission and distribution frequency modulation resource cooperative control method based on federal reinforcement learning algorithm

Transmission and distribution frequency modulation resource cooperative control method based on federal reinforcement learning algorithm

Info

Publication number
CN116054285A
CN116054285A (application CN202211728739.5A)
Authority
CN
China
Prior art keywords
agent
power
neural network
network model
intelligent agent
Prior art date
Legal status
Pending
Application number
CN202211728739.5A
Other languages
Chinese (zh)
Inventor
陈然
许汉平
周蠡
蔡杰
贺兰菲
周英博
李吕满
张赵阳
廖晓红
熊一
孙利平
熊川羽
Current Assignee
Economic and Technological Research Institute of State Grid Hubei Electric Power Co Ltd
Original Assignee
Economic and Technological Research Institute of State Grid Hubei Electric Power Co Ltd
Priority date
Filing date
Publication date
Application filed by Economic and Technological Research Institute of State Grid Hubei Electric Power Co Ltd
Priority to CN202211728739.5A
Publication of CN116054285A
Legal status: Pending

Classifications

    • HELECTRICITY
    • H02GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02JCIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J3/00Circuit arrangements for ac mains or ac distribution networks
    • H02J3/38Arrangements for parallely feeding a single network by two or more generators, converters or transformers
    • H02J3/46Controlling of the sharing of output between the generators, converters, or transformers
    • H02J3/48Controlling the sharing of the in-phase component
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/62Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245Protecting personal data, e.g. for financial or medical purposes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0631Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/06Energy or water supply
    • HELECTRICITY
    • H02GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02JCIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J3/00Circuit arrangements for ac mains or ac distribution networks
    • H02J3/24Arrangements for preventing or reducing oscillations of power in networks
    • H02J3/241The oscillation concerning frequency
    • HELECTRICITY
    • H02GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02JCIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J3/00Circuit arrangements for ac mains or ac distribution networks
    • H02J3/28Arrangements for balancing of the load in a network by storage of energy
    • HELECTRICITY
    • H02GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02JCIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J2203/00Indexing scheme relating to details of circuit arrangements for AC mains or AC distribution networks
    • H02J2203/10Power transmission or distribution systems management focussing at grid-level, e.g. load flow analysis, node profile computation, meshed network optimisation, active network management or spinning reserve management
    • HELECTRICITY
    • H02GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02JCIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J2203/00Indexing scheme relating to details of circuit arrangements for AC mains or AC distribution networks
    • H02J2203/20Simulating, e g planning, reliability check, modelling or computer assisted design [CAD]
    • HELECTRICITY
    • H02GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02JCIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J2300/00Systems for supplying or distributing electric power characterised by decentralized, dispersed, or local generation
    • H02J2300/20The dispersed energy generation being of renewable origin
    • H02J2300/28The renewable source being wind energy


Abstract

A transmission and distribution frequency modulation resource cooperative control method based on a federal reinforcement learning algorithm comprises the following steps: dividing a regional power grid into a main-grid zone and a plurality of distribution-grid zones; setting an agent in the dispatching center of each zone and establishing a corresponding DQN neural network model for each agent; each agent locally trains its DQN neural network model with the data of its own zone, encrypts the information of the locally trained model, and uploads the encrypted information to an aggregation center; the aggregation center performs gradient averaging on the encrypted information and returns the result to each agent; each agent continues training its locally trained DQN neural network model with the gradient-averaged information to obtain a trained model, through which the frequency-modulation command of each dispatched unit in the regional power grid is obtained. The design ensures safe and efficient operation of the regional power grid and protects the privacy of frequency-modulation users.

Description

Transmission and distribution frequency modulation resource cooperative control method based on federal reinforcement learning algorithm
Technical Field
The invention belongs to the field of automatic generation control of power systems, and particularly relates to a transmission and distribution frequency modulation resource cooperative control method based on a federal reinforcement learning algorithm.
Background
Automatic power control (APC) extends the traditional automatic generation control (AGC) of generator sets to the adjustable-load side: it retains the original AGC function of the generating units and also recognizes the frequency-regulation capability of flexible resources. Most flexible resources are connected through the distribution network. With the development of communication and related technologies, the distribution network is gradually changing from a one-way power-receiving network into a local power network with self-balancing capability, and the relationship between the main grid and the distribution network is changing from the original master-subordinate relationship into a mutually supporting, bidirectional interaction. Traditional frequency-modulation resources such as thermal and hydro power are mostly connected on the main-grid side, where the grid topology is relatively simple compared with the distribution network, while distributed power sources are mostly connected on the distribution-network side; when a distributed source is dispatched to increase or decrease its power, its influence on distribution-network operation cannot be neglected, because the primary function of the distribution network is always to provide reliable electric energy to users. In this context, how to combine resources with different power characteristics and environments to participate in APC is a challenge in the development of new power systems.
The current closed-loop APC control process of a regional power grid mainly comprises two stages: 1) collect the grid frequency deviation and tie-line power deviation, calculate the real-time area control error (ACE), and obtain the total generated-power command through a PI controller; 2) distribute this command to each APC unit with a power-allocation method. At present the total regulating-power command is mainly distributed according to the adjustable capacity of each unit, but this strategy cannot meet the optimal control requirement of the system. Meanwhile, traditional centralized control involves a large amount of computation, centralized communication and poor reliability, and cannot adapt to a flexible, changeable active distribution network structure; the control mode is therefore gradually shifting from centralized to distributed control. However, distributed control with an agent installed at every frequency-modulation unit makes it difficult to achieve overall optimization of an autonomous area because of the strong dispersion of distributed power sources. In addition, the growing number of distributed power sources means an increasing number of stakeholders, whose privacy is under threat. Given these problems of distributed control, a flexible optimal power allocation strategy is needed to guarantee safe and efficient operation of the regional power grid and the privacy of frequency-modulation users.
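The conventional two-stage closed loop described above can be illustrated with a short sketch. It is a minimal illustration and not the patented method: the frequency-bias coefficient, the PI gains, the signal names and the capacity-based split are all assumptions of the example.

```python
# Hedged sketch of the conventional closed loop described above:
# ACE is computed from frequency and tie-line deviations, a PI controller
# turns it into a total regulating-power command, and the command is then
# split among APC units in proportion to their adjustable capacity.
# All names and gains are illustrative assumptions.

def area_control_error(delta_f_hz: float, delta_p_tie_mw: float, beta_mw_per_hz: float) -> float:
    """ACE = tie-line power deviation + frequency bias * frequency deviation."""
    return delta_p_tie_mw + beta_mw_per_hz * delta_f_hz

class PIController:
    def __init__(self, kp: float, ki: float, dt_s: float):
        self.kp, self.ki, self.dt_s = kp, ki, dt_s
        self.integral = 0.0

    def total_command(self, ace_mw: float) -> float:
        self.integral += ace_mw * self.dt_s
        return -(self.kp * ace_mw + self.ki * self.integral)

def capacity_based_split(total_cmd_mw: float, adjustable_capacities_mw: list[float]) -> list[float]:
    """The conventional allocation criticized in the text: share by adjustable capacity."""
    total_cap = sum(adjustable_capacities_mw)
    return [total_cmd_mw * c / total_cap for c in adjustable_capacities_mw]

pi = PIController(kp=0.5, ki=0.05, dt_s=4.0)
ace = area_control_error(delta_f_hz=-0.05, delta_p_tie_mw=12.0, beta_mw_per_hz=-800.0)
print(capacity_based_split(pi.total_command(ace), [100.0, 50.0, 50.0]))
```

The capacity-based split shown last is exactly the allocation rule that, as stated above, cannot meet the optimal control requirement of the system; the method of the invention replaces it with the learned allocation described below.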
Disclosure of Invention
The object of the invention is to overcome the problems of the prior art, in which the existing control methods can hardly meet the optimal control requirement of the system, cannot adapt to a flexible and changeable active distribution network structure, and leave the privacy of frequency-modulation users under threat. The invention provides a transmission and distribution frequency modulation resource cooperative control method based on a federal reinforcement learning algorithm, which can meet the optimal control requirement of the system, adapt to a flexible and changeable active distribution network structure, and ensure both safe and efficient operation of the regional power grid and the privacy of frequency-modulation users.
In order to achieve the above object, the technical solution of the present invention is:
A transmission and distribution frequency modulation resource cooperative control method based on a federal reinforcement learning algorithm comprises the following steps:
S1, dividing a regional power grid into a main-grid zone and a plurality of distribution-grid zones;
S2, setting an agent in the dispatching center of each zone, and establishing a corresponding deep Q-network (DQN) neural network model for each agent;
S3, each agent locally trains its DQN neural network model with the local data of its zone, applies additively homomorphic encryption to the information of the locally trained model, and uploads the encrypted information to the aggregation center;
S4, the aggregation center performs gradient averaging on all the encrypted information and sends the averaged information to each agent; each agent receives the averaged information, continues training its locally trained DQN neural network model accordingly to obtain the trained model, and obtains the frequency-modulation command of each dispatched unit through the trained DQN neural network model.
In step S3, when each agent locally trains its DQN neural network model with the local data of its zone, the state space, action space and reward function of each agent are set according to a Markov decision process.
Setting the state space of agent z specifically comprises:
taking as the state space of agent z the magnitude of the total frequency-regulation command, which is determined by the total frequency-response deviation during frequency allocation;
the state of agent z at time t is denoted S_z,t.
Setting the action space of agent z specifically comprises:
establishing the action space A_z over which agent z can decide; every control behaviour of agent z is selected from the action space A_z;
the control behaviour a_z,t of agent z at time t can be expressed as

a_z,t = {ΔP^G_o,t, ΔP^B_m,t, ΔP^W_n,t, ΔP^EV_j,t}   (1)

In formula (1), ΔP^G_o,t is the active output at time t of the o-th thermal power unit controlled by agent z; ΔP^B_m,t is the active output at time t of the m-th energy storage device controlled by agent z; ΔP^W_n,t is the active output at time t of the n-th wind turbine controlled by agent z; ΔP^EV_j,t is the active output at time t of the j-th electric-vehicle group controlled by agent z.
Setting the reward function of agent z specifically comprises the following steps:
setting the reward that the environment gives to the control behaviour of agent z, with the objective of minimizing the deviation between the regulation power command value and the power response value, and constructing the reward function of agent z:

(Formulae (2)-(3): the objective function min F and the reward function R_z,t, both built from the deviations ΔP_i^G − ΔP_i^R of the APC units over the control periods.)

In formulae (2)-(3), R_z,t is the reward function of agent z at time t; Q is the number of control periods; q is the number of APC units in the zone of agent z; i is the i-th APC unit in the zone of agent z; t is the t-th discrete control period; ΔP_i^G is the regulation power command value input to the i-th APC unit in the zone of agent z; ΔP_i^R is the power response value of the i-th APC unit in the zone of agent z.
The value function for the objective function min F is obtained by a discounted cumulative sum:

(Formula (4): the expected discounted cumulative sum of the rewards produced by control behaviour a_z.)

In formula (4), the left-hand term is the average of all cumulative rewards produced by control behaviour a_z; γ_t' ∈ [0,1] is the discount coefficient; R_z,t' is the accumulation of the reward functions corresponding to a number of consecutive behaviours of agent z.
In step S3, the local training of the corresponding DQN neural network model by agent z with the local data of its zone specifically comprises:
S31, agent z initializes the current-network parameters of its DQN neural network model and copies a target network with the same structure;
S32, agent z trains the DQN neural network model with the state data of the 96 periods of the day for its zone, and updates the parameters of the target network.
In step S32, the training of the DQN neural network model by agent z with the state data of the 96 periods of the day for its zone comprises:
S321, selecting the state data of one period from the state data of the 96 periods of the day as the current state s_t of agent z;
S322, based on the current state s_t of agent z, performing trial and error with an ε-greedy strategy: with probability ε the control behaviour a_t is selected by a random strategy, and with probability 1−ε the currently optimal control behaviour a*_t is selected, where

a*_t = argmax_a Q(s_t, a)   (5)

S323, according to the selected control behaviour a and the current network of the DQN neural network model, calculating the reward r_t obtained after executing the control behaviour a, and updating the Q value with the following function:

Q(s_t, a_t) = Q(s_t, a_t) + η[r_t + μ·max Q(s_t+1, a_t+1) − Q(s_t, a_t)]   (6)

In formulae (5) and (6), Q(s_t, a_t) is the current Q value; max Q(s_t+1, a_t+1) is the target Q value; η is the learning rate; μ is the reward attenuation coefficient;
S324, according to the selected control behaviour a, obtaining the next state s_t+1 returned by the environment after agent z executes the selected control behaviour a, forming an experience sample (s_t, a, r_t, s_t+1) and storing it in the experience replay pool;
S325, updating the current state of agent z to the next state returned by the environment, and repeating steps S322-S324 until the experience replay pool is full;
S326, after the experience replay pool is full, extracting ω experience samples from the experience replay pool for calculation and updating the loss function:

F_z = (1/ω) Σ_{i=1}^{ω} [ r_i,z + μ·max Q'(s_z,i+1, a_z,i+1) − Q(s_z,i, a_z,i) ]²   (7)

In formula (7), F_z is the loss function; r_i,z is the reward function of agent z; Q(s_z,i, a_z,i) is the Q value of the current network; Q'(s_z,i+1, a_z,i+1) is the target Q value corresponding to the experience sample.
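The local loop S31-S326 follows the usual DQN pattern of a current network, a copied target network, ε-greedy exploration and an experience replay pool. The sketch below illustrates that pattern only: the state and action dimensions, the hyper-parameters and the environment interface (reset/step and its reward) are assumptions, and the real zone environment of the method is represented by a placeholder object.

```python
# Hedged sketch of the per-agent local DQN training described in S31-S326.
# Assumed dimensions (8 discretized states, a small discrete action set) and
# assumed hyper-parameters; the zone environment's step(), reward shaping and
# action encoding are placeholders, not details fixed by the method.
import random
from collections import deque

import torch
import torch.nn as nn

class QNet(nn.Module):
    def __init__(self, n_states: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_states, 64), nn.ReLU(),
                                 nn.Linear(64, n_actions))

    def forward(self, x):
        return self.net(x)

def one_hot(idx: int, n: int) -> torch.Tensor:
    v = torch.zeros(n)
    v[idx] = 1.0
    return v

def local_training(env, n_states=8, n_actions=16, episodes=96,
                   eps=0.1, eta=1e-3, mu=0.9, omega=32, pool_size=512):
    q_net = QNet(n_states, n_actions)              # current network (S31)
    target = QNet(n_states, n_actions)             # copied target network (S31)
    target.load_state_dict(q_net.state_dict())
    opt = torch.optim.SGD(q_net.parameters(), lr=eta)
    pool = deque(maxlen=pool_size)                 # experience replay pool

    for ep in range(episodes):                     # one pass per 15-min period (S321)
        s = env.reset(period=ep)
        done = False
        while not done:
            if random.random() < eps:              # epsilon-greedy trial and error (S322)
                a = random.randrange(n_actions)
            else:
                a = int(q_net(one_hot(s, n_states)).argmax())
            s_next, r, done = env.step(a)          # environment returns next state and reward
            pool.append((s, a, r, s_next))         # store experience sample (S324)
            s = s_next
            if len(pool) == pool.maxlen:           # replay pool full (S326)
                batch = random.sample(pool, omega)
                ss = torch.stack([one_hot(b[0], n_states) for b in batch])
                aa = torch.tensor([b[1] for b in batch])
                rr = torch.tensor([float(b[2]) for b in batch])
                sn = torch.stack([one_hot(b[3], n_states) for b in batch])
                q = q_net(ss).gather(1, aa.unsqueeze(1)).squeeze(1)
                with torch.no_grad():
                    q_target = rr + mu * target(sn).max(dim=1).values
                loss = ((q_target - q) ** 2).mean()   # loss in the spirit of formula (7)
                opt.zero_grad(); loss.backward(); opt.step()
        target.load_state_dict(q_net.state_dict())    # refresh target after each pass
    return q_net
```

In the federated setting of step S4, each agent would run this local loop on its own zone data and exchange only encrypted loss and gradient information, never the raw samples.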
In step S3, applying additively homomorphic encryption to the information of the locally trained DQN neural network model and uploading the encrypted information to the aggregation center specifically comprises:
S34, each agent encrypts the loss function of its locally trained DQN neural network model with the Paillier additively homomorphic public key K, obtaining the encrypted loss function;
S35, each agent transmits the encrypted loss function to the aggregation center.
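As an illustration of steps S34-S35, the sketch below uses the open-source phe (python-paillier) package, whose Paillier scheme is additively homomorphic: ciphertexts can be added and scaled by plaintext constants, which is what the aggregation in step S4 relies on. The key distribution (who generates and holds the private key) and the representation of the loss values are assumptions of the example, not details fixed by the method.

```python
# Hedged sketch of S34-S35 using python-paillier (package name: phe).
# Assumption: a key pair is generated once and the public key K is shared with
# every agent; the aggregation side can add ciphertexts without decrypting them.
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair(n_length=2048)

# each agent encrypts its local loss value F_z with the shared public key K
local_losses = [0.742, 0.518, 0.901]                    # example F_z values of 3 agents
encrypted_losses = [public_key.encrypt(f) for f in local_losses]

# the aggregation center combines the encrypted losses homomorphically:
# addition and multiplication by a plaintext scalar are allowed on ciphertexts
encrypted_mean = sum(encrypted_losses[1:], encrypted_losses[0]) * (1.0 / len(encrypted_losses))

# only the holder of the private key can recover the plaintext result
print(private_key.decrypt(encrypted_mean))              # ~0.720
```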
In step S4, the aggregation center performs gradient averaging on all the encrypted information and sends the averaged information to each agent; each agent receives the averaged information and continues training its locally trained DQN neural network model accordingly. This specifically comprises:
S41, the aggregation center computes a comprehensive loss function from the encrypted loss functions sent by the agents:

(Formula (8): the comprehensive loss function, formed at the aggregation center by combining the encrypted loss functions of all Y agents.)

In formula (8), the summation runs over the encrypted loss functions; R_y,z is the reward function of agent z; the target-network Q value and the current Q value Q(s_z,i, a_z,i) correspond to agent z; η is the learning rate; μ is the reward attenuation coefficient; Y is the total number of agents;
S42, the aggregation center transmits the comprehensive loss function to each agent, and each agent calculates gradient information from the current network of its locally trained DQN neural network model and the comprehensive loss function;
S43, each agent adds a security mask to the gradient information and transmits the masked gradient information to the aggregation center;
S44, after receiving the masked gradient information, the aggregation center removes the homomorphic encryption of the gradient information and returns the decrypted result to the corresponding agent;
S45, each agent receives the decrypted result, removes the security mask from it to obtain the unencrypted gradient information, and updates the current-network parameters θ_z,t of its locally trained DQN neural network model with this gradient information.
In step S45, the formula for updating the current-network parameters θ_z,t of the locally trained DQN neural network model is:

θ_z,t = θ_z,t-1 − η·∂F/∂θ_z,t-1   (9)

In formula (9), F is the loss function, θ_z,t is the updated current-network parameter of agent z, and θ_z,t-1 is the current-network parameter of agent z before the update.
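A compact sketch of one round of S41-S45 is given below. It only mirrors the data flow: agents report encrypted losses, the center aggregates them, agents compute gradients, mask them, and finally apply a gradient-descent update in the spirit of formula (9). Several simplifications are assumptions of the example rather than details of the method: the comprehensive loss is decrypted before being broadcast (the method keeps it encrypted), a random additive mask plays the role of the security mask, the aggregation side holds the Paillier private key, and a single scalar stands in for the model parameters.

```python
# Hedged sketch of one federated round (S41-S45) with scalar "gradients".
# Assumptions: Paillier keys as in the previous sketch, an additive random
# security mask, and one scalar parameter per agent for readability.
import random
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair()

class Agent:
    def __init__(self, theta: float, local_loss: float):
        self.theta = theta                      # current-network parameter (scalar stand-in)
        self.local_loss = local_loss
        self.mask = random.uniform(-1.0, 1.0)   # security mask added in S43

    def encrypted_loss(self):                   # S34/S35
        return public_key.encrypt(self.local_loss)

    def masked_gradient(self, comprehensive_loss: float):
        grad = 2.0 * (self.theta - comprehensive_loss)   # assumed local gradient rule
        return public_key.encrypt(grad + self.mask)      # masked and encrypted (S42-S43)

    def update(self, decrypted_masked_grad: float, eta: float = 0.1):
        grad = decrypted_masked_grad - self.mask         # remove the mask (S45)
        self.theta -= eta * grad                         # update in the spirit of formula (9)

agents = [Agent(0.8, 0.74), Agent(0.3, 0.52), Agent(0.5, 0.90)]

# S41: the center combines the encrypted losses without seeing them in clear
enc = [a.encrypted_loss() for a in agents]
comprehensive = private_key.decrypt(sum(enc[1:], enc[0]) * (1.0 / len(enc)))

# S42-S45: gradients are masked by the agents, decrypted by the center,
# returned, unmasked and applied locally
for a in agents:
    returned = private_key.decrypt(a.masked_gradient(comprehensive))
    a.update(returned)
print([round(a.theta, 3) for a in agents])
```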
The control behaviours in the action space must satisfy the power-source characteristic constraints and the system balance constraints.
The power-source characteristic constraints specifically include:
operation constraints of the thermal power units:

P^G_i,min ≤ P^G_i,t ≤ P^G_i,max,  |P^G_i,t − P^G_i,t-1| ≤ R^G_i   (10)

In formula (10), P^G_i,max and P^G_i,min are the upper and lower output limits of the i-th thermal power unit; R^G_i is the ramp rate of the i-th thermal power unit; P^G_i,t is the output of the i-th thermal power unit at time t; P^G_i,t-1 is its output at time t−1;
operation constraints of the energy storage devices:

(Formula (11): the capacity of each energy storage device is kept within its constraint range, the charging and discharging powers are kept within their constraint ranges, and the battery capacity at time t+1 follows from the capacity at time t, the self-discharge efficiency, and the charging/discharging powers and efficiencies.)

In formula (11), the capacity constraint range bounds the capacity E^B_m,t of the m-th energy storage device at time t; P^c_m,t is its charging power at time t, kept within its charging-power constraint range; P^d_m,t is its discharging power at time t, kept within its discharging-power constraint range; E^B_m,t+1 is its battery capacity at time t+1; σ_m is its self-discharge efficiency; η^c_m is its charging efficiency; η^d_m is its discharging efficiency;
operation constraints of the distributed wind-farm units:

P^W_n,min ≤ P^W_n,t ≤ P^W_n,max   (12)

In formula (12), P^W_n,min is the lower output-power limit of the n-th wind turbine; P^W_n,max is its upper output-power limit; P^W_n,t is its output power at time t;
state constraints of the electric-vehicle groups:

(Formulae (13)-(16): the SOC of each electric vehicle is kept within its constraint range, the output-power increment of each electric-vehicle charging station is kept within its constraint range, the charging/discharging power of each electric vehicle is kept between its upper and lower limits, and the SOC at the next time step follows from the current SOC and the rated charging/discharging power over the connection period.)

In formulae (13)-(16), SOC_j,min and SOC_j,max are the SOC constraint range of the j-th electric vehicle; SOC_j is the SOC of the j-th electric vehicle; the output-power increment ΔP^EV_M,t of the M-th electric-vehicle charging station is kept within its constraint range; T^EV_j is the period during which a single electric vehicle is connected to the charging station; the upper and lower limits of the charging/discharging power of the j-th electric vehicle at time t are influenced by the number of vehicles j in the charging station, the SOC capacity of a single electric vehicle and the charge/discharge state of a single electric vehicle; SOC_j,t is the SOC of the j-th electric vehicle at time t; P^EV_j,c is the rated charging power of the j-th electric vehicle; P^EV_j,d is its rated discharging power; ΔP^EV_M,t is the output-power increment of the charging pile of the M-th electric vehicle at time t, bounded by its upper and lower limits at time t; P^EV_j,t is the power of the j-th electric vehicle at time t.
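As an illustration of how an agent's candidate control behaviour can be kept inside the power-source characteristic constraints above, the sketch below clips a proposed action to the box and ramp limits in the spirit of formulae (10) and (12). The limit values and the simple clipping policy are assumptions of the example; formulae (11) and (13)-(16) would add the storage and electric-vehicle dynamics in the same way.

```python
# Hedged sketch: project a candidate action onto the thermal-unit and wind-unit
# limits of formulae (10) and (12). Numbers are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class ThermalLimits:
    p_min: float      # lower output limit
    p_max: float      # upper output limit
    ramp: float       # ramp rate per control period

def clip_thermal(p_proposed: float, p_previous: float, lim: ThermalLimits) -> float:
    """Enforce output limits and the ramp constraint |P_t - P_{t-1}| <= ramp."""
    p = min(max(p_proposed, lim.p_min), lim.p_max)
    p = min(max(p, p_previous - lim.ramp), p_previous + lim.ramp)
    return p

def clip_wind(p_proposed: float, p_min: float, p_max: float) -> float:
    """Formula (12): keep the wind output inside its band."""
    return min(max(p_proposed, p_min), p_max)

print(clip_thermal(250.0, 180.0, ThermalLimits(p_min=100.0, p_max=300.0, ramp=40.0)))  # 220.0
print(clip_wind(55.0, 0.0, 48.0))                                                       # 48.0
```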
The system balance constraints specifically include:

Σ ΔP^G_i,t + Σ ΔP^B_m,t + Σ ΔP^W_n,t + Σ ΔP^EV_j,t = ΔP^L_t   (17)

(Formulae (18)-(19): the distribution-network operation constraints, i.e. the branch power-flow and nodal voltage equations together with the branch active/reactive power limits and the node voltage limits.)

In formulae (17)-(19), ΔP^G_i,t, ΔP^B_m,t, ΔP^W_n,t and ΔP^EV_j,t are the active outputs at time t of the i-th thermal power unit, the m-th energy storage device, the n-th wind-turbine group and the j-th electric vehicle, respectively; N_g, N_b, N_w and N_e are the numbers of thermal power units, energy storage devices, wind-power units and electric vehicles, over which the sums in formula (17) run; ΔP^L_t is the load disturbance at time t; o1(l)=b is the set of branches whose head node is node b; o2(l)=b is the set of branches whose end node is node b; P_l,t and Q_l,t are the active and reactive power of branch l at time t; r_l and x_l are the resistance and reactance of branch l; U_0 is the voltage magnitude of the slack node at time t; U_b,t is the voltage magnitude of node b at time t; P^gen_b and Q^gen_b are the active and reactive power of the generator set connected to node b; P^load_b and Q^load_b are the active and reactive power of the load connected to node b; P_l,max and P_l,min are the upper and lower limits of the active power of branch l; Q_l,max and Q_l,min are the upper and lower limits of the reactive power of branch l; U_b,max and U_b,min are the upper and lower voltage limits of node b.
Compared with the prior art, the invention has the following beneficial effects:
1. In the transmission and distribution frequency modulation resource cooperative control method based on the federal reinforcement learning algorithm, the regional power grid is divided into a main-grid zone and a plurality of distribution-grid zones, a zone agent is set in the dispatching center of each zone, and each zone is controlled by its own agent. This realizes joint frequency modulation of the traditional units represented on the main-grid side and the multiple types of generating resources of the distributed power sources on the distribution-grid side; the frequency-modulation commands of the dispatched units are allocated through information interaction among the zone agents, achieving overall optimization of the autonomous region. The design therefore makes full use of the main/distribution grid structure of the regional power grid to partition the zones and place the agents, allocates the frequency-modulation commands of the dispatched units through information interaction among the agents, and realizes overall optimization of the autonomous region.
2. In the method, the federal reinforcement learning distributed algorithm solves the cooperation problem among multiple agents; when each DQN neural network is trained, the 96 intra-day operating points of the corresponding distribution network are used for offline training, which effectively shortens the online decision time and improves the real-time performance of command execution. The design therefore uses offline training to shorten the online decision time and improve the real-time performance of command execution.
3. In the method, when additively homomorphic encryption is applied to the information of the local DQN neural network model and the encrypted information is uploaded to the aggregation center, each agent encrypts its loss function with the Paillier homomorphic public key and transmits the encrypted loss function to the aggregation center; the aggregation center computes the comprehensive loss function from the encrypted loss functions and sends it to each agent; each agent calculates the gradient information of the current network of its local DQN model with respect to the comprehensive loss function, adds a security mask to the gradient information and transmits the masked gradient information to the aggregation center; the aggregation center removes the homomorphic encryption of the gradient information and returns the decrypted result to the corresponding agent; each agent then updates the current-network parameters of its DQN model with this result. Throughout the process only model parameters are transmitted and processed, never the raw zone data, and the model parameters are encrypted with the homomorphic public key, so the risk of data leakage in transmission and storage is avoided and the privacy of frequency-modulation users is guaranteed.
4. In the method, the DQN training model is adopted as the neural-network training framework of the federal reinforcement learning; the optimal strategy is obtained by iteratively optimizing the state-action value function matrix Q(s,a) so that the sum of expected discounted returns is maximized. The DQN training model integrates well into the multi-source cooperative frequency-control framework of the main/distribution grid and is therefore well suited to the distributed optimization problem of dynamic APC power allocation.
Drawings
Fig. 1 is a framework diagram of the transmission and distribution frequency modulation resource cooperative control system based on the federal reinforcement learning algorithm.
Fig. 2 is a schematic diagram of the federal reinforcement learning framework based on the DQN training network.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings and detailed description.
Referring to Figs. 1 and 2, a transmission and distribution frequency modulation resource cooperative control method based on a federal reinforcement learning algorithm comprises the following steps:
S1, dividing a regional power grid into a main-grid zone and a plurality of distribution-grid zones;
S2, setting an agent in the dispatching center of each zone, and establishing a corresponding DQN neural network model for each agent;
The dispatching center of each zone is provided with one agent and one DQN neural network model; the agent located in the dispatching center of a zone corresponds to the DQN neural network model of that zone, and each agent only trains, controls and operates its own DQN neural network model.
S3, each agent locally trains its DQN neural network model with the local data of its zone, applies additively homomorphic encryption to the information of the locally trained DQN neural network, and uploads the encrypted information to the aggregation center;
After the aggregation center obtains the total frequency-modulation command, it issues the training task, namely the allocation of frequency-modulation commands to the individual units; once the task is issued, all agents begin to execute the same task, i.e. each agent locally trains its DQN neural network model with the local data of its own zone.
S4, the aggregation center performs gradient averaging on all the encrypted information and sends the averaged information to each agent; each agent receives the averaged information, continues training its locally trained DQN neural network accordingly to obtain the corresponding trained DQN neural network model, and obtains through it the frequency-modulation command of each dispatched unit in its zone.
In step S3, when each agent locally trains its DQN neural network model with the local data of its zone, the state space, action space and reward function of each agent are set according to a Markov decision process.
Setting the state space of agent z specifically comprises:
taking as the state space of agent z the magnitude of the total frequency-regulation command, which is determined by the total frequency-response deviation during frequency allocation;
the amplitude of the frequency variation is divided into eight intervals:
{(-∞,-0.2), [-0.2,-0.15), [-0.15,-0.10), [-0.10,0.03), [0.03,0.10), [0.10,0.15), [0.15,0.2), [0.2,+∞)};
the state of agent z at time t is S_z,t, with S_z,t ∈ {S_1, S_2, S_3, S_4, S_5, S_6, S_7, S_8}, where S_1 and S_8 are the states corresponding to the minimum and maximum values of the total frequency-regulation command of the system under a given disturbance type.
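A small helper that maps a measured frequency deviation onto one of the eight states above is sketched here; the interval boundaries follow the list just given, and returning a plain state index is an illustrative choice.

```python
# Hedged sketch: discretize the frequency deviation into the eight states
# S1..S8 listed above. Boundaries follow the intervals in the text.
import bisect

BOUNDARIES = [-0.2, -0.15, -0.10, 0.03, 0.10, 0.15, 0.2]   # 7 cut points -> 8 intervals

def state_index(delta_f_hz: float) -> int:
    """Return 1..8 for the interval that contains delta_f_hz."""
    return bisect.bisect_right(BOUNDARIES, delta_f_hz) + 1

print(state_index(-0.30))   # 1  -> S1
print(state_index(0.05))    # 5  -> S5
print(state_index(0.25))    # 8  -> S8
```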
Setting the action space of the agent specifically comprises:
establishing the action space A_z over which agent z can decide; every control behaviour of agent z is selected from the action space A_z;
the control behaviour a_z,t of agent z at time t can be expressed as

a_z,t = {ΔP^G_o,t, ΔP^B_m,t, ΔP^W_n,t, ΔP^EV_j,t}   (1)

In formula (1), ΔP^G_o,t is the active output at time t of the o-th thermal power unit controlled by agent z; ΔP^B_m,t is the active output at time t of the m-th energy storage device controlled by agent z; ΔP^W_n,t is the active output at time t of the n-th wind turbine controlled by agent z; ΔP^EV_j,t is the active output at time t of the j-th electric-vehicle group controlled by agent z.
The control behaviour of the agent must satisfy the constraints of the action space. These constraints fall into two classes: power-source characteristic constraints, which account for the dynamic response and transmission process of the units, and system balance constraints, which account for the stable operation of the power system as a whole; the system balance constraints mainly reflect the difference between joint frequency modulation with main/distribution-grid resources and conventional multi-source cooperative frequency modulation.
The power-source characteristic constraints specifically include:
operation constraints of the thermal power units:

P^G_i,min ≤ P^G_i,t ≤ P^G_i,max,  |P^G_i,t − P^G_i,t-1| ≤ R^G_i   (10)

In formula (10), P^G_i,max and P^G_i,min are the upper and lower output limits of the i-th thermal power unit; R^G_i is the ramp rate of the i-th thermal power unit; P^G_i,t is the output of the i-th thermal power unit at time t; P^G_i,t-1 is its output at time t−1;
operation constraints of the energy storage devices:

(Formula (11): the capacity of each energy storage device is kept between its minimum and maximum capacity, the charging and discharging powers are kept within their constraint ranges, and the battery capacity at time t+1 follows from the capacity at time t, the self-discharge efficiency, and the charging/discharging powers and efficiencies.)

In formula (11), E^B_m,min is the minimum capacity of the m-th energy storage device; E^B_m,max is its maximum capacity; E^B_m,t is its battery capacity at time t; P^c_m,t is its charging power at time t, kept within its charging-power constraint range; P^d_m,t is its discharging power at time t, kept within its discharging-power constraint range; E^B_m,t+1 is its battery capacity at time t+1; σ_m is its self-discharge efficiency; η^c_m is its charging efficiency; η^d_m is its discharging efficiency;
operation constraints of the distributed wind-farm units:

P^W_n,min ≤ P^W_n,t ≤ P^W_n,max   (12)

In formula (12), P^W_n,min is the lower output-power limit of the n-th wind turbine; P^W_n,max is its upper output-power limit; P^W_n,t is its output power at time t;
state constraints of the electric-vehicle groups:

(Formulae (13)-(16): the SOC of each electric vehicle is kept within its constraint range, the output-power increment of each electric-vehicle charging station is kept within its constraint range, the charging/discharging power of each electric vehicle is kept between its upper and lower limits, and the SOC at the next time step follows from the current SOC and the rated charging/discharging power over the connection period.)

In formulae (13)-(16), SOC_j,min and SOC_j,max are the SOC constraint range of the j-th electric vehicle; SOC_j is the SOC of the j-th electric vehicle; the output-power increment ΔP^EV_M,t of the M-th electric-vehicle charging station is kept within its constraint range; T^EV_j is the period during which a single electric vehicle is connected to the charging station; the upper and lower limits of the charging/discharging power of the j-th electric vehicle at time t are influenced by the number of vehicles j in the charging station, the SOC capacity of a single electric vehicle and the charge/discharge state of a single electric vehicle; SOC_j,t is the SOC of the j-th electric vehicle at time t; P^EV_j,c is the rated charging power of the j-th electric vehicle; P^EV_j,d is its rated discharging power; ΔP^EV_M,t is the output-power increment of the charging pile of the M-th electric vehicle at time t, bounded by its upper and lower limits at time t; P^EV_j,t is the power of the j-th electric vehicle at time t.
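For the storage dynamics summarized for formula (11), a one-line update of the next-step capacity can serve as an illustration. The multiplicative self-discharge term, the sign conventions, the 15-minute step and all parameter values are assumptions of the example, not the exact form of formula (11).

```python
# Hedged sketch of the storage dynamics summarized for formula (11): next-step
# capacity from self-discharge plus charging/discharging with their efficiencies.
# The functional form, the 15-minute step and the parameter values are assumptions.
def next_capacity(e_t, p_charge, p_discharge, sigma_self=0.001,
                  eta_c=0.95, eta_d=0.95, dt_h=0.25):
    return (1.0 - sigma_self) * e_t + eta_c * p_charge * dt_h - p_discharge * dt_h / eta_d

print(round(next_capacity(e_t=2.0, p_charge=0.4, p_discharge=0.0), 4))   # MWh after one period
```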
The system balance constraints specifically include:

Σ ΔP^G_i,t + Σ ΔP^B_m,t + Σ ΔP^W_n,t + Σ ΔP^EV_j,t = ΔP^L_t   (17)

(Formulae (18)-(19): the distribution-network operation constraints, i.e. the branch power-flow and nodal voltage-drop equations of the network together with the branch active/reactive power limits and the node voltage limits.)

Formula (17) is the system power-balance constraint, and formulae (18) and (19) are the distribution-network operation constraints.
In formulae (17)-(19), ΔP^G_i,t, ΔP^B_m,t, ΔP^W_n,t and ΔP^EV_j,t are the active outputs at time t of the i-th thermal power unit, the m-th energy storage device, the n-th wind-turbine group and the j-th electric vehicle, respectively; N_g, N_b, N_w and N_e are the numbers of thermal power units, energy storage devices, wind-power units and electric vehicles, over which the sums in formula (17) run; ΔP^L_t is the load disturbance at time t; o1(l)=b is the set of branches whose head node is node b; o2(l)=b is the set of branches whose end node is node b; P_l,t and Q_l,t are the active and reactive power of branch l at time t; r_l and x_l are the resistance and reactance of branch l; U_0 is the voltage magnitude of the slack node at time t; U_b,t is the voltage magnitude of node b at time t; U_b+1,t is the voltage magnitude of node b+1 at time t; P^gen_b and Q^gen_b are the active and reactive power of the generator set connected to node b; P^load_b and Q^load_b are the active and reactive power of the load connected to node b; P_l,max and P_l,min are the upper and lower limits of the active power of branch l; Q_l,max and Q_l,min are the upper and lower limits of the reactive power of branch l; U_b,max and U_b,min are the upper and lower voltage limits of node b.
Setting the reward function of agent z specifically comprises the following steps:
setting the reward that the environment gives to the control behaviour of agent z; by changing the control behaviour of agent z, the deviation between the regulation power command value and the power response value is minimized, i.e. the objective is to minimize this deviation, and the objective function min F and the reward function R_z,t of agent z are constructed:

(Formulae (2)-(3): the objective function min F and the reward function R_z,t, both built from the deviations ΔP_i^G − ΔP_i^R of the APC units over the control periods.)

In formulae (2)-(3), R_z,t is the reward function of agent z at time t; Q is the number of control periods; q is the number of APC units in the zone of agent z; i is the i-th APC unit in the zone of agent z; t is the t-th discrete control period; ΔP_i^G is the regulation power command value input to the i-th APC unit in the zone of agent z; ΔP_i^R is the power response value of the i-th APC unit in the zone of agent z.
The value function for the objective function min F is obtained by a discounted cumulative sum:

(Formula (4): the expected discounted cumulative sum of the rewards produced by control behaviour a_z.)

In formula (4), the value is the reward given to agent z for the control behaviour a_z, i.e. the average of all cumulative rewards produced by control behaviour a_z; γ_t' ∈ [0,1] is the discount coefficient; R_z,t' is the accumulation of the reward functions corresponding to a number of consecutive behaviours of agent z.
In step S3, the local training of the DQN neural network model by agent z with the local data of its zone specifically comprises:
S31, each agent initializes the current-network parameters θ_1,t, θ_2,t, ..., θ_z,t of its DQN neural network model and copies a target network with the same structure as the current network;
S32, agent z trains the DQN neural network model with the state data of the 96 periods of the day for its zone, and updates the parameters of the target network.
Every 15 minutes within the 24 hours constitutes one period, giving 96 periods in total; the state data of the zone of agent z over these 96 periods, i.e. the intra-day 96-period state data, are acquired and used to train the corresponding DQN neural network model. The agent trains the DQN neural network model a number of times; each time an agent completes one training pass of the DQN neural network model, the current-network parameters are immediately copied to the target network and the target-network parameters are updated.
In step S32, the training of the DQN neural network model by agent z with the state data of the 96 periods of the day for its zone comprises:
S321, agent z acquires the state data of the 96 periods of the day for its grid zone and selects the state data of one period among the 96 intra-day points as its current state s_t;
S322, based on the current state s_t of agent z, performing trial and error with an ε-greedy strategy: with probability ε the control behaviour a_t is selected by a random strategy, and with probability 1−ε the currently optimal control behaviour a*_t is selected, where

a*_t = argmax_a Q(s_t, a)   (5)

Formula (5) represents selecting the optimal Q value as the current Q value;
S323, according to the selected control behaviour a and the current network of the DQN neural network model, calculating the reward r_t obtained after executing the control behaviour a, and updating the corresponding Q value with the following function:

Q(s_t, a_t) = Q(s_t, a_t) + η[r_t + μ·max Q(s_t+1, a_t+1) − Q(s_t, a_t)]   (6)

In formulae (5) and (6), Q(s_t, a_t) is the Q value of the current network; max Q(s_t+1, a_t+1) is the Q value of the target network; η is the learning rate; μ is the reward attenuation coefficient; r_t ∈ {r_1,t, ..., r_m,t, ..., r_n,t}, the set of reward values of agent z;
S324, according to the selected control behaviour a, obtaining the next state s_t+1 returned by the environment after agent z executes the selected control behaviour a, forming an experience sample (s_t, a, r_t, s_t+1) and storing it in the experience replay pool;
S325, updating the current state of agent z to the next state returned by the environment, and repeating steps S322-S324 until the experience replay pool is full;
S326, after the experience replay pool is full, extracting ω experience samples from the experience replay pool for calculation and updating the loss function:

F_z = (1/ω) Σ_{i=1}^{ω} [ r_i,z + μ·max Q'(s_z,i+1, a_z,i+1) − Q(s_z,i, a_z,i) ]²   (7)

In formula (7), F_z is the loss function; r_i,z is the reward function; Q(s_z,i, a_z,i) is the Q value of the current network for the experience sample; Q'(s_z,i+1, a_z,i+1) is the Q value of the target network for the experience sample; s_z,i, s_z,i+1 ∈ S_z,i indicates that the state of the current action and the state of the target-network action both belong to the state-space set of agent z; a_z,i, a_z,i+1 ∈ A_z,i indicates that the current action and the target-network action both belong to the action-space set of agent z.
In step S3, applying additively homomorphic encryption to the information of the locally trained DQN neural network and uploading the encrypted information to the aggregation center specifically comprises:
S34, each agent encrypts the loss function of its locally trained DQN neural network model with the Paillier additively homomorphic public key K, obtaining the encrypted loss function [F_z]_K, where [·]_K denotes the result of the homomorphic encryption;
S35, each agent transmits the encrypted loss function [F_z]_K to the aggregation center.
In step S4, the aggregation center performing gradient averaging on all the encrypted information and sending the averaged information to each agent specifically comprises:
the aggregation center computes the comprehensive loss function from the encrypted loss functions sent by the agents and sends the comprehensive loss function to each agent:

(Formula (8): the comprehensive loss function, formed at the aggregation center by combining the encrypted loss functions [F_z]_K of all Y agents.)

In formula (8), the summation runs over the encrypted loss functions; R_y,z is the reward function of agent z; the target-network Q value and the current-network Q value Q(s_z,i, a_z,i) correspond to agent z; η is the learning rate; μ is the reward attenuation coefficient; Y is the total number of agents.
Each agent receiving the averaged information and continuing the training of its locally trained DQN neural network accordingly specifically comprises:
each agent receives the comprehensive loss function and calculates the gradient information of the current network of its locally trained DQN neural network model with respect to the comprehensive loss function;
each agent adds a security mask to the gradient information and transmits the masked gradient information to the aggregation center;
after receiving the masked gradient information, the aggregation center removes the homomorphic encryption of the gradient information and returns the decrypted result to the corresponding agent;
each agent receives the decrypted result, removes the security mask from it to obtain the unencrypted gradient information, and updates the current-network parameters θ_z,t of its locally trained DQN neural network model with this gradient information.
The formula for updating the current-network parameters θ_z,t of the locally trained DQN neural network model is:

θ_z,t = θ_z,t-1 − η·∂F/∂θ_z,t-1   (9)

In formula (9), F is the loss function, θ_z,t is the updated current-network parameter of agent z, and θ_z,t-1 is the current-network parameter of agent z before the update.
Federal reinforcement learning follows a Markov decision process (MDP) with the DQN as the training neural network model. The MDP followed by each agent can be expressed as the tuple (z, s_t, a_t, r, s_t+1), where z is the agent number; s_t is the state of the agent at time t; a_t is the control behaviour executed by the agent at time t; r is the reward obtained by the agent after executing behaviour a_t in state s_t; and s_t+1 is the state at the next time step, to which the agent transitions after executing control behaviour a_t in state s_t. During federal reinforcement learning, as shown in Fig. 2, an initial state is selected at random and a control behaviour is selected on the basis of this state; the agent executes the control behaviour in the environment, the environment returns the next state s_t+1 and the obtained reward r, and the tuple (z, s_t, a_t, r, s_t+1) is stored in the experience pool. The next state s_t+1 is then regarded as the current state s_t, and the above steps are repeated until the experience pool is full. The error gradient functions of the Q-network training of the multiple agents are then gradient-averaged by the aggregation center, and the aggregation center returns the averaged information to guide the subsequent training of each agent, so that the multiple agents train on the same task and interact with one another through this information exchange.
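The sketch below strings these pieces together into one federated training round for several zone agents: each agent computes a gradient from its own data, only gradients leave the agents, a simple average of those gradients is broadcast back, and every agent applies the same averaged update. Encryption and masking are omitted here (they were sketched earlier); the toy quadratic losses and the synthetic per-zone data stand in for the DQN losses of formula (7) and are assumptions of the example.

```python
# Hedged sketch of one federated round: local gradient computation per agent,
# gradient averaging at the center, identical update applied by every agent.
# The quadratic local losses stand in for the DQN losses F_z of formula (7).
import numpy as np

rng = np.random.default_rng(0)

class ZoneAgent:
    def __init__(self, n_params: int, target: np.ndarray):
        self.theta = rng.normal(size=n_params)   # current-network parameters
        self.target = target                     # stand-in for local zone data

    def local_gradient(self) -> np.ndarray:
        # gradient of the stand-in loss F_z = ||theta - target||^2 / 2
        return self.theta - self.target

    def apply(self, avg_grad: np.ndarray, eta: float = 0.5):
        self.theta -= eta * avg_grad             # update in the spirit of formula (9)

agents = [ZoneAgent(3, target=rng.normal(size=3)) for _ in range(4)]

for round_idx in range(20):                      # federated rounds
    grads = [a.local_gradient() for a in agents] # would travel encrypted and masked
    avg_grad = np.mean(grads, axis=0)            # gradient averaging at the center
    for a in agents:
        a.apply(avg_grad)

print(np.round(agents[0].theta, 3))              # every agent applies the same update
```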
The principle of the invention is explained as follows:
The agents comprise a main-network agent and distribution-network agents: the zone agent corresponding to the main-network zone is the main-network agent, and the zone agents corresponding to the distribution-network zones are the distribution-network agents, with the main-network agent also performing the function of the aggregation center in federal reinforcement learning. As shown in fig. 1, the framework diagram of the transmission and distribution frequency modulation resource cooperative control system based on the federal reinforcement learning algorithm divides the regional power grid according to the structure of one main network and a plurality of distribution networks, and a zone agent is set in each main-network and distribution-network dispatching center. On the basis of the basic structure, main functional orientation and connected power-supply characteristics of the main and distribution networks, this design optimizes the cooperative control of transmission and distribution frequency modulation resources. In the shift from the existing centralized control to a distributed mode, the computation and information-interaction platform is transferred from the original main-network dispatching center to the grid side represented by the distribution-network side; this relieves the communication and computation pressure on the main network acting as the higher-level dispatching center and gives full play to the active control capability of the distribution-network dispatching centers under the localized, activated characteristics of the future active distribution network. Secondly, because the distributed power sources participating in frequency modulation mostly come from enterprises and individual users who are highly sensitive to privacy and security, the federal reinforcement learning algorithm is used to solve the cooperation problem among the multiple agents; on the premise of ensuring user privacy and security, offline training shortens the online decision time, thereby meeting the real-time decision requirement of distributed execution.
An APC unit is a generator unit that automatically tracks power dispatch instructions within a specified output-adjustment range and adjusts its generation/consumption power in real time at a given adjustment rate, so as to meet the requirements of active power balance, frequency stability and tie-line power control of the power system.
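As a small illustration of this tracking behaviour (the numbers and function name are assumptions, not taken from this disclosure), the sketch below moves a unit's output one step toward a dispatch command while respecting an assumed output range and per-step adjustment rate:

```python
def track_dispatch(current_output, command, p_min, p_max, ramp_per_step):
    """Move one control step toward the dispatch command, limited by output range and ramp rate."""
    target = min(max(command, p_min), p_max)                 # clamp the command into the allowed range
    delta = max(-ramp_per_step, min(ramp_per_step, target - current_output))
    return current_output + delta

# a unit at 50 MW ordered to 80 MW with a 10 MW/step ramp reaches 60 MW after one step
print(track_dispatch(50.0, 80.0, 20.0, 100.0, 10.0))
```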
Example 1:
a cooperative control method for frequency modulation resources of transmission and distribution based on a federal reinforcement learning algorithm comprises the following steps:
s1, dividing a regional power grid into a main net area and a plurality of net distribution areas;
s2, setting an agent in a dispatching center of each zone, and establishing a corresponding DQN neural network model for each agent;
s3, each agent uses the local data of its corresponding zone to locally train the corresponding DQN neural network model, performs additive homomorphic encryption on the information of the locally trained DQN neural network model, and uploads the encrypted information to the aggregation center;
s4, the aggregation center performs gradient-average processing on all the encrypted information and sends the gradient-averaged information to each agent; each agent receives the gradient-averaged information and continues training its corresponding locally trained DQN neural network model accordingly, obtaining a trained DQN neural network model, and the frequency modulation instruction for each dispatched unit in the corresponding zone is obtained through the trained DQN neural network model.
In the step S3, when each agent uses the local data of the corresponding zone to locally train the DQN neural network model, the state space, action space and reward function of each agent are set according to the Markov decision process;
setting the state space of the z-number agent specifically comprises:
taking the total frequency-adjustment instruction, which determines the total frequency-response deviation in the frequency allocation process, as the state space of the z-number agent;
the state of the z-number agent at time t is s_{z,t};
The action space for setting the z-number intelligent agent specifically comprises:
setting the action space A_z over which the z-number agent can decide; all control behaviors of the z-number agent are selected from the action space A_z;
the control behavior a_{z,t} of the z-number agent at time t can be expressed as:
a_{z,t} = [P_{o,t}^{G}, P_{m,t}^{B}, P_{n,t}^{W}, P_{j,t}^{EV}]   (1)
in formula (1): P_{o,t}^{G} is the active output of the o-th thermal power unit controlled by the z-number agent at time t; P_{m,t}^{B} is the active output of the m-th energy storage device controlled by the z-number agent at time t; P_{n,t}^{W} is the active output of the n-th wind turbine generator controlled by the z-number agent at time t; P_{j,t}^{EV} is the active output of the j-th electric vehicle group controlled by the z-number agent at time t;
the setting of the rewarding function of the z-number agent specifically comprises the following steps:
setting rewards of the environment on the control behaviors of the z-number intelligent agent, aiming at minimizing deviation of the adjustment power instruction value and the power response value, and constructing a rewarding function of the z-number intelligent agent:
R_{z,t} = -Σ_{i=1}^{q} (ΔP_i^G - ΔP_i^R)^2   (2)
min F = (1/Q) Σ_{t=1}^{Q} Σ_{i=1}^{q} |ΔP_i^G - ΔP_i^R|   (3)
in formulas (2) to (3): R_{z,t} is the reward function of the z-number agent at time t; Q is the number of control periods; q is the number of APC units in the zone corresponding to the z-number agent; i is the i-th APC unit in the zone corresponding to the z-number agent; t is the t-th discrete control period; ΔP_i^G is the input adjustment power command value of the i-th APC unit in the zone corresponding to the z-number agent; ΔP_i^R is the power response value of the i-th APC unit in the zone corresponding to the z-number agent;
the cost function for the objective function min F is obtained by discounted cumulative summation:
V(a_z) = E[ Σ_{t'} γ_{t'} R_{z,t'} ]   (4)
in formula (4): V(a_z) is the average of all cumulative rewards generated by control behavior a_z; γ_{t'} ∈ [0,1], γ_{t'} is the discount coefficient; R_{z,t'} is the accumulation of the reward functions corresponding to a plurality of consecutive behaviors of the z-number agent.
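To ground the reward definition, the sketch below evaluates a per-period reward from the APC units' command and response values and then the discounted cumulative value; the negative squared deviation form and the numbers are assumptions consistent with the stated goal of minimizing the deviation between the adjustment power command value and the power response value.

```python
def period_reward(delta_p_cmd, delta_p_resp):
    """Reward for one control period: a larger command/response deviation gives a lower reward."""
    return -sum((g - r) ** 2 for g, r in zip(delta_p_cmd, delta_p_resp))

def discounted_value(rewards, gamma=0.95):
    """Discounted cumulative sum of the per-period rewards of one control behaviour."""
    value, discount = 0.0, 1.0
    for r in rewards:
        value += discount * r
        discount *= gamma
    return value

# two APC units, adjustment command vs. response (MW) over three control periods
cmds = [[5.0, -2.0], [4.0, -1.5], [3.0, -1.0]]
resps = [[4.6, -2.2], [4.0, -1.4], [2.5, -1.3]]
print(discounted_value([period_reward(c, r) for c, r in zip(cmds, resps)]))
```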
In the step S3, the local training of the DQN neural network model by the z-number agent using the local data of the corresponding zone specifically comprises:
s31, initializing current network parameters of a corresponding DQN neural network model by a z-number agent, and copying a target network with the same structure;
s32, the z-number agent trains the DQN neural network model according to the state data of the 96 time periods within a day of the corresponding zone, and updates the parameters of the target network.
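A minimal PyTorch sketch of step S31, assuming a small fully connected Q network (the layer sizes are placeholders): the current network is initialized and the target network is created as a copy with the same structure and parameters.

```python
import copy
import torch.nn as nn

def build_dqn(state_dim, action_dim, hidden=64):
    """Step S31: build the current Q network and copy it to obtain the target network."""
    current_net = nn.Sequential(
        nn.Linear(state_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, action_dim),
    )
    target_net = copy.deepcopy(current_net)   # same structure, same initial parameters
    return current_net, target_net
```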
In the step S32, the training of the DQN neural network model by the z-number agent with the state data of the 96 time periods within a day of the corresponding zone comprises:
s321, selecting state data of a time period from state data of 96 time periods in the day as the current state S of the z-number intelligent agent t
S322, based on the current state s_t of the z-number agent, trial and error is performed with an ε-greedy strategy: with probability ε the control behavior a_t is selected by a random strategy, and with probability 1-ε the currently optimal control behavior a_t* is selected, where:
a_t* = arg max_{a∈A_z} Q(s_t, a)   (5);
s323, according to the selected control behavior a_t and the current network in the DQN neural network model, calculating the reward r_t obtained after executing the control behavior a_t, and calculating the Q value with the following function:
Q(s_t, a_t) = Q(s_t, a_t) + η[r_t + μ·max Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)]   (6);
in formulas (5) and (6): Q(s_t, a_t) is the current-network Q value; max Q(s_{t+1}, a_{t+1}) is the target-network Q value; η is the learning rate; μ is the reward attenuation coefficient;
s324, according to the selected control behavior a_t, obtaining the next state s_{t+1} returned by the environment after the z-number agent executes the selected control behavior a_t, forming an experience sample (s_t, a_t, r_t, s_{t+1}), and storing the experience sample in the experience replay pool;
s325, updating the current state of the z-number agent to be the next state returned by the environment, and repeating the steps S322-S324 until the experience playback pool is full;
S326, after the experience playback pool is full, extracting omega experience samples from the experience playback pool for calculation, and updating the loss function:
F_z = (1/ω) Σ_{i=1}^{ω} [ r_{i,z} + μ·max Q'(s_{z,i+1}, a_{z,i+1}) - Q(s_{z,i}, a_{z,i}) ]^2   (7);
in formula (7): F_z is the loss function; r_{i,z} is the reward function of the z-number agent; Q(s_{z,i}, a_{z,i}) is the current-network Q value; max Q'(s_{z,i+1}, a_{z,i+1}) is the target-network Q value corresponding to the experience sample.
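A sketch of the sampled-loss computation of step S326, written in PyTorch and reusing the experience-tuple layout of the earlier collection sketch; the mean squared temporal-difference error below is the standard DQN form and is only an assumed reading of formula (7).

```python
import random
import torch
import torch.nn.functional as F

def dqn_loss(current_net, target_net, pool, batch_size=32, mu=0.9):
    """Mean squared TD error over a batch of omega experience samples drawn from the replay pool."""
    batch = random.sample(list(pool), batch_size)
    _, states, actions, rewards, next_states = zip(*batch)
    s = torch.tensor(states, dtype=torch.float32)
    a = torch.tensor(actions, dtype=torch.int64).unsqueeze(1)
    r = torch.tensor(rewards, dtype=torch.float32)
    s_next = torch.tensor(next_states, dtype=torch.float32)
    q_current = current_net(s).gather(1, a).squeeze(1)            # Q(s_i, a_i) from the current network
    q_target = r + mu * target_net(s_next).max(dim=1).values      # r_i + mu * max Q'(s_{i+1}, .)
    return F.mse_loss(q_current, q_target.detach())
```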
In the step S3, the addition homomorphic encryption is performed on the information of the locally trained DQN neural network model, and the uploading aggregation center of the encrypted information specifically includes:
s34, each agent encrypts the corresponding loss function in its locally trained DQN neural network model with the Paillier additive homomorphic encryption public key K to obtain an encrypted loss function;
s35, each agent transmits the encrypted loss function to the aggregation center.
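A minimal sketch of steps S34-S35 using the python-paillier (phe) package, which implements Paillier additive homomorphic encryption; the package choice, key length and loss values are assumptions for illustration, not requirements of this disclosure.

```python
from phe import paillier

# the public key K is generated once and shared with all agents; the private key stays with the decryptor
public_key, private_key = paillier.generate_paillier_keypair(n_length=2048)

def encrypt_loss(loss_value, pub=public_key):
    """Step S34: an agent encrypts its local loss F_z with the shared public key K."""
    return pub.encrypt(loss_value)

# encrypted numbers can be added without decryption, which is what the aggregation step relies on
enc_losses = [encrypt_loss(f) for f in (0.42, 0.37, 0.55)]
enc_sum = sum(enc_losses[1:], enc_losses[0])
print(private_key.decrypt(enc_sum) / len(enc_losses))   # check: the averaged loss after decryption
```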
In step S4, the aggregation center performs gradient average processing on all the encrypted information, and sends the information after gradient average processing to each agent, each agent receives the information after gradient average processing, and performs subsequent training on the corresponding locally trained DQN neural network model according to the information after gradient average processing, which specifically includes:
s41, the aggregation center calculates the comprehensive loss function F̄ from the encrypted loss functions sent by the agents:
[[F̄]] = (1/Y) Σ_{z=1}^{Y} [[F_z]]   (8)
in formula (8): [[·]] denotes encryption under the public key K, and Σ[[F_z]] represents the summation of the multiple encrypted loss functions; R_{y,z} is the reward function of agent z; max Q'(s_{z,i+1}, a_{z,i+1}) is the target-network Q value corresponding to the z-number agent; Q(s_{z,i}, a_{z,i}) is the current-network Q value corresponding to the z-number agent; η is the learning rate; μ is the reward attenuation coefficient; Y is the total number of agents;
s42, the aggregation center transmits the comprehensive loss function F̄ to each agent, and each agent calculates the gradient information of the current network in its corresponding locally trained DQN neural network model with respect to the comprehensive loss function F̄;
s43, each agent adds a security mask to the gradient information and transmits the masked gradient information to the aggregation center;
s44, after receiving the masked gradient information, the aggregation center removes the homomorphic encryption from it and returns the decrypted result to the corresponding agent;
s45, each agent receives the decrypted result and removes the security mask from it to obtain unencrypted gradient information, and each agent uses the unencrypted gradient information to update the current-network parameters θ_{z,t} of its corresponding locally trained DQN neural network model.
In the step S45, the formula for updating the current-network parameters θ_{z,t} of the corresponding locally trained DQN neural network model is:
θ_{z,t} = θ_{z,t-1} - η·∂F̄/∂θ_{z,t-1}   (9)
in formula (9): F̄ is the comprehensive loss function; θ_{z,t} are the updated current-network parameters of the z-number agent; θ_{z,t-1} are the current-network parameters of the z-number agent before the update.
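The masking and update exchange of steps S43-S45 can be sketched as follows; the additive random mask and the plain gradient step of formula (9) are assumed mechanics, and the homomorphic layer is omitted here because it is covered by the encryption sketch above.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_mask(gradient):
    """Step S43: hide the gradient behind an additive random security mask before uploading."""
    mask = rng.normal(size=gradient.shape)
    return gradient + mask, mask

def update_parameters(theta_prev, gradient, eta=0.01):
    """Step S45 / formula (9): theta_{z,t} = theta_{z,t-1} - eta * gradient of the comprehensive loss."""
    return theta_prev - eta * gradient

# one agent's round: mask, upload, receive the decrypted result back, unmask, update
theta = np.array([0.5, -1.2, 0.3])
grad = np.array([0.10, -0.05, 0.20])
masked, mask = add_mask(grad)
returned = masked                          # what the aggregation center returns after decryption
theta = update_parameters(theta, returned - mask)
print(theta)
```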
Example 2:
example 2 is substantially the same as example 1 except that:
the control behaviors in the action space comply with the power supply characteristic constraint conditions and the system balance constraint conditions; a feasibility-clipping sketch follows the constraint list below.
The power supply characteristic constraint condition specifically includes:
thermal power generating unit operation constraint:
P_{i,min}^{G} ≤ P_{i,t}^{G} ≤ P_{i,max}^{G},   -R_i^{G} ≤ P_{i,t}^{G} - P_{i,t-1}^{G} ≤ R_i^{G}   (10)
in formula (10): P_{i,max}^{G} and P_{i,min}^{G} are respectively the upper limit and lower limit of the output of the i-th thermal power unit; R_i^{G} is the climbing rate of the i-th thermal power unit; P_{i,t}^{G} is the output of the i-th thermal power unit at time t; P_{i,t-1}^{G} is the output of the i-th thermal power unit at time t-1;
energy storage device operation constraints:
E_{m,min}^{B} ≤ E_{m,t}^{B} ≤ E_{m,max}^{B},
P_{m,min}^{ch} ≤ P_{m,t}^{ch} ≤ P_{m,max}^{ch},
P_{m,min}^{dis} ≤ P_{m,t}^{dis} ≤ P_{m,max}^{dis},
E_{m,t+1}^{B} = (1 - σ_m) E_{m,t}^{B} + η_m^{ch} P_{m,t}^{ch} - P_{m,t}^{dis}/η_m^{dis}   (11)
in formula (11): E_{m,min}^{B} and E_{m,max}^{B} are the capacity constraint range of the m-th energy storage device; E_{m,t}^{B} is the capacity of the m-th energy storage device at time t; P_{m,t}^{ch} is the charging power of the m-th energy storage device at time t; P_{m,min}^{ch} and P_{m,max}^{ch} are the charging power constraint range of the m-th energy storage device; P_{m,t}^{dis} is the discharge power of the m-th energy storage device at time t; P_{m,min}^{dis} and P_{m,max}^{dis} are the discharge power constraint range of the m-th energy storage device; E_{m,t+1}^{B} is the battery capacity of the m-th energy storage device at time t+1; σ_m is the self-discharge efficiency of the m-th energy storage device; η_m^{ch} is the charging efficiency of the m-th energy storage device; η_m^{dis} is the discharge efficiency of the m-th energy storage device;
running constraint of distributed wind farm units:
P_{n,min}^{W} ≤ P_{n,t}^{W} ≤ P_{n,max}^{W}   (12)
in formula (12): P_{n,min}^{W} is the lower limit of the output power of the n-th wind driven generator; P_{n,max}^{W} is the upper limit of the output power of the n-th wind driven generator; P_{n,t}^{W} is the output power of the n-th wind driven generator at time t;
electric automobile group state constraint:
SOC_{j,min} ≤ SOC_{j,t} ≤ SOC_{j,max}   (13)
P_{j,t,min}^{EV} ≤ P_{j,t}^{EV} ≤ P_{j,t,max}^{EV}   (14)
ΔP_{M,t,min}^{EV} ≤ ΔP_{M,t}^{EV} ≤ ΔP_{M,t,max}^{EV}   (15)
ΔP_{M,t}^{EV} = Σ_{j} P_{j,t}^{EV}   (16)
in formulas (13) to (16): SOC_{j,min} and SOC_{j,max} are the SOC constraint range of the j-th electric automobile; SOC_{j,t} is the SOC of the j-th electric automobile at time t; ΔP_{M,min}^{EV} and ΔP_{M,max}^{EV} are the constraint range of the output power increment of the M-th electric vehicle charging station; ΔP_{M}^{EV} is the output power increment of the M-th electric vehicle charging station; Δt_j^{EV} is the period of time during which a single electric vehicle is connected to the charging station; P_{j,t,max}^{EV} and P_{j,t,min}^{EV} are the upper limit and lower limit of the charge and discharge power of the j-th electric automobile at time t, and are influenced by the number of vehicles j in the charging station, the SOC capacity of a single electric vehicle and its charge/discharge state; P_j^{EV,c} is the rated charging power of the j-th electric automobile; P_j^{EV,d} is the rated discharge power of the j-th electric automobile; ΔP_{M,t}^{EV} is the output power increment of the charging pile of the M-th electric vehicle at time t; ΔP_{M,t,max}^{EV} is the upper limit of the output power increment of the charging pile of the M-th electric vehicle at time t; ΔP_{M,t,min}^{EV} is the lower limit of the output power increment of the charging pile of the M-th electric vehicle at time t; P_{j,t}^{EV} is the power of the j-th electric automobile at time t.
The system balance constraint condition specifically comprises:
Σ_{i=1}^{N_g} P_{i,t}^{G} + Σ_{m=1}^{N_b} P_{m,t}^{B} + Σ_{n=1}^{N_w} P_{n,t}^{W} + Σ_{j=1}^{N_e} P_{j,t}^{EV} = P_t^{L}   (17)
Σ_{l∈O2(l)=b} P_{l,t} - Σ_{l∈O1(l)=b} P_{l,t} + P_{b,t}^{G} - P_{b,t}^{L} = 0,
Σ_{l∈O2(l)=b} Q_{l,t} - Σ_{l∈O1(l)=b} Q_{l,t} + Q_{b,t}^{G} - Q_{b,t}^{L} = 0,
U_{b,t} = U_0 - Σ_{l∈O2(l)=b} (r_l P_{l,t} + x_l Q_{l,t})/U_0   (18)
P_{l,min} ≤ P_{l,t} ≤ P_{l,max},   Q_{l,min} ≤ Q_{l,t} ≤ Q_{l,max},   U_{b,min} ≤ U_{b,t} ≤ U_{b,max}   (19)
in formulas (17) to (19): P_{i,t}^{G}, P_{m,t}^{B}, P_{n,t}^{W} and P_{j,t}^{EV} respectively represent the active output at time t of the i-th thermal power unit, the m-th energy storage device, the n-th wind turbine group and the j-th electric automobile; N_g, N_b, N_w and N_e respectively represent the number of thermal power units, energy storage devices, wind power units and electric automobiles; P_t^{L} represents the load disturbance at time t; O1(l)=b is the set of branches whose head-end node is node b; O2(l)=b is the set of branches whose end node is node b; P_{l,t} and Q_{l,t} are respectively the active power and reactive power of branch l at time t; r_l and x_l are respectively the resistance and reactance of branch l; U_0 is the voltage amplitude of the slack node at time t; U_{b,t} is the voltage amplitude of node b at time t; P_{b,t}^{G} is the active power of the generator set connected to node b; Q_{b,t}^{G} is the reactive power of the generator set connected to node b; P_{b,t}^{L} is the active power of the load connected to node b; Q_{b,t}^{L} is the reactive power of the load connected to node b; P_{l,max} and P_{l,min} are respectively the upper and lower limits of the active power of branch l; Q_{l,max} and Q_{l,min} are respectively the upper and lower limits of the reactive power of branch l; U_{b,max} and U_{b,min} are respectively the upper and lower limits of the voltage at node b.
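As referenced above the constraint list, the sketch below shows one simple way an agent could keep a chosen control behaviour inside these limits by clamping each component to its constraint range before execution; only the box-type limits of formulas (10), (12) and (13)-(15) are illustrated, and the limit values are placeholders.

```python
def clamp(value, lower, upper):
    """Project one action component onto its [lower, upper] constraint range."""
    return min(max(value, lower), upper)

def enforce_limits(action, limits):
    """Clamp every named component of a control behaviour to its constraint range."""
    return {name: clamp(value, *limits[name]) for name, value in action.items()}

# illustrative limits (MW) for one thermal unit, one storage device, one wind turbine and one EV group
limits = {"thermal": (20.0, 100.0), "storage": (-5.0, 5.0), "wind": (0.0, 30.0), "ev": (-3.0, 3.0)}
action = {"thermal": 120.0, "storage": -7.5, "wind": 18.0, "ev": 4.2}
print(enforce_limits(action, limits))
```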
The above description is merely of preferred embodiments of the present invention, and the scope of the present invention is not limited to the above embodiments, but all equivalent modifications or variations according to the present disclosure will be within the scope of the claims.

Claims (10)

1. A transmission and distribution frequency modulation resource cooperative control method based on a federal reinforcement learning algorithm is characterized by comprising the following steps of:
the control method comprises the following steps:
s1, dividing a regional power grid into a main net area and a plurality of net distribution areas;
s2, setting an agent in a dispatching center of each zone, and establishing a corresponding DQN neural network model for each agent;
s3, each agent respectively uses local data of a corresponding patch to perform local training on the corresponding DQN neural network model, performs homomorphic encryption on the information of the DQN neural network model after the local training, and uploads the encrypted information to the aggregation center;
s4, the aggregation center carries out gradient average processing on all the encrypted information, sends the information subjected to gradient average processing to each intelligent agent, receives the information subjected to gradient average processing, carries out subsequent training on the corresponding local trained DQN neural network model according to the information subjected to gradient average processing, and obtains a trained DQN neural network model, and the frequency modulation instruction of each unit which is subjected to scheduling is obtained through the trained DQN neural network model.
2. The federal reinforcement learning algorithm-based transmission and distribution frequency modulation resource cooperative control method according to claim 1, wherein the method is characterized by comprising the following steps of:
in the step S3, when each agent uses the local data of the corresponding zone to locally train the DQN neural network model, the state space, action space and reward function of each agent are set according to the Markov decision process;
setting the state space of the z-number agent specifically comprises:
taking the total frequency-adjustment instruction, which determines the total frequency-response deviation in the frequency allocation process, as the state space of the z-number agent;
the state of the z-number agent at time t is s_{z,t};
The action space for setting the z-number intelligent agent specifically comprises:
setting the action space A_z over which the z-number agent can decide; all control behaviors of the z-number agent are selected from the action space A_z;
the control behavior a_{z,t} of the z-number agent at time t can be expressed as:
a_{z,t} = [P_{o,t}^{G}, P_{m,t}^{B}, P_{n,t}^{W}, P_{j,t}^{EV}]   (1)
in formula (1): P_{o,t}^{G} is the active output of the o-th thermal power unit controlled by the z-number agent at time t; P_{m,t}^{B} is the active output of the m-th energy storage device controlled by the z-number agent at time t; P_{n,t}^{W} is the active output of the n-th wind turbine generator controlled by the z-number agent at time t; P_{j,t}^{EV} is the active output of the j-th electric vehicle group controlled by the z-number agent at time t;
The setting of the rewarding function of the z-number agent specifically comprises the following steps:
setting rewards of the environment on the control behaviors of the z-number intelligent agent, aiming at minimizing deviation of the adjustment power instruction value and the power response value, and constructing a rewarding function of the z-number intelligent agent:
R_{z,t} = -Σ_{i=1}^{q} (ΔP_i^G - ΔP_i^R)^2   (2)
min F = (1/Q) Σ_{t=1}^{Q} Σ_{i=1}^{q} |ΔP_i^G - ΔP_i^R|   (3)
in formulas (2) to (3): R_{z,t} is the reward function of the z-number agent at time t; Q is the number of control periods; q is the number of APC units in the zone corresponding to the z-number agent; i is the i-th APC unit in the zone corresponding to the z-number agent; t is the t-th discrete control period; ΔP_i^G is the input adjustment power command value of the i-th APC unit in the zone corresponding to the z-number agent; ΔP_i^R is the power response value of the i-th APC unit in the zone corresponding to the z-number agent;
the cost function for the objective function min F is obtained by discounted cumulative summation:
V(a_z) = E[ Σ_{t'} γ_{t'} R_{z,t'} ]   (4)
in formula (4): V(a_z) is the average of all cumulative rewards generated by control behavior a_z; γ_{t'} ∈ [0,1], γ_{t'} is the discount coefficient; R_{z,t'} is the accumulation of the reward functions corresponding to a plurality of consecutive behaviors of the z-number agent.
3. The federal reinforcement learning algorithm-based transmission and distribution frequency modulation resource cooperative control method according to claim 2, wherein the method is characterized by comprising the following steps of:
in the step S3, the local training of the corresponding DQN neural network model by the z-number agent using the local data of the corresponding zone specifically comprises:
S31, initializing current network parameters of a corresponding DQN neural network model by a z-number agent, and copying a target network with the same structure as the current network;
s32, the z-number agent trains the DQN neural network model according to the state data of the 96 time periods within a day of the corresponding zone, and updates the parameters of the target network.
4. The method for cooperatively controlling the frequency modulation resources of the transmission and distribution system based on the federal reinforcement learning algorithm according to claim 3, wherein the method comprises the following steps:
in the step S32, the training of the DQN neural network model by the z-number agent with the state data of the 96 time periods within a day of the corresponding zone comprises:
s321, selecting state data of a time period from state data of 96 time periods in the day as the current state S of the z-number intelligent agent t
S322, based on the current state s_t of the z-number agent, trial and error is performed with an ε-greedy strategy: with probability ε the control behavior a_t is selected by a random strategy, and with probability 1-ε the currently optimal control behavior a_t* is selected, where:
a_t* = arg max_{a∈A_z} Q(s_t, a)   (5);
s323, according to the selected control behavior a_t, calculating the reward r_t obtained after executing the control behavior a_t, and calculating the Q value with the following function:
Q(s_t, a_t) = Q(s_t, a_t) + η[r_t + μ·max Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)]   (6);
in formulas (5) and (6): Q(s_t, a_t) is the current-network Q value; max Q(s_{t+1}, a_{t+1}) is the target-network Q value; η is the learning rate; μ is the reward attenuation coefficient;
s324, according to the selected control behavior a_t, obtaining the next state s_{t+1} returned by the environment after the z-number agent executes the selected control behavior a_t, forming an experience sample (s_t, a_t, r_t, s_{t+1}), and storing the experience sample in the experience replay pool;
s325, updating the current state of the z-number agent to be the next state returned by the environment, and repeating the steps S322-S324 until the experience playback pool is full;
s326, after the experience playback pool is full, extracting omega experience samples from the experience playback pool for calculation, and updating the loss function:
F_z = (1/ω) Σ_{i=1}^{ω} [ r_{i,z} + μ·max Q'(s_{z,i+1}, a_{z,i+1}) - Q(s_{z,i}, a_{z,i}) ]^2   (7);
in formula (7): F_z is the loss function; r_{i,z} is the reward function of the z-number agent; Q(s_{z,i}, a_{z,i}) is the current-network Q value; max Q'(s_{z,i+1}, a_{z,i+1}) is the target-network Q value.
5. The federal reinforcement learning algorithm-based transmission and distribution frequency modulation resource cooperative control method according to claim 1, wherein the method is characterized by comprising the following steps of:
in the step S3, homomorphic encryption is performed on the information of the locally trained DQN neural network model, and the uploading aggregation center of the encrypted information specifically includes:
s34, each agent encrypts the corresponding loss function in its locally trained DQN neural network model with the Paillier additive homomorphic encryption public key K to obtain an encrypted loss function;
S35, each agent transmits the encrypted loss function to the aggregation center.
6. The federal reinforcement learning algorithm-based transmission and distribution frequency modulation resource cooperative control method according to claim 5, wherein the method is characterized by comprising the following steps:
in step S4, the aggregation center performs gradient average processing on all the encrypted information, and sends the information after gradient average processing to each agent, each agent receives the information after gradient average processing, and performs subsequent training on the corresponding locally trained DQN neural network model according to the information after gradient average processing, which specifically includes:
s41, the aggregation center calculates the comprehensive loss function F̄ according to the encrypted loss functions sent by the agents:
[[F̄]] = (1/Y) Σ_{z=1}^{Y} [[F_z]]   (8)
in formula (8): [[·]] denotes encryption under the public key K, and Σ[[F_z]] represents the summation of the multiple encrypted loss functions; R_{y,z} is the reward function of agent z; max Q'(s_{z,i+1}, a_{z,i+1}) is the target-network Q value corresponding to the z-number agent; Q(s_{z,i}, a_{z,i}) is the current-network Q value corresponding to the z-number agent; η is the learning rate; μ is the reward attenuation coefficient; Y is the total number of agents;
s42, the aggregation center transmits the comprehensive loss function F̄ to each agent, and each agent calculates the gradient information of the current network in its corresponding locally trained DQN neural network model with respect to the comprehensive loss function F̄;
s43, each agent adds a security mask to the gradient information and transmits the masked gradient information to the aggregation center;
s44, after receiving the masked gradient information, the aggregation center removes the homomorphic encryption from it and returns the decrypted result to the corresponding agent;
s45, each agent receives the decrypted result and removes the security mask from it to obtain unencrypted gradient information, and each agent uses the unencrypted gradient information to update the current-network parameters θ_{z,t} of its corresponding locally trained DQN neural network model.
7. The federal reinforcement learning algorithm-based transmission and distribution frequency modulation resource cooperative control method according to claim 6, wherein the method is characterized by comprising the following steps:
in the step S45, the formula for updating the current-network parameters θ_{z,t} of the corresponding locally trained DQN neural network model is:
θ_{z,t} = θ_{z,t-1} - η·∂F̄/∂θ_{z,t-1}   (9)
in formula (9): F̄ is the comprehensive loss function; θ_{z,t} are the updated current-network parameters of the z-number agent; θ_{z,t-1} are the current-network parameters of the z-number agent before the update.
8. The federal reinforcement learning algorithm-based transmission and distribution frequency modulation resource cooperative control method according to claim 2, wherein the method is characterized by comprising the following steps of:
The control behavior in the action space accords with the power supply characteristic constraint condition and the system balance constraint condition.
9. The federal reinforcement learning algorithm-based transmission and distribution frequency modulation resource cooperative control method according to claim 8, wherein the method is characterized by comprising the following steps:
the power supply characteristic constraint condition specifically includes:
thermal power generating unit operation constraint:
P_{i,min}^{G} ≤ P_{i,t}^{G} ≤ P_{i,max}^{G},   -R_i^{G} ≤ P_{i,t}^{G} - P_{i,t-1}^{G} ≤ R_i^{G}   (10)
in formula (10): P_{i,max}^{G} and P_{i,min}^{G} are respectively the upper limit and lower limit of the output of the i-th thermal power unit; R_i^{G} is the climbing rate of the i-th thermal power unit; P_{i,t}^{G} is the output of the i-th thermal power unit at time t; P_{i,t-1}^{G} is the output of the i-th thermal power unit at time t-1;
energy storage device operation constraints:
E_{m,min}^{B} ≤ E_{m,t}^{B} ≤ E_{m,max}^{B},
P_{m,min}^{ch} ≤ P_{m,t}^{ch} ≤ P_{m,max}^{ch},
P_{m,min}^{dis} ≤ P_{m,t}^{dis} ≤ P_{m,max}^{dis},
E_{m,t+1}^{B} = (1 - σ_m) E_{m,t}^{B} + η_m^{ch} P_{m,t}^{ch} - P_{m,t}^{dis}/η_m^{dis}   (11)
in formula (11): E_{m,min}^{B} and E_{m,max}^{B} are the capacity constraint range of the m-th energy storage device; E_{m,t}^{B} is the capacity of the m-th energy storage device at time t; P_{m,t}^{ch} is the charging power of the m-th energy storage device at time t; P_{m,min}^{ch} and P_{m,max}^{ch} are the charging power constraint range of the m-th energy storage device; P_{m,t}^{dis} is the discharge power of the m-th energy storage device at time t; P_{m,min}^{dis} and P_{m,max}^{dis} are the discharge power constraint range of the m-th energy storage device; E_{m,t+1}^{B} is the battery capacity of the m-th energy storage device at time t+1; σ_m is the self-discharge efficiency of the m-th energy storage device; η_m^{ch} is the charging efficiency of the m-th energy storage device; η_m^{dis} is the discharge efficiency of the m-th energy storage device;
running constraint of distributed wind farm units:
P_{n,min}^{W} ≤ P_{n,t}^{W} ≤ P_{n,max}^{W}   (12)
in formula (12): P_{n,min}^{W} is the lower limit of the output power of the n-th wind driven generator; P_{n,max}^{W} is the upper limit of the output power of the n-th wind driven generator; P_{n,t}^{W} is the output power of the n-th wind driven generator at time t;
electric automobile group state constraint:
SOC_{j,min} ≤ SOC_{j,t} ≤ SOC_{j,max}   (13)
P_{j,t,min}^{EV} ≤ P_{j,t}^{EV} ≤ P_{j,t,max}^{EV}   (14)
ΔP_{M,t,min}^{EV} ≤ ΔP_{M,t}^{EV} ≤ ΔP_{M,t,max}^{EV}   (15)
ΔP_{M,t}^{EV} = Σ_{j} P_{j,t}^{EV}   (16)
in formulas (13) to (16): SOC_{j,min} and SOC_{j,max} are the SOC constraint range of the j-th electric automobile; SOC_{j,t} is the SOC of the j-th electric automobile at time t; ΔP_{M,min}^{EV} and ΔP_{M,max}^{EV} are the constraint range of the output power increment of the M-th electric vehicle charging station; ΔP_{M}^{EV} is the output power increment of the M-th electric vehicle charging station; Δt_j^{EV} is the period of time during which a single electric vehicle is connected to the charging station; P_{j,t,max}^{EV} and P_{j,t,min}^{EV} are the upper limit and lower limit of the charge and discharge power of the j-th electric automobile at time t, and are influenced by the number of vehicles j in the charging station, the SOC capacity of a single electric vehicle and its charge/discharge state; P_j^{EV,c} is the rated charging power of the j-th electric automobile; P_j^{EV,d} is the rated discharge power of the j-th electric automobile; ΔP_{M,t}^{EV} is the output power increment of the charging pile of the M-th electric vehicle at time t; ΔP_{M,t,max}^{EV} is the upper limit of the output power increment of the charging pile of the M-th electric vehicle at time t; ΔP_{M,t,min}^{EV} is the lower limit of the output power increment of the charging pile of the M-th electric vehicle at time t; P_{j,t}^{EV} is the power of the j-th electric automobile at time t.
10. The federal reinforcement learning algorithm-based transmission and distribution frequency modulation resource cooperative control method according to claim 8, wherein the method is characterized by comprising the following steps:
the system balance constraint condition specifically comprises:
Σ_{i=1}^{N_g} P_{i,t}^{G} + Σ_{m=1}^{N_b} P_{m,t}^{B} + Σ_{n=1}^{N_w} P_{n,t}^{W} + Σ_{j=1}^{N_e} P_{j,t}^{EV} = P_t^{L}   (17)
Σ_{l∈O2(l)=b} P_{l,t} - Σ_{l∈O1(l)=b} P_{l,t} + P_{b,t}^{G} - P_{b,t}^{L} = 0,
Σ_{l∈O2(l)=b} Q_{l,t} - Σ_{l∈O1(l)=b} Q_{l,t} + Q_{b,t}^{G} - Q_{b,t}^{L} = 0,
U_{b,t} = U_0 - Σ_{l∈O2(l)=b} (r_l P_{l,t} + x_l Q_{l,t})/U_0   (18)
P_{l,min} ≤ P_{l,t} ≤ P_{l,max},   Q_{l,min} ≤ Q_{l,t} ≤ Q_{l,max},   U_{b,min} ≤ U_{b,t} ≤ U_{b,max}   (19)
in formulas (17) to (19): P_{i,t}^{G}, P_{m,t}^{B}, P_{n,t}^{W} and P_{j,t}^{EV} respectively represent the active output at time t of the i-th thermal power unit, the m-th energy storage device, the n-th wind turbine group and the j-th electric automobile; N_g, N_b, N_w and N_e respectively represent the number of thermal power units, energy storage devices, wind power units and electric automobiles; P_t^{L} represents the load disturbance at time t; O1(l)=b is the set of branches whose head-end node is node b; O2(l)=b is the set of branches whose end node is node b; P_{l,t} and Q_{l,t} are respectively the active power and reactive power of branch l at time t; r_l and x_l are respectively the resistance and reactance of branch l; U_0 is the voltage amplitude of the slack node at time t; U_{b,t} is the voltage amplitude of node b at time t; P_{b,t}^{G} is the active power of the generator set connected to node b; Q_{b,t}^{G} is the reactive power of the generator set connected to node b; P_{b,t}^{L} is the active power of the load connected to node b; Q_{b,t}^{L} is the reactive power of the load connected to node b; P_{l,max} and P_{l,min} are respectively the upper and lower limits of the active power of branch l; Q_{l,max} and Q_{l,min} are respectively the upper and lower limits of the reactive power of branch l; U_{b,max} and U_{b,min} are respectively the upper and lower limits of the voltage at node b.
CN202211728739.5A 2022-12-30 2022-12-30 Transmission and distribution frequency modulation resource cooperative control method based on federal reinforcement learning algorithm Pending CN116054285A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211728739.5A CN116054285A (en) 2022-12-30 2022-12-30 Transmission and distribution frequency modulation resource cooperative control method based on federal reinforcement learning algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211728739.5A CN116054285A (en) 2022-12-30 2022-12-30 Transmission and distribution frequency modulation resource cooperative control method based on federal reinforcement learning algorithm

Publications (1)

Publication Number Publication Date
CN116054285A true CN116054285A (en) 2023-05-02

Family

ID=86119362

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211728739.5A Pending CN116054285A (en) 2022-12-30 2022-12-30 Transmission and distribution frequency modulation resource cooperative control method based on federal reinforcement learning algorithm

Country Status (1)

Country Link
CN (1) CN116054285A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117151308A (en) * 2023-10-30 2023-12-01 国网浙江省电力有限公司杭州供电公司 Comprehensive energy system optimal scheduling method and system based on federal reinforcement learning

Similar Documents

Publication Publication Date Title
Jia et al. Coordinated control for EV aggregators and power plants in frequency regulation considering time-varying delays
CN112117760A (en) Micro-grid energy scheduling method based on double-Q-value network deep reinforcement learning
CN110826880B (en) Active power distribution network optimal scheduling method for large-scale electric automobile access
CN110137981B (en) Distributed energy storage aggregator AGC method based on consistency algorithm
CN107069776A (en) A kind of energy storage prediction distributed control method of smooth microgrid dominant eigenvalues
CN108376990B (en) Control method and system of energy storage power station
CN103904641B (en) The micro-electrical network intelligent power generation of isolated island control method based on correlated equilibrium intensified learning
CN116345577B (en) Wind-light-storage micro-grid energy regulation and optimization method, device and storage medium
CN112381424A (en) Multi-time scale active power optimization decision method for uncertainty of new energy and load
Alfaverh et al. Optimal vehicle-to-grid control for supplementary frequency regulation using deep reinforcement learning
CN107394798A (en) Electric automobile comprising Time-varying time-delays and generator group coordination control method for frequency
CN116054285A (en) Transmission and distribution frequency modulation resource cooperative control method based on federal reinforcement learning algorithm
CN115036963B (en) Two-stage demand response strategy for improving toughness of power distribution network
CN116154826A (en) Charging and discharging load control system for reducing and adjusting heavy overload of distribution transformer area
CN114336694B (en) Energy optimization control method for hybrid energy storage power station
CN116957294A (en) Scheduling method for virtual power plant to participate in electric power market transaction based on digital twin
CN108390387A (en) A kind of source lotus peak regulation control method of dynamic self-discipline decentralized coordinating
CN114566971A (en) Real-time optimal power flow calculation method based on near-end strategy optimization algorithm
CN109149658B (en) Independent micro-grid distributed dynamic economic dispatching method based on consistency theory
CN114039364A (en) Demand opportunity constraint-based distributed battery energy storage cluster frequency modulation method and device
CN109066702A (en) A kind of load bilayer control method based on response potentiality
CN112018798B (en) Multi-time scale autonomous operation method for power distribution network with regional energy storage station participating in disturbance stabilization
CN111327076B (en) Energy storage type fan scheduling response method based on distributed accounting
Skiparev et al. Reinforcement learning based MIMO controller for virtual inertia control in isolated microgrids
CN112165086B (en) Online optimization system of active power distribution network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination