CN115333111A - Multi-region power grid collaborative optimization method, system, equipment and readable storage medium

Info

Publication number
CN115333111A
Authority
CN
China
Prior art keywords
region
generating unit
agent
thermal power
renewable energy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211109903.4A
Other languages
Chinese (zh)
Inventor
蒲天骄
董雷
韩笑
林灏
王新迎
马世乾
崇志强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Corp of China SGCC
China Electric Power Research Institute Co Ltd CEPRI
State Grid Tianjin Electric Power Co Ltd
North China Electric Power University
Electric Power Research Institute of State Grid Tianjin Electric Power Co Ltd
Original Assignee
State Grid Corp of China SGCC
China Electric Power Research Institute Co Ltd CEPRI
State Grid Tianjin Electric Power Co Ltd
North China Electric Power University
Electric Power Research Institute of State Grid Tianjin Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Corp of China SGCC, China Electric Power Research Institute Co Ltd CEPRI, State Grid Tianjin Electric Power Co Ltd, North China Electric Power University, Electric Power Research Institute of State Grid Tianjin Electric Power Co Ltd filed Critical State Grid Corp of China SGCC
Priority to CN202211109903.4A priority Critical patent/CN115333111A/en
Publication of CN115333111A publication Critical patent/CN115333111A/en
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H02GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02JCIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J3/00Circuit arrangements for ac mains or ac distribution networks
    • H02J3/04Circuit arrangements for ac mains or ac distribution networks for connecting networks of the same frequency but supplied from different sources
    • H02J3/06Controlling transfer of power between connected networks; Controlling sharing of load between connected networks
    • HELECTRICITY
    • H02GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02JCIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J3/00Circuit arrangements for ac mains or ac distribution networks
    • H02J3/38Arrangements for parallely feeding a single network by two or more generators, converters or transformers
    • H02J3/46Controlling of the sharing of output between the generators, converters, or transformers
    • H02J3/466Scheduling the operation of the generators, e.g. connecting or disconnecting generators to meet a given demand
    • HELECTRICITY
    • H02GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02JCIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J3/00Circuit arrangements for ac mains or ac distribution networks
    • H02J3/38Arrangements for parallely feeding a single network by two or more generators, converters or transformers
    • H02J3/46Controlling of the sharing of output between the generators, converters, or transformers
    • H02J3/48Controlling the sharing of the in-phase component
    • HELECTRICITY
    • H02GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02JCIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J3/00Circuit arrangements for ac mains or ac distribution networks
    • H02J3/38Arrangements for parallely feeding a single network by two or more generators, converters or transformers
    • H02J3/46Controlling of the sharing of output between the generators, converters, or transformers
    • H02J3/50Controlling the sharing of the out-of-phase component
    • HELECTRICITY
    • H02GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02JCIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J2203/00Indexing scheme relating to details of circuit arrangements for AC mains or AC distribution networks
    • H02J2203/10Power transmission or distribution systems management focussing at grid-level, e.g. load flow analysis, node profile computation, meshed network optimisation, active network management or spinning reserve management
    • HELECTRICITY
    • H02GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02JCIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J2203/00Indexing scheme relating to details of circuit arrangements for AC mains or AC distribution networks
    • H02J2203/20Simulating, e.g. planning, reliability check, modelling or computer assisted design [CAD]
    • HELECTRICITY
    • H02GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02JCIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J2300/00Systems for supplying or distributing electric power characterised by decentralized, dispersed, or local generation
    • H02J2300/20The dispersed energy generation being of renewable origin
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04SSYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Landscapes

  • Engineering & Computer Science (AREA)
  • Power Engineering (AREA)
  • Supply And Distribution Of Alternating Current (AREA)

Abstract

A multi-region power grid collaborative optimization method, system, equipment and readable storage medium are provided. The optimization method comprises: collecting observation data of each agent region; constructing a multi-region power grid collaborative optimization model containing renewable energy; designing the multi-region power grid collaborative optimization model containing renewable energy as a reinforcement learning model in a multi-agent environment according to a state space, an action space, an environment and a reward function; and solving the reinforcement learning model in the multi-agent environment and outputting the collaborative optimization result. The invention adopts a distributed model with multiple decision centers to reduce the communication pressure of the system, and can account for the double uncertainty of renewable energy and load during training, so the trained model copes better with uncertainty and can realize real-time decision-making of the system. Based on the "centralized training, distributed execution" characteristic of multiple agents, after training each regional power grid makes decisions on its controllable units only according to its own local observations, which helps protect the privacy of each regional power grid.

Description

Multi-region power grid collaborative optimization method, system, equipment and readable storage medium
Technical Field
The invention belongs to the technical field of regional power grid optimal scheduling, and particularly relates to a multi-region power grid collaborative optimization method, system, equipment and readable storage medium, which are particularly suitable for the collaborative optimization of multi-region power grids containing a high proportion of renewable energy.
Background
Constructing a new power system based on renewable energy has become an important measure for reducing carbon emissions. With the continuous increase of renewable energy penetration, the traditional power system dominated by coal-fired units with continuously controllable output is shifting to a novel power system dominated by renewable energy with strong uncertainty and weakly controllable output, which makes power and energy balance inside a regional power grid difficult. To guarantee safe and stable operation of the power grid, reduce investment in traditional backup units and reduce grid operating cost, interconnecting multiple regional power grids and fully exploiting the internal autonomy of each regional power grid together with inter-regional information interaction and collaborative optimization is of great significance. At present, centralized optimization methods are mainly adopted for the regional power grid collaborative optimization problem. A centralized optimization method needs to collect data of the whole system, make decisions through a dispatching center, and send the decisions to each execution unit to complete the optimization of the whole system. However, the penetration of distributed generation in the novel power system keeps increasing and its operation modes are changeable, so the controllability of the system decreases and global operation information is difficult to collect.
Therefore, the traditional centralized control method suffers from a large data acquisition volume, high communication cost and complex modeling, has certain limitations in coping with uncertainty and in solution efficiency, and is difficult to apply online to the control of complex systems containing many distributed power supplies.
Object of the Invention
The invention aims to provide a multi-region power grid collaborative optimization method, system, equipment and readable storage medium that address the above problems in the prior art, and, based on multi-agent deep reinforcement learning, to provide a more economical, accurate and reliable solution for the collaborative optimization of multi-region power grids containing a high proportion of renewable energy.
In order to achieve the purpose, the invention has the following technical scheme:
in a first aspect, a multi-region power grid collaborative optimization method is provided, including:
collecting observation data of each intelligent agent region in a multi-region power grid to be optimized;
constructing a multi-region power grid collaborative optimization model containing renewable energy sources on the basis of the observation data;
designing the multi-region power grid collaborative optimization model containing renewable energy sources into a reinforcement learning model under a multi-agent environment according to a state space, an action space, an environment and a reward function;
and solving the reinforcement learning model in the multi-agent environment, and outputting a collaborative optimization result to carry out collaborative optimization on the multi-region power grid.
A preferred scheme of the multi-region power grid collaborative optimization method further comprises a step of partitioning the power grid into agents: the node standard system is divided into different regions, a different agent is set for each region, and the regional agents serve as decision centers to realize the collaborative optimization operation of the multi-region power grid.
As a preferred solution of the multi-region grid collaborative optimization method, in the step of collecting observation data of each agent region in the multi-region grid to be optimized, the observation data of each agent region include: the load data p_L, the actual output power of the renewable energy units P_re, the actual output power of the thermal power generating units P_G, the operating cost coefficient of the thermal power generating units c_o, the wind and solar curtailment cost coefficient c_g, the start-up cost coefficient SU_i and the shutdown cost coefficient SD_i.
As a preferred scheme of the multi-region grid collaborative optimization method, in the step of constructing the multi-region grid collaborative optimization model containing renewable energy, the following objective function is established:

min F = Σ_{n=1}^{N} ( C_G^n + C_W^n )

where N is the number of divided regions, C_G^n is the cost of the thermal power generating units in region n, and C_W^n is the wind and solar curtailment penalty of the renewable energy units in region n. C_G^n and C_W^n are respectively:

C_G^n = Σ_{i=1}^{M_k} Σ_{t=1}^{T} [ u_i^t · c_o · P_G,i^t · Δt + C_i^{SU,t} + C_i^{SD,t} ]

where M_k is the number of thermal power generating units in region n; T is the calculation duration; u_i^t is the operating state of thermal power generating unit i in time period t, u_i^t = 1 indicating that unit i is running and u_i^t = 0 indicating that unit i is shut down; c_o is the operating cost coefficient of the thermal power generating units; P_G,i^t is the power output by thermal power generating unit i in time period t; Δt is the operating time interval; C_i^{SU,t} and C_i^{SD,t} are the start-up and shutdown costs incurred by unit i in time period t, determined by the cost SU_i of one start-up of thermal power generating unit i, the shutdown cost SD_i of thermal power generating unit i and the shutdown time T_i^off of thermal power generating unit i;

C_W^n = Σ_{j=1}^{M_n} Σ_{t=1}^{T} c_g · ( P_re,j^t,max − P_re,j^t ) · Δt

where M_n is the number of renewable energy units in region n; T is the calculation duration; c_g is the wind and solar curtailment penalty coefficient; P_re,j^t is the actual output power of renewable energy unit j in time period t; P_re,j^t,max is the upper limit of the output power of renewable energy unit j in time period t.
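By way of illustration only (not part of the claimed scheme), a minimal Python sketch of evaluating one region's contribution to this objective is given below; the data layout and the treatment of the start-up/shutdown terms as simple state-transition costs are assumptions.

```python
def region_cost(units, renewables, c_o, c_g, dt=1.0):
    """Cost of one region n: thermal cost C_G^n plus curtailment penalty C_W^n.

    units      : list of dicts, one per thermal unit, with keys
                 'u' (on/off state per period), 'p' (MW per period), 'su', 'sd'
    renewables : list of dicts with keys 'p' (actual MW) and 'p_max' (available MW)
    """
    c_thermal = 0.0
    for unit in units:
        u, p = unit["u"], unit["p"]
        for t in range(len(u)):
            c_thermal += u[t] * c_o * p[t] * dt          # operating cost
            if t > 0 and u[t] > u[t - 1]:
                c_thermal += unit["su"]                   # start-up cost (assumed form)
            if t > 0 and u[t] < u[t - 1]:
                c_thermal += unit["sd"]                   # shutdown cost (assumed form)
    c_curtail = sum(c_g * (g["p_max"][t] - g["p"][t]) * dt
                    for g in renewables for t in range(len(g["p"])))
    return c_thermal + c_curtail
```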
As a preferred scheme of the multi-region grid collaborative optimization method, in the step of constructing the multi-region grid collaborative optimization model containing renewable energy, equality constraints and inequality constraints are established for the objective function.

The equality constraints include the network power flow constraints and the power balance constraint.

The network power flow constraints are expressed as:

P_G,i − P_D,i = U_i · Σ_{j=1}^{N_e} U_j ( G_ij cos θ_ij + B_ij sin θ_ij )

Q_G,i − Q_D,i = U_i · Σ_{j=1}^{N_e} U_j ( G_ij sin θ_ij − B_ij cos θ_ij )

where P_G,i and Q_G,i are the active and reactive power injected at node i; P_D,i and Q_D,i are the active and reactive power consumed by the load at node i; N_e is the number of nodes of the distribution network; U_i and U_j are the voltage amplitudes of node i and node j; G_ij and B_ij are the conductance and susceptance, respectively; θ_ij is the phase angle difference between node i and node j.

The power balance constraint is expressed as:

Σ_{n=1}^{N} ( Σ_{i=1}^{M_k} P_G,i^t + Σ_{j=1}^{M_n} P_re,j^t ) = Σ_{n=1}^{N} Σ_{r=1}^{M_r} p_L,r^t

where M_r is the number of loads in region n and p_L,r^t is the power of load r in time period t.

The inequality constraints include the branch current upper limit constraint, the node voltage upper and lower limit constraint, the thermal power generating unit output power upper and lower limit constraint, the thermal power generating unit start-up and shutdown time constraint, the thermal power generating unit ramping constraint, the thermal power generating unit spinning reserve constraint and the renewable energy unit output power upper and lower limit constraint.

The branch current upper limit constraint is expressed as:

I_ij ≤ I_ij^max

where I_ij is the magnitude of the current flowing on branch ij and I_ij^max is the maximum current magnitude allowed on branch ij.

The node voltage upper and lower limit constraint is expressed as:

U_i^min ≤ U_i ≤ U_i^max

where U_i^max and U_i^min are the upper and lower voltage limits of node i.

The thermal power generating unit output power upper and lower limit constraint is expressed as:

u_i^t · P_G,i^min ≤ P_G,i^t ≤ u_i^t · P_G,i^max

where P_G,i^max and P_G,i^min are the upper and lower output power limits of thermal power generating unit i.

The thermal power generating unit start-up and shutdown time constraint is expressed as:

( T_i^on − T_i^on,min ) · ( u_i^{t−1} − u_i^t ) ≥ 0

( T_i^off − T_i^off,min ) · ( u_i^t − u_i^{t−1} ) ≥ 0

where T_i^on, T_i^off, T_i^on,min and T_i^off,min are the start-up time, the shutdown time, the minimum start-up time and the minimum shutdown time of thermal power generating unit i.

The thermal power generating unit ramping constraint is expressed as:

−D_n · Δt ≤ P_G,i^t − P_G,i^{t−1} ≤ U_t · Δt

where U_t and D_n are the maximum ramp-up and ramp-down rates of the active power of thermal power generating unit i per unit time.

The thermal power generating unit spinning reserve constraint requires that the total available capacity of the committed thermal power generating units covers, at each time t, the load reserve capacity R_r^t of every node r together with the output prediction error R_w of every renewable energy unit j.

The renewable energy unit output power upper and lower limit constraint is expressed as:

P_re,j^min ≤ P_re,j^t ≤ P_re,j^max

where P_re,j^max and P_re,j^min are the upper and lower output power limits of renewable energy unit j.
As a preferred embodiment of the multi-region power grid collaborative optimization method, in the step of designing the multi-region power grid collaborative optimization model containing renewable energy into a reinforcement learning model in a multi-agent environment according to a state space, an action space, an environment and a reward function, the state space variables include the load data p_L of each agent region, the actual output power of the renewable energy units P_re,j^t, the upper limit of the renewable energy output power P_re,j^t,max, the actual output power of the thermal power generating units P_G,i^t, the operating cost coefficient of the thermal power generating units c_o, the wind and solar curtailment cost coefficient c_g, the start-up cost coefficient SU_i, the shutdown cost coefficient SD_i and the node voltage U_i; the state space is expressed as:

s = { p_L, P_re,j^t, P_re,j^t,max, P_G,i^t, c_o, c_g, SU_i, SD_i, U_i }

The action space variables include the output power of the thermal power generating units P_G,i^t, the start-up/shutdown state of the thermal power generating units u_i^t and the renewable energy output power P_re,j^t; the action space is expressed as:

a = { P_G,i^t, u_i^t, P_re,j^t }

In the multi-region power grid collaborative optimization model containing renewable energy, the equality constraints and inequality constraints established for the objective function serve as the environment: at each moment, after each agent takes an action, one power flow calculation is performed, the relevant state quantities of the power grid are fed back for calculating the reward function, and the model transitions to the next moment, proceeding in this cycle.

In the multi-region power grid collaborative optimization model containing renewable energy, the opposite number of the objective function is used as the instant reward of each agent; according to the equality constraints and inequality constraints established for the objective function, if a corresponding variable does not satisfy a constraint, a penalty value r_push is set, and together with the instant reward it forms the final reward function of the agent, expressed as:

R = { F + r_push }
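By way of illustration only, the following minimal Python sketch shows one way such a reward could be evaluated; the function name, the penalty magnitude and the input structure are assumptions and are not specified by the patent.

```python
def compute_reward(thermal_cost, curtailment_penalty, violations, r_push=-100.0):
    """Reward of one agent at one time step.

    thermal_cost        : operating + start-up/shutdown cost of the region's thermal units
    curtailment_penalty : wind/solar curtailment penalty of the region's renewable units
    violations          : number of equality/inequality constraints violated by the
                          power-flow result (0 if the operating point is feasible)
    r_push              : penalty added when a constraint is violated (assumed value)
    """
    # The instant reward is the opposite number of the objective (cost) contribution.
    instant_reward = -(thermal_cost + curtailment_penalty)
    # The penalty term is applied only when the power-flow feedback reports a violation.
    penalty = r_push * violations
    return instant_reward + penalty
```

In this sketch the penalty scales with the number of violated constraints; the patent only specifies that a penalty value r_push is added when a constraint is not satisfied.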
As an optimal scheme of the multi-region power grid collaborative optimization method, in the step of solving the reinforcement learning model in the multi-agent environment, the MADDPG algorithm is adopted. In the MADDPG algorithm, the Actor of agent i acquires the state information s_i related to itself; a_i is the action taken by agent i; r_i is its reward; θ_i is its weight parameter. There are n agents in total, and the observation set x = (s_1, ..., s_N) is the state information of all agents. The Actor continuously updates its own parameter θ_i to maximize the expectation of its own reward, i.e. so that the Critic evaluation value becomes higher.

The policy update rule of the Actor is expressed as:

∇_θi J(μ_i) = E_{x,a∼D}[ ∇_θi μ_i(a_i|s_i) · ∇_ai Q_i^μ(x, a_1, …, a_N) |_{a_i=μ_i(s_i)} ]

where Q_i^μ(x, a_1, …, a_N) is the centralized state-action value function of each agent, which is learned and updated independently; D = (x, x′, a_1, …, a_N, r_1, …, r_N) is the experience replay unit storing the experiences of all agents, from which one group is randomly selected for training in each training step. For the continuous actions P_G,i^t and P_re,j^t, the MADDPG algorithm adopts the set μ of continuous deterministic policies of the n agents; for the 0-1 action u_i^t, random values are taken in the training stage.

The Critic updates its parameters by minimizing the temporal-difference error; the loss function of the Critic is:

L(θ_i) = E_{x,a,r,x′}[ ( Q_i^μ(x, a_1, …, a_N) − y )² ],  y = r_i + γ · Q_i^μ′(x′, a_1′, …, a_N′) |_{a_j′=μ_j′(s_j)}

where μ′ is the policy set of the target network and γ ∈ [0, 1] is the discount factor.

The target network periodically copies the parameters from the evaluation network according to the rule:

θ′_i = (1 − τ) θ′_i + τ θ_i

where θ′_i is the target network parameter of agent i, τ is the soft update coefficient, and τ ≤ 1.

Let the policy μ_i of agent i have a set of K sub-policies, with only one sub-policy μ_i^(k) used in each training round, so that the overall reward of the policy set is the highest over the whole training process; the final policy update of the Actor is:

∇_θi J_e(μ_i) = (1/K) · Σ_{k=1}^{K} E_{x,a∼D_i^(k)}[ ∇_θi μ_i^(k)(a_i|s_i) · ∇_ai Q_i^μi(x, a_1, …, a_N) |_{a_i=μ_i^(k)(s_i)} ]
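For concreteness, the sketch below shows how the Critic loss, the Actor update and the soft target update described above might be written in PyTorch; the network containers, tensor shapes and hyper-parameter values are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def update_agent(i, batch, actors, critics, target_actors, target_critics,
                 actor_opts, critic_opts, gamma=0.95, tau=0.01):
    """One MADDPG update for agent i (illustrative helper, not the patented code).

    batch: tuple (x, actions, rewards, x_next) of tensors where
      x, x_next : [B, n_agents, obs_dim]   joint observations
      actions   : [B, n_agents, act_dim]   joint actions
      rewards   : [B, n_agents]            per-agent rewards
    """
    x, actions, rewards, x_next = batch
    B, n_agents, _ = x.shape

    # --- Critic update: minimize the temporal-difference error ---
    with torch.no_grad():
        next_actions = torch.stack(
            [target_actors[j](x_next[:, j]) for j in range(n_agents)], dim=1)
        q_next = target_critics[i](x_next.reshape(B, -1),
                                   next_actions.reshape(B, -1))
        y = rewards[:, i:i + 1] + gamma * q_next          # TD target
    q = critics[i](x.reshape(B, -1), actions.reshape(B, -1))
    critic_loss = F.mse_loss(q, y)
    critic_opts[i].zero_grad(); critic_loss.backward(); critic_opts[i].step()

    # --- Actor update: ascend the centralized Q value w.r.t. agent i's own action ---
    new_actions = actions.clone()
    new_actions[:, i] = actors[i](x[:, i])                # only agent i's action is replaced
    actor_loss = -critics[i](x.reshape(B, -1), new_actions.reshape(B, -1)).mean()
    actor_opts[i].zero_grad(); actor_loss.backward(); actor_opts[i].step()

    # --- Soft update of the target networks: θ' = (1 - τ) θ' + τ θ ---
    for target, online in ((target_critics[i], critics[i]), (target_actors[i], actors[i])):
        for p_t, p in zip(target.parameters(), online.parameters()):
            p_t.data.mul_(1.0 - tau).add_(tau * p.data)
```

The centralized Critic consumes the joint observations and actions of all agents, while each Actor only sees its own region's observation, matching the update rules written above.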
As an optimal scheme of the multi-region power grid collaborative optimization method, the step of solving the reinforcement learning model in the multi-agent environment by the MADDPG algorithm specifically comprises the following steps:

setting the optimized scheduling time period T and the number of training rounds M of each agent, initializing both counters to 1, and randomly initializing the agent network parameters θ_i at the initial time;

loading the multi-region power grid collaborative optimization model into the environment, and setting the interface files for states and actions in the MADDPG algorithm, so that the power flow calculation can be carried out in real time according to the states and actions and the corresponding environment state quantities can be fed back;

each agent observes the state quantities of its own region and takes an action, interacts with the environment, and the fed-back state quantities are used to calculate the reward; the agent performs a state transition to the next moment, observes the state of the next moment, and stores (x, a, r, x′) in the experience replay unit D;

randomly sampling a group of k samples (x^(k), a^(k), r^(k), x′^(k)) from the experience replay unit D, updating the Critic and Actor parameters, and updating the target network parameters;

judging whether the current number of training rounds m has reached the set value M: if so, ending the training and outputting and saving the result; if not, starting a new round of training.
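A condensed, illustrative view of the training flow described in these steps is sketched below; the environment wrapper, its method names and the buffer interface are assumptions rather than part of the patent.

```python
def train_maddpg(env, agents, buffer, T=24, M=2000, batch_size=256):
    """Outer training loop following the steps above (illustrative sketch).

    env    : wraps the multi-region grid model; env.reset()/env.step(actions)
             run a power-flow calculation and return per-agent observations and
             rewards with constraint-violation feedback (assumed interface)
    agents : list of regional agents, each with .act(obs) and .update(batch)
    buffer : experience replay unit D storing (x, a, r, x') tuples
    """
    for m in range(1, M + 1):                          # training rounds
        obs = env.reset()                              # local observations of every region
        for t in range(T):                             # optimized scheduling periods
            actions = [agent.act(o) for agent, o in zip(agents, obs)]
            next_obs, rewards, _ = env.step(actions)   # power flow + reward feedback
            buffer.add(obs, actions, rewards, next_obs)
            obs = next_obs
            if len(buffer) >= batch_size:
                batch = buffer.sample(batch_size)      # random minibatch from D
                for agent in agents:
                    agent.update(batch)                # critic/actor + target networks
    return agents
```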
In a second aspect, a multi-region grid collaborative optimization system is provided, including:
the observation data collection module is used for collecting observation data of each intelligent agent region in the multi-region power grid to be optimized;
the collaborative optimization model building module is used for building a multi-region power grid collaborative optimization model containing renewable energy sources on the basis of the observation data;
the reinforcement learning model design module is used for designing the multi-region power grid collaborative optimization model containing the renewable energy into a reinforcement learning model under a multi-agent environment according to a state space, an action space, an environment and a reward function;
and the model solving module is used for solving the reinforcement learning model in the multi-agent environment and outputting a collaborative optimization result to carry out collaborative optimization on the multi-area power grid.
As a preferred scheme of the multi-region power grid collaborative optimization system, the system further comprises an agent partitioning module, which is used for partitioning the power grid into agents;
when the power grid is partitioned into agents, the node standard system is divided into different regions, a different agent is set for each region, and the regional agents serve as decision centers, thereby realizing the collaborative optimization operation of the multi-region power grid.
As a preferable solution of the multi-region grid collaborative optimization system, the observation data collected by the observation data collection module for each agent region include: the load data p_L, the actual output power of the renewable energy units P_re, the actual output power of the thermal power generating units P_G, the operating cost coefficient of the thermal power generating units c_o, the wind and solar curtailment cost coefficient c_g, the start-up cost coefficient SU_i and the shutdown cost coefficient SD_i.
As an optimal scheme of the multi-region grid collaborative optimization system, the collaborative optimization model building module establishes the following objective function:

min F = Σ_{n=1}^{N} ( C_G^n + C_W^n )

where N is the number of divided regions, C_G^n is the cost of the thermal power generating units in region n, and C_W^n is the wind and solar curtailment penalty of the renewable energy units in region n. C_G^n and C_W^n are respectively:

C_G^n = Σ_{i=1}^{M_k} Σ_{t=1}^{T} [ u_i^t · c_o · P_G,i^t · Δt + C_i^{SU,t} + C_i^{SD,t} ]

where M_k is the number of thermal power generating units in region n; T is the calculation duration; u_i^t is the operating state of thermal power generating unit i in time period t, u_i^t = 1 indicating that unit i is running and u_i^t = 0 indicating that unit i is shut down; c_o is the operating cost coefficient of the thermal power generating units; P_G,i^t is the power output by thermal power generating unit i in time period t; Δt is the operating time interval; C_i^{SU,t} and C_i^{SD,t} are the start-up and shutdown costs incurred by unit i in time period t, determined by the cost SU_i of one start-up of thermal power generating unit i, the shutdown cost SD_i of thermal power generating unit i and the shutdown time T_i^off of thermal power generating unit i;

C_W^n = Σ_{j=1}^{M_n} Σ_{t=1}^{T} c_g · ( P_re,j^t,max − P_re,j^t ) · Δt

where M_n is the number of renewable energy units in region n; T is the calculation duration; c_g is the wind and solar curtailment penalty coefficient; P_re,j^t is the actual output power of renewable energy unit j in time period t; P_re,j^t,max is the upper limit of the output power of renewable energy unit j in time period t.
As a preferred scheme of the multi-region power grid collaborative optimization system, the collaborative optimization model building module establishes equality constraints and inequality constraints for the objective function.

The equality constraints include the network power flow constraints and the power balance constraint.

The network power flow constraints are expressed as:

P_G,i − P_D,i = U_i · Σ_{j=1}^{N_e} U_j ( G_ij cos θ_ij + B_ij sin θ_ij )

Q_G,i − Q_D,i = U_i · Σ_{j=1}^{N_e} U_j ( G_ij sin θ_ij − B_ij cos θ_ij )

where P_G,i and Q_G,i are the active and reactive power injected at node i; P_D,i and Q_D,i are the active and reactive power consumed by the load at node i; N_e is the number of nodes of the distribution network; U_i and U_j are the voltage amplitudes of node i and node j; G_ij and B_ij are the conductance and susceptance, respectively; θ_ij is the phase angle difference between node i and node j.

The power balance constraint is expressed as:

Σ_{n=1}^{N} ( Σ_{i=1}^{M_k} P_G,i^t + Σ_{j=1}^{M_n} P_re,j^t ) = Σ_{n=1}^{N} Σ_{r=1}^{M_r} p_L,r^t

where M_r is the number of loads in region n and p_L,r^t is the power of load r in time period t.

The inequality constraints include the branch current upper limit constraint, the node voltage upper and lower limit constraint, the thermal power generating unit output power upper and lower limit constraint, the thermal power generating unit start-up and shutdown time constraint, the thermal power generating unit ramping constraint, the thermal power generating unit spinning reserve constraint and the renewable energy unit output power upper and lower limit constraint.

The branch current upper limit constraint is expressed as:

I_ij ≤ I_ij^max

where I_ij is the magnitude of the current flowing on branch ij and I_ij^max is the maximum current magnitude allowed on branch ij.

The node voltage upper and lower limit constraint is expressed as:

U_i^min ≤ U_i ≤ U_i^max

where U_i^max and U_i^min are the upper and lower voltage limits of node i.

The thermal power generating unit output power upper and lower limit constraint is expressed as:

u_i^t · P_G,i^min ≤ P_G,i^t ≤ u_i^t · P_G,i^max

where P_G,i^max and P_G,i^min are the upper and lower output power limits of thermal power generating unit i.

The thermal power generating unit start-up and shutdown time constraint is expressed as:

( T_i^on − T_i^on,min ) · ( u_i^{t−1} − u_i^t ) ≥ 0

( T_i^off − T_i^off,min ) · ( u_i^t − u_i^{t−1} ) ≥ 0

where T_i^on, T_i^off, T_i^on,min and T_i^off,min are the start-up time, the shutdown time, the minimum start-up time and the minimum shutdown time of thermal power generating unit i.

The thermal power generating unit ramping constraint is expressed as:

−D_n · Δt ≤ P_G,i^t − P_G,i^{t−1} ≤ U_t · Δt

where U_t and D_n are the maximum ramp-up and ramp-down rates of the active power of thermal power generating unit i per unit time.

The thermal power generating unit spinning reserve constraint requires that the total available capacity of the committed thermal power generating units covers, at each time t, the load reserve capacity R_r^t of every node r together with the output prediction error R_w of every renewable energy unit j.

The renewable energy unit output power upper and lower limit constraint is expressed as:

P_re,j^min ≤ P_re,j^t ≤ P_re,j^max

where P_re,j^max and P_re,j^min are the upper and lower output power limits of renewable energy unit j.
As a preferable solution of the multi-region power grid collaborative optimization system, the reinforcement learning model design module is configured to design the multi-region power grid collaborative optimization model containing renewable energy into a reinforcement learning model in a multi-agent environment according to a state space, an action space, an environment and a reward function, where the state space variables include the load data p_L of each agent region, the actual output power of the renewable energy units P_re,j^t, the upper limit of the renewable energy output power P_re,j^t,max, the actual output power of the thermal power generating units P_G,i^t, the operating cost coefficient of the thermal power generating units c_o, the wind and solar curtailment cost coefficient c_g, the start-up cost coefficient SU_i, the shutdown cost coefficient SD_i and the node voltage U_i; the state space is expressed as:

s = { p_L, P_re,j^t, P_re,j^t,max, P_G,i^t, c_o, c_g, SU_i, SD_i, U_i }

The action space variables include the output power of the thermal power generating units P_G,i^t, the start-up/shutdown state of the thermal power generating units u_i^t and the renewable energy output power P_re,j^t; the action space is expressed as:

a = { P_G,i^t, u_i^t, P_re,j^t }

In the multi-region power grid collaborative optimization model containing renewable energy, the equality constraints and inequality constraints established for the objective function serve as the environment: at each moment, after each agent takes an action, one power flow calculation is performed, the relevant state quantities of the power grid are fed back for calculating the reward function, and the model transitions to the next moment, proceeding in this cycle.

In the multi-region power grid collaborative optimization model containing renewable energy, the opposite number of the objective function is used as the instant reward of each agent; according to the equality constraints and inequality constraints established for the objective function, if a corresponding variable does not satisfy a constraint, a penalty value r_push is set, and together with the instant reward it forms the final reward function of the agent, expressed as:

R = { F + r_push }
As an optimal solution of the multi-region power grid collaborative optimization system, the model solving module adopts the MADDPG algorithm to solve the reinforcement learning model in the multi-agent environment. In the MADDPG algorithm, the Actor of agent i acquires the state information s_i related to itself; a_i is the action taken by agent i; r_i is its reward; θ_i is the weight parameter of the agent. There are n agents in total, and the observation set x = (s_1, ..., s_N) is the state information of all agents. The Actor continuously updates its own parameter θ_i to maximize the expectation of its own reward, i.e. so that the Critic evaluation value becomes higher.

The policy update rule of the Actor is expressed as:

∇_θi J(μ_i) = E_{x,a∼D}[ ∇_θi μ_i(a_i|s_i) · ∇_ai Q_i^μ(x, a_1, …, a_N) |_{a_i=μ_i(s_i)} ]

where Q_i^μ(x, a_1, …, a_N) is the centralized state-action value function of each agent, which is learned and updated independently; D = (x, x′, a_1, …, a_N, r_1, …, r_N) is the experience replay unit storing the experiences of all agents, from which one group is randomly selected for training in each training step. For the continuous actions P_G,i^t and P_re,j^t, the MADDPG algorithm adopts the set μ of continuous deterministic policies of the n agents; for the 0-1 action u_i^t, random values are taken in the training stage.

The Critic updates its parameters by minimizing the temporal-difference error; the loss function of the Critic is:

L(θ_i) = E_{x,a,r,x′}[ ( Q_i^μ(x, a_1, …, a_N) − y )² ],  y = r_i + γ · Q_i^μ′(x′, a_1′, …, a_N′) |_{a_j′=μ_j′(s_j)}

where μ′ is the policy set of the target network and γ ∈ [0, 1] is the discount factor.

The target network periodically copies the parameters from the evaluation network according to the rule:

θ′_i = (1 − τ) θ′_i + τ θ_i

where θ′_i is the target network parameter of agent i, τ is the soft update coefficient, and τ ≤ 1.

Let the policy μ_i of agent i have a set of K sub-policies, with only one sub-policy μ_i^(k) used in each training round, so that the overall reward of the policy set is the highest over the whole training process; the final policy update of the Actor is:

∇_θi J_e(μ_i) = (1/K) · Σ_{k=1}^{K} E_{x,a∼D_i^(k)}[ ∇_θi μ_i^(k)(a_i|s_i) · ∇_ai Q_i^μi(x, a_1, …, a_N) |_{a_i=μ_i^(k)(s_i)} ]
As a preferred embodiment of the multi-region power grid collaborative optimization system, the step of solving the reinforcement learning model in the multi-agent environment by the model solving module using the MADDPG algorithm specifically comprises:

setting the optimized scheduling time period T and the number of training rounds M of each agent, initializing both counters to 1, and randomly initializing the agent network parameters θ_i at the initial time;

loading the multi-region power grid collaborative optimization model into the environment, and setting the interface files for states and actions in the MADDPG algorithm, so that the power flow calculation can be carried out in real time according to the states and actions and the corresponding environment state quantities can be fed back;

each agent observes the state quantities of its own region and takes an action, interacts with the environment, and the fed-back state quantities are used to calculate the reward; the agent performs a state transition to the next moment, observes the state of the next moment, and stores (x, a, r, x′) in the experience replay unit D;

randomly sampling a group of k samples (x^(k), a^(k), r^(k), x′^(k)) from the experience replay unit D, updating the Critic and Actor parameters, and updating the target network parameters;

judging whether the current number of training rounds m has reached the set value M: if so, ending the training and outputting and saving the result; if not, starting a new round of training.
In a third aspect, an electronic device is provided, including:
a memory storing at least one instruction; and
and the processor executes the instructions stored in the memory to realize the multi-region power grid collaborative optimization method.
In a fourth aspect, a computer-readable storage medium is provided, where a computer program is stored, and the computer program is executed by a processor to implement the multi-region grid collaborative optimization method.
Compared with the prior art, the first aspect of the invention has at least the following beneficial effects:
the invention relates to a multi-region power grid collaborative optimization method based on multi-agent deep reinforcement learning, which mainly aims at solving the problems of complex modeling, higher optimization solving difficulty and the like caused by the characteristics of high-dimensional non-convex nonlinearity presented by the multi-region collaborative optimization problem in the existing power grid, simultaneously considers uncertain factors such as renewable energy and the like, constructs a multi-region power grid collaborative optimization model containing high-proportion renewable energy, adopts a multi-agent deep certainty strategy gradient method, designs the optimization problem into a distributed optimization problem under a multi-agent reinforcement learning environment, and can effectively solve the high-dimensional non-convex nonlinear multi-region power grid collaborative optimization problem. Compared with the prior power grid optimal scheduling method, the method has the advantages that: (1) The distributed model with multiple decision centers is adopted to reduce the communication pressure of the system, and simultaneously, the result which is nearly consistent with centralized optimization can be achieved. (2) The optimization method provided by the method can consider double uncertainty of renewable energy sources and loads in training, and according to the self-adaptability of the algorithm, compared with the traditional iterative solution method, the trained model has better capability of coping with uncertainty, can realize real-time decision of the system, and is favorable for online application. (3) Based on the characteristic of 'centralized training-distributed execution' of multiple intelligent agents, each regional power grid makes a decision on the controllable unit only according to local observation values of each regional power grid after training is completed, and privacy of each regional power grid is protected.
It is understood that the beneficial effects of the second to fourth aspects can be seen from the description of the first aspect, and are not described herein again.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments or of the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art other drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a power grid topology structure for a multi-region power grid collaborative optimization method based on multi-agent deep reinforcement learning;
FIG. 2 is a flowchart of a multi-region power grid collaborative optimization method based on multi-agent deep reinforcement learning according to an embodiment of the present invention;
FIG. 3 is a flowchart of the overall algorithm solution of the IES distributed optimization model based on MADDPG according to the embodiment of the present invention;
FIG. 4 is a block diagram of a multi-region power grid collaborative optimization system based on multi-agent deep reinforcement learning according to an embodiment of the present invention.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
Furthermore, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used for distinguishing between descriptions and not necessarily for describing or implying relative importance.
Aiming at the problem that existing solution methods for the collaborative optimization of multi-region power grids containing a high proportion of renewable energy are not economical, accurate and reliable enough, a multi-region power grid collaborative optimization method based on multi-agent deep reinforcement learning is provided. Compared with centralized optimization methods and traditional distributed optimization methods, the method provided by the invention: (1) adopts a multi-region power grid collaborative optimization model with multiple decision centers, which can fully consider the complex constraint conditions inside each regional power grid, reduce the communication pressure of the system, reduce the decision dimension and establish a high-precision regional power grid optimization model; (2) based on multi-agent deep reinforcement learning, designs the optimization problem as a distributed optimization problem in a multi-agent reinforcement learning environment, can account for the uncertainty of renewable energy and load under fluctuating scenarios during training, can realize real-time decision-making of the system and is therefore suitable for online application; (3) based on the "centralized training, distributed execution" characteristic of multiple agents, after training is completed each regional power grid only needs to make decisions on its controllable units according to its own local observations, which protects the privacy of each regional power grid.
The invention relates to a regional power grid collaborative optimization method based on multi-agent deep reinforcement learning. First, the observation data of each agent region are collected, mainly including renewable energy unit output data, thermal power generating unit output data and load data. On this basis, a multi-region power grid collaborative optimization model oriented to a high proportion of renewable energy is constructed, with the economy of all regional power grids in the system as the optimization objective and the safe and stable operation of the system as constraints. The Multi-Agent Deep Deterministic Policy Gradient method (MADDPG) is adopted, and the model is designed as a reinforcement learning model in a multi-agent environment according to a state space, an action space, an environment and a reward function. Finally, the model is solved by programming in simulation software, and the results are compared with other methods, verifying that the trained model not only copes better with uncertainty but also realizes real-time decision-making of the system, and also converges faster and trains better during the training process, showing clear advantages in coping with complex environment problems.
The invention provides a multi-region power grid collaborative optimization method which mainly comprises the following steps:
1. Partitioning the power grid into agents;
Taking the IEEE 39-node standard system as an example, the following criteria are used as the basis for the division: 1) the connections between regions are single loops as far as possible; 2) the structure between regions is clear and the power flow direction is single; 3) the power grid inside each group is basically unconstrained and strongly structured. The IEEE 39-node standard system is divided into different regions, different agents are set according to the regions, and the regional agents serve as decision centers, realizing the collaborative optimization operation of the multi-region power grid; the region division is shown in figure 1.
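Purely for illustration, such a partition can be represented as a mapping from regions to buses; the bus-to-region assignment below is hypothetical and does not reproduce the division shown in figure 1.

```python
# Hypothetical region-to-bus assignment for an IEEE 39-bus system;
# the actual partition used by the invention is the one shown in figure 1.
REGIONS = {
    "agent_1": {"buses": [1, 2, 3, 17, 18, 25, 26, 27, 28, 29, 30, 37, 38]},
    "agent_2": {"buses": [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 31, 32, 39]},
    "agent_3": {"buses": [15, 16, 19, 20, 21, 22, 23, 24, 33, 34, 35, 36]},
}

def region_of(bus: int) -> str:
    """Return the agent (decision center) responsible for a given bus."""
    for name, region in REGIONS.items():
        if bus in region["buses"]:
            return name
    raise ValueError(f"bus {bus} is not assigned to any region")
```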
2. The multi-region power grid collaborative optimization method based on multi-agent deep reinforcement learning comprises the following steps:
collecting observation data of each agent region; the observation data of each agent involved in the invention include: load data, the actual output power of renewable energy, the actual output power of the thermal power generating units, and the cost coefficients, comprising the operating cost coefficient of the thermal power generating units, the wind and solar curtailment cost coefficient, the start-up cost coefficient and the shutdown cost coefficient. Each regional agent meets the load demand inside its region and maximizes the consumption of renewable energy on the one hand through its generating units (mainly thermal power generating units and renewable energy units) and on the other hand through energy interaction between regions, realizing a certain degree of distribution autonomy for each region. The model is as follows:
min F = Σ_{n=1}^{N} ( C_G^n + C_W^n )

The overall economy of all regional power grids in the system is taken as the optimization objective, and the cost of each regional power grid is set as the cost of its thermal power generating units plus the wind and solar curtailment penalty of its renewable energy. N is the number of divided regions, C_G^n is the cost of the thermal power generating units in region n, and C_W^n is the wind and solar curtailment penalty of the renewable energy units in region n.
Determining the constraint conditions of the optimization model according to the constructed multi-region power grid collaborative optimization model containing the high-proportion renewable energy sources:
The equality constraints of the system mainly comprise the network power flow constraints and the power balance constraint. The inequality constraints of the system mainly aim at ensuring the safe and stable operation of the system, and mainly comprise the branch current upper limit constraint, the node voltage upper and lower limit constraint, the thermal power generating unit output power upper and lower limit constraint, the thermal power generating unit start-up and shutdown time constraint, the thermal power generating unit ramping constraint, the thermal power generating unit spinning reserve constraint and the renewable energy unit output power upper and lower limit constraint.
Based on basic elements of multi-agent deep reinforcement learning, a regional power grid collaborative optimization model containing high-proportion renewable energy is designed into a reinforcement learning model under a multi-agent environment according to a state space, an action space, an environment and a reward function, and a MADDPG method is adopted for solving.
The solution is implemented with computer software. Compared with the centralized optimization solution, it is verified that the trained model has a better capability of coping with uncertainty and can realize real-time decision-making of the system, which is favourable for online application; compared with single-agent deep reinforcement learning methods, the MADDPG method converges faster, trains better and shows clear advantages in coping with complex environment problems.
3. The MADDPG method adopts a "centralized training, distributed execution" mode, and each agent has independent Actor and Critic networks. Unlike DDPG, during training the Actor of each agent takes actions according to its own state, the Critic evaluates the Actor's actions, the Actor updates its policy according to the feedback, and the Critic of each agent obtains more accurate evaluation information by estimating the policies of the other agents. After training is completed, each agent only needs to use its Actor to take actions according to its own state; at this point the information of other agents does not need to be acquired, and each agent completes its decisions independently. MADDPG obtains the optimal policy through centralized training and learning, needs only local information when applied, and can effectively protect the private information of each agent.
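To make the "centralized training, distributed execution" property concrete, the illustrative sketch below contrasts the information each network consumes; the class definitions and layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Decentralized policy: sees only its own region's observation."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh())   # actions scaled to [-1, 1]

    def forward(self, local_obs: torch.Tensor) -> torch.Tensor:
        return self.net(local_obs)

class Critic(nn.Module):
    """Centralized value function: used only during training, sees the joint
    observations and joint actions of all regional agents."""
    def __init__(self, joint_obs_dim: int, joint_act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(joint_obs_dim + joint_act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, joint_obs: torch.Tensor, joint_act: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([joint_obs, joint_act], dim=-1))

# After training, a regional grid dispatches its controllable units from local data alone:
# action = actor(local_obs)   -- no other region's information is required.
```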
4. In the multi-region power grid collaborative optimization method based on multi-agent deep reinforcement learning, the double uncertainty of renewable energy and load is considered in the model training process: uncertainty is superimposed on the data as they are input during training. That is, during MADDPG training, each regional power grid agent obtains the observable state data within its own region, and at the moment the renewable energy and load data are read, an unknown perturbation within a fluctuation range is superimposed on the obtained data so that the data follow a normal distribution. Therefore, no matter how many training rounds are run, the renewable energy and load data obtained each time are different yet remain within the fluctuation range, and the size of the fluctuation range can be flexibly adjusted.
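A minimal sketch of this superposition of uncertainty is given below; the relative standard deviations and the example profiles are assumed values chosen for illustration, not parameters specified by the invention.

```python
import numpy as np

def perturb(profile: np.ndarray, rel_std: float, rng: np.random.Generator) -> np.ndarray:
    """Superimpose normally distributed fluctuations on a forecast profile.

    profile : forecast load or renewable output over the scheduling horizon
    rel_std : relative standard deviation of the fluctuation range (assumed value)
    """
    noise = rng.normal(loc=0.0, scale=rel_std * np.abs(profile))
    return np.clip(profile + noise, a_min=0.0, a_max=None)   # keep power non-negative

rng = np.random.default_rng(seed=0)
load_forecast = np.array([310.0, 295.0, 280.0, 300.0])        # MW, hypothetical values
wind_forecast = np.array([120.0, 140.0, 150.0, 135.0])        # MW, hypothetical values

# Each training round reads a freshly perturbed realisation of the same profiles.
load_sample = perturb(load_forecast, rel_std=0.05, rng=rng)
wind_sample = perturb(wind_forecast, rel_std=0.15, rng=rng)
```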
Example 1
Referring to fig. 2, the multi-region power grid collaborative optimization method based on multi-agent deep reinforcement learning in the embodiment of the present invention includes the following steps:
S01, collecting observation data of each agent region, where the observation data of each agent comprise load data, the actual output power of renewable energy, the actual output power of the thermal power generating units, and cost coefficients including the operating cost coefficient of the thermal power generating units, the wind and solar curtailment cost coefficient, the start-up cost coefficient and the shutdown cost coefficient;
S02, constructing a multi-region power grid collaborative optimization model containing a high proportion of renewable energy, taking the overall economy of all regional power grids in the system as the optimization objective and setting the cost of each regional power grid as the cost of its thermal power generating units plus the wind and solar curtailment penalty of its renewable energy, where the equality constraints of the system mainly comprise the network power flow constraints and the power balance constraint, and the inequality constraints mainly aim at ensuring the safe and stable operation of the system and mainly comprise upper limit constraints, upper and lower limit constraints, ramping constraints and spinning reserve constraints on the variables;
S03, based on the basic elements of multi-agent deep reinforcement learning, designing the multi-region power grid collaborative optimization model containing a high proportion of renewable energy into a reinforcement learning model in a multi-agent environment according to a state space, an action space, an environment and a reward function;
S04, solving the constructed multi-region power grid collaborative optimization model based on multi-agent deep reinforcement learning by the MADDPG method, and giving the solution flow;
S05, solving the model with computer software and, by comparison with the centralized optimization solution, verifying that the trained model not only copes better with uncertainty but also realizes real-time decision-making of the system, which is favourable for online application; compared with single-agent deep reinforcement learning methods, the MADDPG method converges faster, trains better and shows clear advantages in coping with complex environment problems.
In one possible implementation, the observation data of each agent region are first collected. The observation data of each agent involved in the example include the load data p_L, the actual output power of the renewable energy units P_re, the actual output power of the thermal power generating units P_G, the operating cost coefficient of the thermal power generating units c_o, the wind and solar curtailment cost coefficient c_g, the start-up cost coefficient SU_i and the shutdown cost coefficient SD_i.
A multi-region power grid collaborative optimization model containing a high proportion of renewable energy is then constructed, comprising the following steps:
Step 2.1: establishing the objective function.
Taking the IEEE 39-node standard system as an example, the system is divided into 3 regional power grids, the overall economy of all regions in the system is taken as the optimization target, and the cost of each regional power grid is set as the thermal power generating unit cost plus the renewable energy wind and light curtailment penalty. The objective function of the system is

min Σ_{n=1}^{N} ( C_n^G + C_n^re )

where N is the number of divided regions, C_n^G is the thermal power generating unit cost of region n, and C_n^re is the wind and light curtailment penalty of the renewable energy units of region n.
The thermal power generating unit cost C_n^G of region n accumulates, over the M_k thermal power generating units in region n and the T calculation periods, the running cost, the start-up cost and the shutdown cost of each unit. In this term, M_k is the number of thermal power generating units in region n; T is the calculation time length; u_{i,t} is the operating state of thermal power generating unit i in period t, u_{i,t} = 1 indicating that unit i is running and u_{i,t} = 0 indicating that it is shut down; c_o is the operating cost coefficient; P_{i,t}^G is the output power of thermal power generating unit i in period t; Δt is the operating period interval; SU_i is the cost of starting thermal power generating unit i once; SD_i is the shutdown cost of thermal power generating unit i; and T_i^off is the shutdown time of thermal power generating unit i.
The wind and light curtailment penalty of region n is

C_n^re = Σ_{j=1}^{M_n} Σ_{t=1}^{T} c_g ( P_{j,t}^re,max − P_{j,t}^re ) Δt

where M_n is the number of renewable energy units in region n; T is the calculation time length; c_g is the wind and light curtailment penalty coefficient; P_{j,t}^re is the actual output power of renewable energy unit j in period t; and P_{j,t}^re,max is the upper limit of the output power of renewable energy unit j in period t.
Step 2.2: and establishing constraint conditions of the optimization model.
The constraint conditions of the multi-region power grid collaborative optimization model containing the high-proportion renewable energy comprise equality constraint conditions and inequality constraint conditions.
1. The equation constrains:
the equality constraint conditions of the system mainly comprise network power flow constraint and power balance constraint.
The network flow constraints are as follows:
P_{G,i} − P_{D,i} = U_i Σ_{j=1}^{N_e} U_j ( G_{ij} cos θ_{ij} + B_{ij} sin θ_{ij} )
Q_{G,i} − Q_{D,i} = U_i Σ_{j=1}^{N_e} U_j ( G_{ij} sin θ_{ij} − B_{ij} cos θ_{ij} )

where P_{G,i}, Q_{G,i} are the active and reactive power injected at node i; P_{D,i}, Q_{D,i} are the active and reactive power consumed by the load at node i; N_e is the number of nodes of the power distribution network; U_i, U_j are the voltage amplitudes of node i and node j; G_{ij}, B_{ij} are the conductance and the susceptance; and θ_{ij} is the phase angle difference between node i and node j.
the power balance constraints are as follows:
Σ_{j=1}^{M_n} P_{j,t}^re + Σ_{i=1}^{M_k} P_{i,t}^G = Σ_{r=1}^{M_r} P_{r,t}^L

where M_r is the number of loads in region n and P_{r,t}^L is the power of load r in period t. The formula expresses that the injected power of the whole system, namely the sum of all renewable energy output power and all thermal power generating unit output power, must cover the power consumed by all loads.
2. Inequality constraints:
The inequality constraints of the system mainly aim at ensuring the safe and stable operation of the system and at meeting the actual operating conditions of the units. They comprise the branch current upper limit constraint; the node voltage upper and lower limit constraint; the thermal power generating unit output power upper and lower limit constraint, start-up and shutdown time constraint, ramping constraint and spinning reserve constraint; and the renewable energy unit output power upper and lower limit constraint.
The branch current upper limit constraint is as follows:
I_{ij} ≤ I_{ij}^max

where I_{ij} is the amplitude of the current flowing on branch ij and I_{ij}^max is the maximum current amplitude allowed to flow on branch ij.
The node voltage upper and lower limit constraint is as follows:

U_i^min ≤ U_i ≤ U_i^max

where U_i^max and U_i^min are the upper and lower limits of the voltage at node i.
The thermal power generating unit output power upper and lower limit constraint is:

P_i^G,min ≤ P_{i,t}^G ≤ P_i^G,max

where P_i^G,max and P_i^G,min are the upper and lower limits of the output power of thermal power generating unit i.
The thermal power generating unit start-up and shutdown time constraint requires that the start-up time and the shutdown time of thermal power generating unit i are no less than its minimum start-up time and minimum shutdown time, respectively.
The thermal power generating unit ramping constraint is:

−D_n Δt ≤ P_{i,t}^G − P_{i,t−1}^G ≤ U_t Δt

where U_t and D_n are the maximum ramp-up and ramp-down rates of the active power of thermal power generating unit i per unit time.
The thermal power generating unit spinning reserve constraint consists of two expressions. Here the load reserve capacity of node r at time t is typically taken as 5% of the total load, and R_w is the output forecast error of renewable energy unit j. The first expression requires that the sum of the output power upper limits of all units is larger than the sum of all loads plus the maximum positive error; the second expression requires that the sum of the output power lower limits of all units is smaller than all loads minus the maximum negative error.
The renewable energy unit output power upper and lower limit constraint is:

P_{j,t}^re,min ≤ P_{j,t}^re ≤ P_{j,t}^re,max

where P_{j,t}^re,max and P_{j,t}^re,min are the upper and lower limits of the output power of renewable energy unit j.
Then, based on the basic elements of multi-agent deep reinforcement learning, the multi-region collaborative optimization model containing a high proportion of renewable energy is designed into a reinforcement learning model in a multi-agent environment according to the state space, the action space, the environment and the reward function:
1. State space
The state space of the system mainly comprises, for each agent region, the load data p_L, the actual output power P_{j,t}^re of the renewable energy units, the renewable energy output power upper limit P_{j,t}^re,max, the actual output power P_{i,t}^G of the thermal power generating units, the cost coefficients, namely the thermal power generating unit operating cost coefficient c_o, the wind and light curtailment penalty coefficient c_g, the start-up cost coefficient SU_i and the shutdown cost coefficient SD_i, and the node voltages U_i, i.e.

S = { p_L, P_{j,t}^re, P_{j,t}^re,max, P_{i,t}^G, c_o, c_g, SU_i, SD_i, U_i }
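A minimal sketch of assembling these quantities into one agent's observation vector; the names and the ordering are illustrative assumptions, as the patent only lists the quantities:

```python
import numpy as np

def build_state(load, p_re, p_re_max, p_g, c_o, c_g, su, sd, u_node):
    """Concatenate one agent's regional observations into a flat state vector."""
    return np.concatenate([np.atleast_1d(x).astype(float)
                           for x in (load, p_re, p_re_max, p_g,
                                     [c_o], [c_g], su, sd, u_node)])

s_t = build_state(load=[150.0, 140.0], p_re=[38.0, 52.0], p_re_max=[45.0, 60.0],
                  p_g=[110.0, 95.0], c_o=0.3, c_g=0.8,
                  su=[500.0, 450.0], sd=[200.0, 180.0], u_node=[1.01, 0.99])
```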
2. Action space
The action space variables correspond to the control variables of the system under study. Each regional power grid in the system acts as an agent, and according to its constraints the action space variables comprise the thermal power generating unit output power P_{i,t}^G, the thermal power generating unit start-stop state u_{i,t} and the renewable energy output power P_{j,t}^re, i.e.

A = { P_{i,t}^G, u_{i,t}, P_{j,t}^re }
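A minimal sketch of decoding a raw actor output into these three kinds of control variables; the [-1, 1] scaling and the thresholding of the 0-1 start-stop decision are illustrative assumptions, not specified in the patent:

```python
import numpy as np

def decode_action(raw, p_g_min, p_g_max, p_re_max):
    """Map a raw actor output in [-1, 1] to physical set-points."""
    raw = np.asarray(raw, dtype=float)
    p_g_min, p_g_max = np.asarray(p_g_min, float), np.asarray(p_g_max, float)
    p_re_max = np.asarray(p_re_max, float)
    n_g = p_g_min.size
    p_g = p_g_min + (raw[:n_g] + 1.0) / 2.0 * (p_g_max - p_g_min)
    u = (raw[n_g:2 * n_g] > 0.0).astype(int)       # 0-1 start/stop decision
    p_re = (raw[2 * n_g:] + 1.0) / 2.0 * p_re_max
    return p_g * u, u, p_re

p_g, u, p_re = decode_action([0.2, -0.4, 0.9, -0.8, 0.5, 0.1],
                             p_g_min=[50.0, 40.0], p_g_max=[300.0, 250.0],
                             p_re_max=[45.0, 60.0])
```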
3. Environment design
The Actor of each agent takes an action according to the state at the current moment, interacts with the environment, obtains a reward and transfers to the state at the next moment, while the Critic evaluates the action and guides the agent's action at the next moment. Accordingly, the multi-region power grid model of equations (6)-(14) is taken as the environment: after every agent takes its action at each moment, one power grid load flow calculation is performed, the relevant state quantities of the power grid are fed back for calculating the reward function, the process transfers to the next moment, and this repeats in a loop.
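A minimal environment skeleton for this interaction loop is sketched below; the "load flow" here is a crude surrogate that only tracks the power mismatch and a proxy voltage, standing in for the patent's power grid model and a real AC load flow solver:

```python
import numpy as np

class GridEnvSketch:
    """Illustrative environment skeleton for the loop described above."""

    def __init__(self, load_profile, horizon):
        self.load = load_profile          # shape (horizon, n_loads)
        self.horizon = horizon
        self.t = 0

    def step(self, p_g, p_re):
        mismatch = p_g.sum() + p_re.sum() - self.load[self.t].sum()
        voltage = 1.0 - 0.0005 * mismatch          # surrogate state quantity
        reward = -abs(mismatch)                    # placeholder reward signal
        self.t += 1
        done = self.t >= self.horizon
        next_state = np.array([voltage, self.load[min(self.t, self.horizon - 1)].sum()])
        return next_state, reward, done

env = GridEnvSketch(load_profile=np.full((24, 3), 90.0), horizon=24)
s, r, d = env.step(p_g=np.array([150.0, 80.0]), p_re=np.array([40.0]))
```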
4. Reward function
The reward function affects the convergence of the algorithm to some extent, so the reward signal must convey the desired goal of the agent and thereby guide the agent to improve its actions towards maximizing the reward function. The opposite number of the objective function of the multi-region power grid model is taken as the instant reward F of each agent. The corresponding constraint conditions of the optimization problem must also be satisfied; according to the constraint conditions provided by the invention, if a corresponding variable does not satisfy its constraint, a penalty value r_push is set and combined with the instant reward as the final reward function of the agent, calculated as follows:

R = { F + r_push }    (17)
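A minimal sketch of this reward composition (the coefficient k_penalty and the sample figures are illustrative assumptions):

```python
def reward(thermal_cost, curtail_penalty, violations, k_penalty=100.0):
    """Final reward per the scheme above: the instant reward F is the negative
    of the regional objective (thermal cost + curtailment penalty); r_push adds
    a penalty when constraints are violated."""
    f = -(thermal_cost + curtail_penalty)
    r_push = -k_penalty * violations          # violations: summed constraint excess
    return f + r_push

r = reward(thermal_cost=4200.0, curtail_penalty=350.0, violations=0.0)
```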
Finally, the multi-region power grid collaborative optimization model based on multi-agent deep reinforcement learning is solved with the MADDPG method. In the MADDPG method, the Actor of agent i only needs to acquire its own relevant state information s_i; a_i is the action taken by agent i, r_i is its reward, and θ_i is its own weight parameter. There are N agents, and the observation set x = (s_1, ..., s_N) is the state information of all agents. The Actor continuously updates its own parameters θ_i to maximize the expectation of its own reward, i.e. to obtain a higher Critic evaluation value. The policy update rule of the Actor is:

∇_{θ_i} J(μ_i) = E_{x,a~D} [ ∇_{θ_i} μ_i(a_i | s_i) ∇_{a_i} Q_i^μ(x, a_1, ..., a_N) |_{a_i = μ_i(s_i)} ]    (18)

where Q_i^μ(x, a_1, ..., a_N) is the centralized state-action value function of each agent, learned and updated independently. D = (x, x', a_1, ..., a_N, r_1, ..., r_N) is the experience replay unit storing the experiences of all agents, and a random batch is drawn from it for each training step. To avoid the convergence difficulty caused by selecting a particular action from an action distribution in a continuous space, MADDPG adopts the set μ of continuous deterministic policies of the N agents for the continuous actions P_{i,t}^G and P_{j,t}^re; the 0-1 actions u_{i,t} take random values in the training stage.
The Critic updates its parameters mainly by minimizing the temporal-difference error, with the loss function:

L(θ_i) = E [ ( Q_i^μ(x, a_1, ..., a_N) − y )^2 ],   y = r_i + γ Q_i^{μ'}(x', a_1', ..., a_N') |_{a_j' = μ_j'(s_j)}    (19)

where y is the target value, μ' is the policy set of the target networks, and γ ∈ [0,1] is the discount factor.
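A minimal PyTorch sketch of this temporal-difference update for one agent's centralized critic; the network interfaces and the batch layout are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def critic_loss(critic, target_critic, target_actors, batch, gamma=0.99):
    """Temporal-difference loss of one agent's centralized critic.
    batch = (x, actions, r_i, x_next, next_obs): x concatenates all observations,
    actions concatenates all agents' actions, next_obs is a list per agent."""
    x, actions, r_i, x_next, next_obs = batch
    with torch.no_grad():
        next_actions = torch.cat([mu(o) for mu, o in zip(target_actors, next_obs)], dim=-1)
        y = r_i + gamma * target_critic(x_next, next_actions)   # target value
    q = critic(x, actions)
    return F.mse_loss(q, y)
```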
Finally, the target network periodically copies the parameters from the evaluation network in a soft-update manner according to the following rule:

θ'_i = (1 − τ) θ'_i + τ θ_i    (20)

where θ'_i is the target network parameter of agent i, τ is the soft update coefficient, and τ ≤ 1.
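A minimal PyTorch sketch of this soft update rule (the layer sizes and τ = 0.01 are illustrative):

```python
import torch

@torch.no_grad()
def soft_update(target_net, eval_net, tau=0.01):
    """Blend evaluation-network parameters into the target network,
    implementing θ' = (1 - τ) θ' + τ θ from equation (20)."""
    for p_t, p in zip(target_net.parameters(), eval_net.parameters()):
        p_t.mul_(1.0 - tau).add_(tau * p)

actor, target_actor = torch.nn.Linear(8, 3), torch.nn.Linear(8, 3)
target_actor.load_state_dict(actor.state_dict())   # hard copy at initialization
soft_update(target_actor, actor, tau=0.01)
```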
In a multi-agent environment, the agents interact with the environment simultaneously, so the environment is non-stationary from the viewpoint of each agent. MADDPG therefore proposes a policy-ensemble method: the policy μ_i of agent i consists of a set of K sub-policies, only one sub-policy μ_i^(k) is used in each training round, and the overall reward of the policy set is made the highest over the whole training process. The final policy update of the Actor is therefore the gradient of the ensemble objective averaged over the K sub-policies.
Referring to fig. 3, according to the solution principle of the MADDPG method, the overall algorithm flow of the multi-region power grid collaborative optimization model based on MADDPG in the embodiment of the present invention is as follows:
Step one: each agent synchronizes and initializes its parameters. The optimized scheduling time period T and the number of training rounds M of each agent are set, the initial counters are set to 1, and the agent network parameters θ_i are set randomly at the initial time.
Step two: the environment is initialized. The multi-region power grid collaborative optimization model is loaded into the environment, and the interface files for states and actions in the MADDPG algorithm are set, so that the load flow calculation can be carried out in real time according to the states and actions and the corresponding environment state quantities can be fed back.
Step three: each agent interacts with the environment. Each agent observes the state quantities of its region and takes an action, interacts with the environment and receives the fed-back state quantities to calculate the reward; the agent then performs the state transition to the next moment, observes the state of the next moment, and stores (x, a, r, x') in the experience replay unit D.
Step four: each network parameter is updated. A random batch (x^(k), a^(k), r^(k), x'^(k)) of k samples is drawn from the experience replay unit D, the actor and critic parameters are updated according to equations (18) and (19), and the target network parameters are updated according to equation (20).
Step five: whether the current training round number m reaches the set value M is judged; if yes, the training ends and the result is output and saved; otherwise, the process returns to step two and a new round of training is started.
The model is programmed and solved with simulation software; by comparing with a centralized optimization solution it is verified that the trained model not only copes better with uncertainty but also realizes real-time decision-making for the system, which is beneficial to online application. Compared with single-agent deep reinforcement learning methods, the MADDPG method converges faster, trains to a better result, and handles complex environment problems noticeably better.
Example 2
Referring to fig. 4, a multi-region power grid collaborative optimization system according to an embodiment of the present invention includes:
the observation data collection module 2 is used for collecting the observation data of each intelligent agent region in the multi-region power grid to be optimized;
the collaborative optimization model building module 3 is used for building a multi-region power grid collaborative optimization model containing renewable energy sources on the basis of the observation data;
the reinforcement learning model design module 4 is used for designing the multi-region power grid collaborative optimization model containing renewable energy sources into a reinforcement learning model under a multi-agent environment according to a state space, an action space, an environment and a reward function;
and the model solving module 5 is used for solving the reinforcement learning model in the multi-agent environment and outputting a collaborative optimization result to carry out collaborative optimization on the multi-region power grid.
In a possible implementation manner, the system further comprises an agent partitioning module 1, configured to partition the power grid into agents;
when the power grid is partitioned into agents, the node standard system is divided into different regions, different agents are set according to the regions, and the regional agents serve as decision centers to realize the collaborative optimization operation of the multi-region power grid.
In one possible embodiment, the observation data of each agent region collected by the observation data collection module 2 include: the load data p_L, the actual output power P_{j,t}^re of the renewable energy units, the actual output power P_{i,t}^G of the thermal power generating units, the thermal power generating unit operating cost coefficient c_o, the wind and light curtailment penalty coefficient c_g, the start-up cost coefficient SU_i and the shutdown cost coefficient SD_i.
In one possible embodiment, the collaborative optimization model building module 3 establishes the following objective function:

min Σ_{n=1}^{N} ( C_n^G + C_n^re )

where N is the number of divided regions, C_n^G is the thermal power generating unit cost of region n, and C_n^re is the wind and light curtailment penalty of the renewable energy units of region n. The thermal power generating unit cost C_n^G of region n accumulates, over the M_k thermal power generating units in region n and the T calculation periods, the running cost, the start-up cost and the shutdown cost of each unit; in this term, u_{i,t} is the operating state of thermal power generating unit i in period t (u_{i,t} = 1 indicating that unit i is running, u_{i,t} = 0 indicating that it is shut down); c_o is the thermal power generating unit operating cost coefficient; P_{i,t}^G is the power output by thermal power generating unit i in period t; Δt is the operating time interval; SU_i is the cost of starting thermal power generating unit i once; SD_i is the shutdown cost of thermal power generating unit i; and T_i^off is the shutdown time of thermal power generating unit i. The wind and light curtailment penalty of region n is

C_n^re = Σ_{j=1}^{M_n} Σ_{t=1}^{T} c_g ( P_{j,t}^re,max − P_{j,t}^re ) Δt

where M_n is the number of renewable energy units in region n, T is the calculation time length, c_g is the wind and light curtailment penalty coefficient, P_{j,t}^re is the actual output power of renewable energy unit j in period t, and P_{j,t}^re,max is the upper limit of the output power of renewable energy unit j in period t.
In one possible implementation, the collaborative optimization model building module 3 establishes equality constraints and inequality constraints on the objective function.
The equality constraints comprise the network power flow constraint and the power balance constraint.
The network power flow constraint is expressed as:

P_{G,i} − P_{D,i} = U_i Σ_{j=1}^{N_e} U_j ( G_{ij} cos θ_{ij} + B_{ij} sin θ_{ij} )
Q_{G,i} − Q_{D,i} = U_i Σ_{j=1}^{N_e} U_j ( G_{ij} sin θ_{ij} − B_{ij} cos θ_{ij} )

where P_{G,i}, Q_{G,i} are the active and reactive power injected at node i; P_{D,i}, Q_{D,i} are the active and reactive power consumed by the load at node i; N_e is the number of nodes of the power distribution network; U_i, U_j are the voltage amplitudes of node i and node j; G_{ij}, B_{ij} are the conductance and the susceptance; and θ_{ij} is the phase angle difference between node i and node j.
The power balance constraint is expressed as:

Σ_{j=1}^{M_n} P_{j,t}^re + Σ_{i=1}^{M_k} P_{i,t}^G = Σ_{r=1}^{M_r} P_{r,t}^L

where M_r is the number of loads in region n and P_{r,t}^L is the power of load r in period t.
The inequality constraints comprise the branch current upper limit constraint, the node voltage upper and lower limit constraint, the thermal power generating unit output power upper and lower limit constraint, the thermal power generating unit start-up and shutdown time constraint, the thermal power generating unit ramping constraint, the thermal power generating unit spinning reserve constraint, and the renewable energy unit output power upper and lower limit constraint.
The branch current upper limit constraint is expressed as I_{ij} ≤ I_{ij}^max, where I_{ij} is the amplitude of the current flowing on branch ij and I_{ij}^max is the maximum current amplitude allowed to flow on branch ij.
The node voltage upper and lower limit constraint is expressed as U_i^min ≤ U_i ≤ U_i^max, where U_i^max and U_i^min are the upper and lower limits of the voltage at node i.
The thermal power generating unit output power upper and lower limit constraint is expressed as P_i^G,min ≤ P_{i,t}^G ≤ P_i^G,max, where P_i^G,max and P_i^G,min are the upper and lower limits of the output power of thermal power generating unit i.
The thermal power generating unit start-up and shutdown time constraint requires that the start-up time and the shutdown time of thermal power generating unit i are no less than its minimum start-up time and minimum shutdown time, respectively.
The thermal power generating unit ramping constraint is expressed as −D_n Δt ≤ P_{i,t}^G − P_{i,t−1}^G ≤ U_t Δt, where U_t and D_n are the maximum ramp-up and ramp-down rates of the active power of thermal power generating unit i per unit time.
The thermal power generating unit spinning reserve constraint consists of two expressions, in which the load reserve capacity of node r at time t is typically 5% of the total load and R_w is the output forecast error of renewable energy unit j: the first requires that the sum of the output power upper limits of all units is larger than the sum of all loads plus the maximum positive error, and the second requires that the sum of the output power lower limits of all units is smaller than all loads minus the maximum negative error.
The renewable energy unit output power upper and lower limit constraint is expressed as P_{j,t}^re,min ≤ P_{j,t}^re ≤ P_{j,t}^re,max, where P_{j,t}^re,max and P_{j,t}^re,min are the upper and lower limits of the output power of renewable energy unit j.
In one possible implementation, the reinforcement learning model design module 4 designs the renewable-energy multi-region power grid collaborative optimization model into the reinforcement learning model in the multi-agent environment according to the state space, the action space, the environment and the reward function. The state space variables comprise the load data p_L of each agent region, the actual output power P_{j,t}^re of the renewable energy units, the renewable energy output power upper limit P_{j,t}^re,max, the actual output power P_{i,t}^G of the thermal power generating units, the thermal power generating unit operating cost coefficient c_o, the wind and light curtailment penalty coefficient c_g, the start-up cost coefficient SU_i, the shutdown cost coefficient SD_i and the node voltage U_i; the state space is expressed as

S = { p_L, P_{j,t}^re, P_{j,t}^re,max, P_{i,t}^G, c_o, c_g, SU_i, SD_i, U_i }.

The action space variables comprise the thermal power generating unit output power P_{i,t}^G, the thermal power generating unit start-stop state u_{i,t} and the renewable energy output power P_{j,t}^re; the action space is expressed as

A = { P_{i,t}^G, u_{i,t}, P_{j,t}^re }.

In the multi-region power grid collaborative optimization model containing renewable energy, the equality constraints and inequality constraints established for the objective function are taken as the environment: after every agent takes its action at each moment, one power grid load flow calculation is performed, the relevant state quantities of the power grid are fed back for calculating the reward function, the process transfers to the next moment, and this repeats in a loop.
In the multi-region power grid collaborative optimization model containing renewable energy, the opposite number of the objective function is taken as the instant reward F of each agent. According to the equality constraints and inequality constraints established for the objective function, if a corresponding variable does not satisfy its constraint, a penalty value r_push is set and combined with the instant reward as the final reward function of the agent, expressed as follows:

R = { F + r_push }.
In one possible implementation, the model solving module 5 adopts the MADDPG algorithm to solve the reinforcement learning model in the multi-agent environment. In the MADDPG algorithm, the Actor of agent i obtains its own relevant state information s_i; a_i is the action taken by agent i, r_i is its reward, and θ_i is the weight parameter of the agent. There are N agents, and the observation set x = (s_1, ..., s_N) is the state information of all agents. The Actor continuously updates its own parameters θ_i to maximize the expectation of its own reward, i.e. to obtain a higher Critic evaluation value.
The policy update rule of the Actor is expressed as:

∇_{θ_i} J(μ_i) = E_{x,a~D} [ ∇_{θ_i} μ_i(a_i | s_i) ∇_{a_i} Q_i^μ(x, a_1, ..., a_N) |_{a_i = μ_i(s_i)} ]

where Q_i^μ(x, a_1, ..., a_N) is the centralized state-action value function of each agent, learned and updated independently; D = (x, x', a_1, ..., a_N, r_1, ..., r_N) is the replay unit storing the experiences of all agents, from which a random batch is drawn for each training step. For the continuous actions P_{i,t}^G and P_{j,t}^re, the MADDPG algorithm adopts the set μ of continuous deterministic policies of the N agents; the 0-1 actions u_{i,t} take random values in the training stage.
The Critic updates its parameters by minimizing the temporal-difference error, with the loss function:

L(θ_i) = E [ ( Q_i^μ(x, a_1, ..., a_N) − y )^2 ],   y = r_i + γ Q_i^{μ'}(x', a_1', ..., a_N') |_{a_j' = μ_j'(s_j)}

where y is the target value, μ' is the policy set of the target networks, and γ ∈ [0,1] is the discount factor.
The target network periodically copies parameters from the evaluation network as follows:

θ'_i = (1 − τ) θ'_i + τ θ_i

where θ'_i is the target network parameter of agent i, τ is the soft update coefficient, and τ ≤ 1.
The policy μ_i of agent i consists of a set of K sub-policies; only one sub-policy μ_i^(k) is used in each training round, so that the overall reward of the policy set is the highest over the whole training process, and the final policy update of the Actor is the gradient of the ensemble objective averaged over the K sub-policies.
in one possible embodiment, the step of the model solving module 5 adopting the madpg algorithm to solve the reinforcement learning model in the multi-agent environment specifically includes:
setting an optimized scheduling time period T and the number M of training rounds of each agent, setting initial values to be 1, and randomly setting an agent network parameter theta at initial time i
Loading the multi-region power grid collaborative optimization model into an environment, setting an interface file of states and actions in an MADDPG algorithm, so that load flow calculation can be performed in real time according to the state actions, and feeding back corresponding environment state quantities;
each agent observes the state quantity of the region and takes action, interacts with the environment and feeds back the state quantity to calculate the reward, the agent carries out state transition to enter the next moment, observes the state of the next moment and stores (x, a, r, x') into an experience playback unit D;
randomly sampling k strategy next group (x) from empirical playback unit D (k) ,a (k) ,r (k) ,x′ (k) ) Updating critic and actor parameters and updating target network parameters;
and judging whether the number M of the current training rounds reaches a set value M, if so, ending the training, outputting and storing a result, and if not, restarting a new round of training.
Example 3
Another embodiment of the present invention further provides an electronic device, including:
a memory storing at least one instruction; and
and the processor executes the instructions stored in the memory to realize the multi-region power grid collaborative optimization method.
Example 4
Another embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the method for collaborative optimization of a multi-region power grid according to the present invention is implemented.
The computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable storage medium may include: any entity or device capable of carrying said computer program code, media, usb disk, removable hard disk, magnetic diskette, optical disk, computer memory, read-only memory, random access memory, electrical carrier wave signals, telecommunication signals, software distribution media, etc. It should be noted that the computer-readable medium may contain suitable additions or subtractions depending on the requirements of legislation and patent practice in jurisdictions, for example, in some jurisdictions, computer-readable media may not include electrical carrier signals or telecommunication signals in accordance with legislation and patent practice. For convenience of explanation, the above description only shows the relevant parts of the embodiments of the present invention, and the detailed technical details are not disclosed, please refer to the method parts of the embodiments of the present invention. The computer-readable storage medium is non-transitory, and may be stored in a storage device formed by various electronic devices, and is capable of implementing the execution process described in the method of the embodiment of the present invention.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention has been described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting the same, and although the present invention is described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that: modifications and equivalents may be made to the embodiments of the invention without departing from the spirit and scope of the invention, which is to be covered by the claims.

Claims (18)

1. A multi-region grid collaborative optimization method is characterized by comprising the following steps:
collecting observation data of each intelligent agent region in a multi-region power grid to be optimized;
constructing a multi-region power grid collaborative optimization model containing renewable energy sources on the basis of the observation data;
designing the multi-region power grid collaborative optimization model containing renewable energy sources into a reinforcement learning model under a multi-agent environment according to a state space, an action space, an environment and a reward function;
and solving the reinforcement learning model under the multi-agent environment, and outputting a cooperative optimization result to perform cooperative optimization on the multi-region power grid.
2. The multi-region power grid collaborative optimization method according to claim 1, further comprising a step of dividing a power grid into different regions, wherein when the power grid is divided into different regions, the different regions are set as different intelligent agents according to the regions, and the region intelligent agents are used as decision centers to realize collaborative optimization operation of the multi-region power grid.
3. The multi-region power grid collaborative optimization method according to claim 1, wherein, in the step of collecting observation data of each agent region in the multi-region power grid to be optimized, the observation data of each agent region comprise: the load data p_L, the actual output power P_{j,t}^re of the renewable energy units, the actual output power P_{i,t}^G of the thermal power generating units, the thermal power generating unit operating cost coefficient c_o, the wind and light curtailment penalty coefficient c_g, the start-up cost coefficient SU_i and the shutdown cost coefficient SD_i.
4. The multi-region power grid collaborative optimization method according to claim 3, wherein, in the step of constructing the multi-region power grid collaborative optimization model containing renewable energy, the following objective function is established:

min Σ_{n=1}^{N} ( C_n^G + C_n^re )

wherein N is the number of divided regions, C_n^G is the thermal power generating unit cost of region n, and C_n^re is the wind and light curtailment penalty of the renewable energy units of region n; the thermal power generating unit cost C_n^G of region n accumulates, over the M_k thermal power generating units in region n and the T calculation periods, the running cost, the start-up cost and the shutdown cost of each unit, wherein u_{i,t} is the operating state of thermal power generating unit i in period t, u_{i,t} = 1 indicating that unit i is running and u_{i,t} = 0 indicating that it is shut down, c_o is the thermal power generating unit operating cost coefficient, P_{i,t}^G is the power output by thermal power generating unit i in period t, Δt is the operating time interval, SU_i is the cost of starting thermal power generating unit i once, SD_i is the shutdown cost of thermal power generating unit i, and T_i^off is the shutdown time of thermal power generating unit i; and the wind and light curtailment penalty of region n is

C_n^re = Σ_{j=1}^{M_n} Σ_{t=1}^{T} c_g ( P_{j,t}^re,max − P_{j,t}^re ) Δt

wherein M_n is the number of renewable energy units in region n, T is the calculation time length, c_g is the wind and light curtailment penalty coefficient, P_{j,t}^re is the actual output power of renewable energy unit j in period t, and P_{j,t}^re,max is the upper limit of the output power of renewable energy unit j in period t.
5. The multi-region power grid collaborative optimization method according to claim 4, wherein, in the step of constructing the multi-region power grid collaborative optimization model containing renewable energy, equality constraints and inequality constraints are established for the objective function;
the equality constraints comprise the network power flow constraint and the power balance constraint;
the network power flow constraint is expressed as:

P_{G,i} − P_{D,i} = U_i Σ_{j=1}^{N_e} U_j ( G_{ij} cos θ_{ij} + B_{ij} sin θ_{ij} )
Q_{G,i} − Q_{D,i} = U_i Σ_{j=1}^{N_e} U_j ( G_{ij} sin θ_{ij} − B_{ij} cos θ_{ij} )

wherein P_{G,i}, Q_{G,i} are the active and reactive power injected at node i, P_{D,i}, Q_{D,i} are the active and reactive power consumed by the load at node i, N_e is the number of nodes of the power distribution network, U_i, U_j are the voltage amplitudes of node i and node j, G_{ij}, B_{ij} are the conductance and the susceptance, and θ_{ij} is the phase angle difference between node i and node j;
the power balance constraint is expressed as:

Σ_{j=1}^{M_n} P_{j,t}^re + Σ_{i=1}^{M_k} P_{i,t}^G = Σ_{r=1}^{M_r} P_{r,t}^L

wherein M_r is the number of loads in region n and P_{r,t}^L is the power of load r in period t;
the inequality constraints comprise the branch current upper limit constraint, the node voltage upper and lower limit constraint, the thermal power generating unit output power upper and lower limit constraint, the thermal power generating unit start-up and shutdown time constraint, the thermal power generating unit ramping constraint, the thermal power generating unit spinning reserve constraint and the renewable energy unit output power upper and lower limit constraint;
the branch current upper limit constraint is expressed as I_{ij} ≤ I_{ij}^max, wherein I_{ij} is the amplitude of the current flowing on branch ij and I_{ij}^max is the maximum current amplitude allowed to flow on branch ij;
the node voltage upper and lower limit constraint is expressed as U_i^min ≤ U_i ≤ U_i^max, wherein U_i^max and U_i^min are the upper and lower limits of the voltage at node i;
the thermal power generating unit output power upper and lower limit constraint is expressed as P_i^G,min ≤ P_{i,t}^G ≤ P_i^G,max, wherein P_i^G,max and P_i^G,min are the upper and lower limits of the output power of thermal power generating unit i;
the thermal power generating unit start-up and shutdown time constraint requires that the start-up time and the shutdown time of thermal power generating unit i are no less than its minimum start-up time and minimum shutdown time, respectively;
the thermal power generating unit ramping constraint is expressed as −D_n Δt ≤ P_{i,t}^G − P_{i,t−1}^G ≤ U_t Δt, wherein U_t and D_n are the maximum ramp-up and ramp-down rates of the active power of thermal power generating unit i per unit time;
the thermal power generating unit spinning reserve constraint ensures sufficient reserve with respect to the load reserve capacity of node r at time t and the output forecast error R_w of renewable energy unit j;
the renewable energy unit output power upper and lower limit constraint is expressed as P_{j,t}^re,min ≤ P_{j,t}^re ≤ P_{j,t}^re,max, wherein P_{j,t}^re,max and P_{j,t}^re,min are the upper and lower limits of the output power of renewable energy unit j.
6. The multi-region power grid collaborative optimization method according to claim 1, wherein, in the step of designing the multi-region power grid collaborative optimization model containing renewable energy into the reinforcement learning model in the multi-agent environment according to the state space, the action space, the environment and the reward function, the state space variables comprise the load data p_L of each agent region, the actual output power P_{j,t}^re of the renewable energy units, the renewable energy output power upper limit P_{j,t}^re,max, the actual output power P_{i,t}^G of the thermal power generating units, the thermal power generating unit operating cost coefficient c_o, the wind and light curtailment penalty coefficient c_g, the start-up cost coefficient SU_i, the shutdown cost coefficient SD_i and the node voltage U_i; the state space is expressed as

S = { p_L, P_{j,t}^re, P_{j,t}^re,max, P_{i,t}^G, c_o, c_g, SU_i, SD_i, U_i };

the action space variables comprise the thermal power generating unit output power P_{i,t}^G, the thermal power generating unit start-stop state u_{i,t} and the renewable energy output power P_{j,t}^re; the action space is expressed as

A = { P_{i,t}^G, u_{i,t}, P_{j,t}^re };

in the multi-region power grid collaborative optimization model containing renewable energy, the equality constraints and inequality constraints established for the objective function are taken as the environment, one power flow calculation is performed after every agent takes its action at each moment, the relevant state quantities of the power grid are fed back for calculating the reward function, the process transfers to the next moment, and this repeats in a loop;
in the multi-region power grid collaborative optimization model containing renewable energy, the opposite number of the objective function is taken as the instant reward F of each agent; according to the equality constraints and inequality constraints established for the objective function, if a corresponding variable does not satisfy its constraint, a penalty value r_push is set and, together with the instant reward, serves as the final reward function of the agent, expressed as follows:

R = { F + r_push }.
7. The multi-region power grid collaborative optimization method according to claim 1, wherein, in the step of solving the reinforcement learning model in the multi-agent environment, the MADDPG algorithm is adopted for the solution; in the MADDPG algorithm, the Actor of agent i obtains its own relevant state information s_i, a_i is the action taken by agent i, r_i is its reward, and θ_i is the weight parameter of the agent; there are N agents, and the observation set x = (s_1, ..., s_N) is the state information of all agents; the Actor continuously updates its own parameters θ_i to maximize the expectation of its own reward, i.e. to obtain a higher Critic evaluation value;
the policy update rule of the Actor is expressed as:

∇_{θ_i} J(μ_i) = E_{x,a~D} [ ∇_{θ_i} μ_i(a_i | s_i) ∇_{a_i} Q_i^μ(x, a_1, ..., a_N) |_{a_i = μ_i(s_i)} ]

wherein Q_i^μ(x, a_1, ..., a_N) is the centralized state-action value function of each agent, learned and updated independently; D = (x, x', a_1, ..., a_N, r_1, ..., r_N) is the replay unit storing the experiences of all agents, from which a random batch is drawn for each training step; for the continuous actions P_{i,t}^G and P_{j,t}^re, the MADDPG algorithm adopts the set μ of continuous deterministic policies of the N agents, and the 0-1 actions u_{i,t} take random values in the training stage;
the Critic updates its parameters by minimizing the temporal-difference error, and the loss function of the Critic is expressed as:

L(θ_i) = E [ ( Q_i^μ(x, a_1, ..., a_N) − y )^2 ],   y = r_i + γ Q_i^{μ'}(x', a_1', ..., a_N') |_{a_j' = μ_j'(s_j)}

wherein y is the target value, μ' is the policy set of the target networks, and γ ∈ [0,1] is the discount factor;
the target network periodically copies parameters from the evaluation network as follows:

θ'_i = (1 − τ) θ'_i + τ θ_i

wherein θ'_i is the target network parameter of agent i, τ is the soft update coefficient, and τ ≤ 1;
the policy μ_i of agent i consists of a set of K sub-policies, only one sub-policy μ_i^(k) is used in each training round, the overall reward of the policy set is made the highest over the whole training process, and the final policy update of the Actor is the gradient of the ensemble objective averaged over the K sub-policies.
8. The multi-region power grid collaborative optimization method according to claim 7, wherein the step of solving the reinforcement learning model in the multi-agent environment with the MADDPG algorithm specifically comprises:
setting the optimized scheduling time period T and the number of training rounds M of each agent, setting the initial counters to 1, and randomly setting the agent network parameters θ_i at the initial time;
loading the multi-region power grid collaborative optimization model into the environment, and setting the interface files for states and actions in the MADDPG algorithm, so that the load flow calculation can be carried out in real time according to the states and actions and the corresponding environment state quantities can be fed back;
each agent observing the state quantities of its region and taking an action, interacting with the environment and receiving the fed-back state quantities to calculate the reward; the agent performing the state transition to the next moment, observing the state of the next moment and storing (x, a, r, x') in the experience replay unit D;
randomly sampling a batch (x^(k), a^(k), r^(k), x'^(k)) of k samples from the experience replay unit D, updating the critic and actor parameters and updating the target network parameters;
judging whether the current training round number m reaches the set value M; if yes, ending the training and outputting and saving the result, otherwise restarting a new round of training.
9. A multi-region grid collaborative optimization system, comprising:
the observation data collection module is used for collecting the observation data of each intelligent agent region in the multi-region power grid to be optimized;
the collaborative optimization model building module is used for building a multi-region power grid collaborative optimization model containing renewable energy sources on the basis of the observation data;
the reinforcement learning model design module is used for designing the multi-region power grid collaborative optimization model containing the renewable energy into a reinforcement learning model under a multi-agent environment according to a state space, an action space, an environment and a reward function;
and the model solving module is used for solving the reinforcement learning model in the multi-agent environment and outputting a collaborative optimization result to carry out collaborative optimization on the multi-area power grid.
10. The multi-region power grid collaborative optimization system according to claim 9, further comprising an agent partitioning module configured to partition the power grid into agents;
when the power grid is partitioned into agents, the node standard system is divided into different regions, different agents are set according to the regions, and the regional agents serve as decision centers, so that the collaborative optimization operation of the multi-region power grid is realized.
11. The multi-region power grid collaborative optimization system according to claim 9, wherein the observation data of each agent region collected by the observation data collection module comprise: the load data p_L, the actual output power P_{j,t}^re of the renewable energy units, the actual output power P_{i,t}^G of the thermal power generating units, the thermal power generating unit operating cost coefficient c_o, the wind and light curtailment penalty coefficient c_g, the start-up cost coefficient SU_i and the shutdown cost coefficient SD_i.
12. The multi-region power grid collaborative optimization system according to claim 11, wherein the collaborative optimization model building module establishes the following objective function:

min Σ_{n=1}^{N} ( C_n^G + C_n^re )

wherein N is the number of divided regions, C_n^G is the thermal power generating unit cost of region n, and C_n^re is the wind and light curtailment penalty of the renewable energy units of region n; the thermal power generating unit cost C_n^G of region n accumulates, over the M_k thermal power generating units in region n and the T calculation periods, the running cost, the start-up cost and the shutdown cost of each unit, wherein u_{i,t} is the operating state of thermal power generating unit i in period t, u_{i,t} = 1 indicating that unit i is running and u_{i,t} = 0 indicating that it is shut down, c_o is the thermal power generating unit operating cost coefficient, P_{i,t}^G is the power output by thermal power generating unit i in period t, Δt is the operating time interval, SU_i is the cost of starting thermal power generating unit i once, SD_i is the shutdown cost of thermal power generating unit i, and T_i^off is the shutdown time of thermal power generating unit i; and the wind and light curtailment penalty of region n is

C_n^re = Σ_{j=1}^{M_n} Σ_{t=1}^{T} c_g ( P_{j,t}^re,max − P_{j,t}^re ) Δt

wherein M_n is the number of renewable energy units in region n, T is the calculation time length, c_g is the wind and light curtailment penalty coefficient, P_{j,t}^re is the actual output power of renewable energy unit j in period t, and P_{j,t}^re,max is the upper limit of the output power of renewable energy unit j in period t.
13. The multi-region power grid collaborative optimization system according to claim 12, wherein the collaborative optimization model building module establishes equality constraints and inequality constraints on the objective function;
the equality constraints comprise the network power flow constraint and the power balance constraint;
the network power flow constraint is expressed as:

P_{G,i} − P_{D,i} = U_i Σ_{j=1}^{N_e} U_j ( G_{ij} cos θ_{ij} + B_{ij} sin θ_{ij} )
Q_{G,i} − Q_{D,i} = U_i Σ_{j=1}^{N_e} U_j ( G_{ij} sin θ_{ij} − B_{ij} cos θ_{ij} )

wherein P_{G,i}, Q_{G,i} are the active and reactive power injected at node i, P_{D,i}, Q_{D,i} are the active and reactive power consumed by the load at node i, N_e is the number of nodes of the power distribution network, U_i, U_j are the voltage amplitudes of node i and node j, G_{ij}, B_{ij} are the conductance and the susceptance, and θ_{ij} is the phase angle difference between node i and node j;
the power balance constraint is expressed as:

Σ_{j=1}^{M_n} P_{j,t}^re + Σ_{i=1}^{M_k} P_{i,t}^G = Σ_{r=1}^{M_r} P_{r,t}^L

wherein M_r is the number of loads in region n and P_{r,t}^L is the power of load r in period t;
the inequality constraints comprise the branch current upper limit constraint, the node voltage upper and lower limit constraint, the thermal power generating unit output power upper and lower limit constraint, the thermal power generating unit start-up and shutdown time constraint, the thermal power generating unit ramping constraint, the thermal power generating unit spinning reserve constraint and the renewable energy unit output power upper and lower limit constraint;
the branch current upper limit constraint is expressed as I_{ij} ≤ I_{ij}^max, wherein I_{ij} is the amplitude of the current flowing on branch ij and I_{ij}^max is the maximum current amplitude allowed to flow on branch ij;
the node voltage upper and lower limit constraint is expressed as U_i^min ≤ U_i ≤ U_i^max, wherein U_i^max and U_i^min are the upper and lower limits of the voltage at node i;
the thermal power generating unit output power upper and lower limit constraint is expressed as P_i^G,min ≤ P_{i,t}^G ≤ P_i^G,max, wherein P_i^G,max and P_i^G,min are the upper and lower limits of the output power of thermal power generating unit i;
the thermal power generating unit start-up and shutdown time constraint requires that the start-up time and the shutdown time of thermal power generating unit i are no less than its minimum start-up time and minimum shutdown time, respectively;
the thermal power generating unit ramping constraint is expressed as −D_n Δt ≤ P_{i,t}^G − P_{i,t−1}^G ≤ U_t Δt, wherein U_t and D_n are the maximum ramp-up and ramp-down rates of the active power of thermal power generating unit i per unit time;
the thermal power generating unit spinning reserve constraint ensures sufficient reserve with respect to the load reserve capacity of node r at time t and the output forecast error R_w of renewable energy unit j;
the renewable energy unit output power upper and lower limit constraint is expressed as P_{j,t}^re,min ≤ P_{j,t}^re ≤ P_{j,t}^re,max, wherein P_{j,t}^re,max and P_{j,t}^re,min are the upper and lower limits of the output power of renewable energy unit j.
14. The multi-region grid collaborative optimization system according to claim 9, wherein the reinforcement learning model design module is configured to design the renewable energy-containing multi-region grid collaborative optimization model as a reinforcement learning model in a multi-agent environment according to a state space, an action space, an environment and a reward function, wherein the state space variables include load data p of each agent region L Actual output power of renewable energy
Figure FDA0003843564820000099
Renewable energy output power upper limit
Figure FDA00038435648200000910
Actual output power of thermal power generating unit
Figure FDA00038435648200000911
Operating cost coefficient c of thermal power generating unit o Cost coefficient c of abandoned wind and abandoned light g Cost coefficient of starter SU i Shutdown cost factor SD i And node voltage U i (ii) a The expression of the state space is as follows:
Figure FDA00038435648200000912
the action space variables include: the output power of the thermal power generating unit $P_{G,t}$, the start-stop state of the thermal power generating unit $u_{G,t}$, and the output power of renewable energy $P_{W,t}$; the expression of the action space is as follows:

$A = \{P_{G,t},\; u_{G,t},\; P_{W,t}\}$
in the multi-region power grid collaborative optimization model containing renewable energy, the equality constraints and inequality constraints established for the objective function serve as the environment; after each agent takes an action at each time step, one power flow calculation is performed, the relevant grid state quantities are fed back and used to calculate the reward function, and the process transfers to the next time step, repeating in this cycle;
in the multi-region power grid collaborative optimization model containing renewable energy, the negative of the objective function is used as the instant reward of each agent; according to the equality constraints and inequality constraints established for the objective function, if the corresponding variables do not satisfy the constraints, a penalty value $r_{\mathrm{push}}$ is set and, together with the instant reward, forms the final reward function of the agent, whose expression is as follows:

$R = \{F + r_{\mathrm{push}}\}$.
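For illustration only (not part of the claims), a minimal sketch of the state vector and reward construction described in this claim; the variable names, the flattening of the observation into one vector, and the penalty value are assumptions.

```python
# Illustrative sketch only: claim-14 state/reward construction with hypothetical names.
import numpy as np

def build_state(load, p_w, p_w_max, p_g, c_o, c_g, su, sd, u_node):
    """Concatenate the per-region observation S = {p_L, P_W, P_W_max, P_G, c_o, c_g, SU, SD, U}."""
    return np.concatenate([np.atleast_1d(x).ravel()
                           for x in (load, p_w, p_w_max, p_g, c_o, c_g, su, sd, u_node)])

def reward(objective_value, constraints_ok, penalty=-100.0):
    """Instant reward is the negative of the objective; add a penalty r_push if any constraint is violated."""
    r = -objective_value
    if not constraints_ok:
        r += penalty  # r_push (hypothetical value)
    return r

# Usage example with made-up numbers for a single region.
s = build_state(load=np.array([0.8, 0.6]), p_w=0.3, p_w_max=0.5, p_g=np.array([0.6, 0.5]),
                c_o=25.0, c_g=40.0, su=np.array([5.0, 6.0]), sd=np.array([3.0, 4.0]),
                u_node=np.array([1.0, 0.98]))
print(s.shape, reward(objective_value=120.0, constraints_ok=False))
```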
15. The multi-region grid collaborative optimization system according to claim 9, wherein the model solving module adopts the MADDPG algorithm to solve the reinforcement learning model in the multi-agent environment, in which the Actor of agent $i$ obtains its own relevant state information $s_i$, $a_i$ is the action taken by agent $i$, $r_i$ is its reward, and $\theta_i$ is the weight parameter of the agent; there are $N$ agents, and the observation set $x = (s_1, \ldots, s_N)$ is the state information of all agents; the Actor continuously updates its own parameter $\theta_i$ to maximize the expectation of its own reward, i.e., to obtain a higher Critic evaluation value;
the policy update rule expression of the Actor is as follows:

$\nabla_{\theta_i} J(\mu_i) = \mathbb{E}_{x, a \sim D}\left[\nabla_{\theta_i} \mu_i(a_i \mid s_i)\, \nabla_{a_i} Q_i^{\mu}(x, a_1, \ldots, a_N)\big|_{a_i = \mu_i(s_i)}\right]$

where $Q_i^{\mu}(x, a_1, \ldots, a_N)$ is the centralized state-action value function of each agent, which is learned and updated independently; $D = (x, x', a_1, \ldots, a_N, r_1, \ldots, r_N)$ is the experience replay buffer storing the experiences of all agents, from which one group of samples is randomly selected for training at each training step; for the continuous actions $P_{G,t}$ and $P_{W,t}$, the MADDPG algorithm adopts the set $\mu$ of continuous deterministic policies of the $N$ agents; for the 0-1 action $u_{G,t}$, random values are taken in the training stage;
the Critic updates its parameters by minimizing the temporal-difference error, and the loss function expression of the Critic is:

$L(\theta_i) = \mathbb{E}_{x, a, r, x'}\left[\left(Q_i^{\mu}(x, a_1, \ldots, a_N) - y\right)^2\right], \qquad y = r_i + \gamma\, Q_i^{\mu'}(x', a_1', \ldots, a_N')\big|_{a_j' = \mu_j'(s_j')}$

where $\mu'$ is the policy set of the target network and $\gamma \in [0, 1]$ is the discount factor;
the target network periodically copies the parameters from the evaluation network, with the following rules:
$\theta_i' = (1 - \tau)\,\theta_i' + \tau\,\theta_i$

where $\theta_i'$ is the target network parameter of agent $i$, $\tau$ is the soft update coefficient, and $\tau \le 1$;
let the policy $\mu_i$ of agent $i$ consist of a set of $K$ sub-policies, only one sub-policy $\mu_i^{(k)}$ being used in each training episode; over the whole training process the goal is to maximize the overall reward of the policy ensemble, and the final policy of the Actor is updated as follows:

$\nabla_{\theta_i^{(k)}} J_e(\mu_i) = \dfrac{1}{K}\,\mathbb{E}_{x, a \sim D_i^{(k)}}\left[\nabla_{\theta_i^{(k)}} \mu_i^{(k)}(a_i \mid s_i)\, \nabla_{a_i} Q^{\mu_i}(x, a_1, \ldots, a_N)\big|_{a_i = \mu_i^{(k)}(s_i)}\right]$
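For illustration only (not part of the claims), a hedged PyTorch sketch of the MADDPG updates described in this claim (centralized Critic, decentralized Actors, soft target update); the network sizes, hyperparameters and replay-batch format are assumptions, and the 0-1 start-stop action is not handled separately here.

```python
# Illustrative sketch only: one MADDPG update step with hypothetical shapes and names.
import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, in_dim, out_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, out_dim))
    def forward(self, x):
        return self.net(x)

def soft_update(target, source, tau=0.01):
    # theta' <- (1 - tau) * theta' + tau * theta
    for tp, p in zip(target.parameters(), source.parameters()):
        tp.data.mul_(1.0 - tau).add_(tau * p.data)

def maddpg_update(i, actors, critics, target_actors, target_critics,
                  actor_opts, critic_opts, batch, gamma=0.95):
    """One update of agent i's Critic and Actor from a sampled batch.

    batch: dict with per-agent lists of tensors obs, next_obs, act, and rew[i].
    """
    n = len(actors)
    obs, next_obs, act = batch["obs"], batch["next_obs"], batch["act"]
    # Centralized Critic input: all observations and all actions concatenated.
    with torch.no_grad():
        next_act = [target_actors[j](next_obs[j]) for j in range(n)]
        q_next = target_critics[i](torch.cat(next_obs + next_act, dim=-1))
        y = batch["rew"][i] + gamma * q_next            # TD target
    q = critics[i](torch.cat(obs + act, dim=-1))
    critic_loss = nn.functional.mse_loss(q, y)          # minimize the TD error
    critic_opts[i].zero_grad(); critic_loss.backward(); critic_opts[i].step()

    # Actor update: ascend the centralized Q w.r.t. agent i's own action.
    act_i = actors[i](obs[i])
    joint_act = [act[j] if j != i else act_i for j in range(n)]
    actor_loss = -critics[i](torch.cat(obs + joint_act, dim=-1)).mean()
    actor_opts[i].zero_grad(); actor_loss.backward(); actor_opts[i].step()

    # Soft update of the target networks: theta' <- (1 - tau) theta' + tau theta
    soft_update(target_critics[i], critics[i]); soft_update(target_actors[i], actors[i])
```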
16. The multi-region grid collaborative optimization system according to claim 15, wherein the step of the model solving module adopting the MADDPG algorithm to solve the reinforcement learning model in the multi-agent environment specifically comprises:
setting the optimized scheduling time period T and the number M of training rounds of each agent, initializing the time step and training round counters to 1, and randomly initializing each agent's network parameter $\theta_i$ at the initial time;
loading the multi-region power grid collaborative optimization model into the environment and setting an interface file of states and actions in the MADDPG algorithm, so that the power flow calculation can be performed in real time according to the states and actions and the corresponding environment state quantities are fed back;
each agent observing the state quantities of its own region and taking an action, interacting with the environment, which feeds back state quantities used to calculate the reward; the agent then undergoing a state transition into the next time step, observing the state at the next time step, and storing $(x, a, r, x')$ in the experience replay buffer D;
randomly sampling a group of samples $(x^{(k)}, a^{(k)}, r^{(k)}, x'^{(k)})$ from the experience replay buffer D, updating the Critic and Actor parameters, and updating the target network parameters;
and judging whether the current number of training rounds m has reached the set value M; if so, ending the training and outputting and saving the result, and if not, starting a new round of training (see the outline sketched below).
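For illustration only (not part of the claims), a hedged outline of the training procedure of this claim; the environment object, the agent objects and all hyperparameter values are hypothetical stand-ins, and each agent's update step is assumed to follow the MADDPG sketch shown after claim 15.

```python
# Illustrative sketch only: training-loop outline with hypothetical env/agent objects.
import random
from collections import deque

T = 24          # optimized scheduling period, e.g. 24 hourly steps (assumed value)
M = 500         # number of training rounds per agent (assumed value)
replay = deque(maxlen=100_000)   # experience replay buffer D

def train(env, agents, batch_size=64):
    for m in range(1, M + 1):                      # training rounds
        x = env.reset()                            # joint observation of all agents
        for t in range(1, T + 1):                  # scheduling time steps
            a = [ag.act(x[i]) for i, ag in enumerate(agents)]
            x_next, r = env.step(a)                # power flow runs inside env.step
            replay.append((x, a, r, x_next))       # store (x, a, r, x')
            x = x_next
            if len(replay) >= batch_size:
                batch = random.sample(list(replay), batch_size)
                for i, ag in enumerate(agents):
                    ag.update(i, batch)            # Critic/Actor + target-network update
    return agents                                   # training ends after M rounds
```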
17. An electronic device, comprising:
a memory storing at least one instruction; and
a processor executing instructions stored in the memory to implement the multi-region grid collaborative optimization method of any of claims 1 to 8.
18. A computer-readable storage medium storing a computer program, wherein the computer program when executed by a processor implements the multi-region grid collaborative optimization method according to any of claims 1 to 8.
CN202211109903.4A 2022-09-13 2022-09-13 Multi-region power grid collaborative optimization method, system, equipment and readable storage medium Pending CN115333111A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211109903.4A CN115333111A (en) 2022-09-13 2022-09-13 Multi-region power grid collaborative optimization method, system, equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211109903.4A CN115333111A (en) 2022-09-13 2022-09-13 Multi-region power grid collaborative optimization method, system, equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN115333111A true CN115333111A (en) 2022-11-11

Family

ID=83929270

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211109903.4A Pending CN115333111A (en) 2022-09-13 2022-09-13 Multi-region power grid collaborative optimization method, system, equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN115333111A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116436029A (en) * 2023-03-13 2023-07-14 华北电力大学 New energy station frequency control method based on deep reinforcement learning
CN116436029B (en) * 2023-03-13 2023-12-01 华北电力大学 New energy station frequency control method based on deep reinforcement learning

Similar Documents

Publication Publication Date Title
CN112615379B (en) Power grid multi-section power control method based on distributed multi-agent reinforcement learning
CN114725936B (en) Power distribution network optimization method based on multi-agent deep reinforcement learning
CN114091879A (en) Multi-park energy scheduling method and system based on deep reinforcement learning
Zhang et al. MOEA/D-based probabilistic PBI approach for risk-based optimal operation of hybrid energy system with intermittent power uncertainty
Xi et al. A virtual generation ecosystem control strategy for automatic generation control of interconnected microgrids
CN114243797A (en) Distributed power supply optimal scheduling method, system, equipment and storage medium
CN117057553A (en) Deep reinforcement learning-based household energy demand response optimization method and system
CN114123273A (en) Control method and system of wind power-photovoltaic-energy storage combined system
Dong et al. Optimal scheduling framework of electricity-gas-heat integrated energy system based on asynchronous advantage actor-critic algorithm
Ebell et al. Reinforcement learning control algorithm for a pv-battery-system providing frequency containment reserve power
CN115333111A (en) Multi-region power grid collaborative optimization method, system, equipment and readable storage medium
CN115795992A (en) Park energy Internet online scheduling method based on virtual deduction of operation situation
CN117039981A (en) Large-scale power grid optimal scheduling method, device and storage medium for new energy
CN115907232A (en) Regional comprehensive energy system cluster collaborative optimization method, system, equipment and medium
CN115313520A (en) Distributed energy system game optimization scheduling method, system, equipment and medium
Nourianfar et al. Economic emission dispatch considering electric vehicles and wind power using enhanced multi-objective exchange market algorithm
CN117439184A (en) Wind power station control method and system based on reinforcement learning
CN115133540B (en) Model-free real-time voltage control method for power distribution network
CN114285093B (en) Source network charge storage interactive scheduling method and system
CN116544995A (en) Cloud edge cooperation-based energy storage battery consistency charge and discharge control method and system
CN115579910A (en) Micro-grid frequency and voltage control method and terminal
CN114048576B (en) Intelligent control method for energy storage system for stabilizing power transmission section tide of power grid
CN115860180A (en) Power grid multi-time scale economic dispatching method based on consistency reinforcement learning algorithm
CN115115276A (en) Virtual power plant scheduling method and system considering uncertainty and privacy protection
CN116451880B (en) Distributed energy optimization scheduling method and device based on hybrid learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination